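Original: http://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
This is a short introduction to pandas, aimed mainly at new users. More complex recipes are in the Cookbook. By convention, we import as follows:
- Import the pandas and numpy modules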
import pandas as pd
import numpy as np
I. Object Creation
- Create a Series by passing a list of values, letting pandas create a default integer index:
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
- Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
- Create a DataFrame by passing a dict of objects that can be converted to Series-like structures:
>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes.
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
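The dict-based construction above can be checked on a smaller frame. This is a minimal sketch; the variable name `frame` and the sample values are illustrative and do not come from the original. It shows that a scalar value is repeated on every row while each array-like value keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Scalars are broadcast to the length set by the array-like values.
frame = pd.DataFrame({
    "A": 1.0,                                         # scalar, repeated on every row
    "B": pd.Series([10, 20, 30], dtype="int64"),      # explicit integer dtype
    "C": np.array([0.5, 1.5, 2.5], dtype="float32"),  # NumPy array keeps float32
    "D": pd.Categorical(["x", "y", "x"]),             # categorical column
})

print(frame.shape)
print(frame.dtypes)
```

Each column reports its own dtype, which is the per-column behavior the `df2.dtypes` output illustrates.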
II. Viewing Data
See the Basics section.
1. View the first and last n rows
>>> df.head() # df.head(n) shows the first n rows of df; n defaults to 5
A B C D
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
>>> df.tail(3) # df.tail(n) shows the last n rows of df; n defaults to 5
A B C D
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
2. View the index and column labels
>>> df.index # view the index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns # view the column labels
Index(['A', 'B', 'C', 'D'], dtype='object')
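3. DataFrame.to_numpy
DataFrame.to_numpy() returns a NumPy representation of the underlying data.
Note: when the DataFrame has columns with different data types, this can be an expensive operation. This comes down to a fundamental difference between pandas and NumPy: a NumPy array has one dtype for the entire array, while a pandas DataFrame has one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object, and that is slow.
>>> df.to_numpy() # all columns of df are floating-point, so DataFrame.to_numpy() is fast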
array([[ 0.74516421, -1.07112293, 0.07509773, -0.45194753],
[ 0.77663096, -0.46546163, 0.27268217, 0.32562184],
[-0.10357239, -0.29129193, -0.84571597, -0.69860938],
[ 0.65574123, -0.73156903, -0.72771026, -1.11328066],
[-0.52707953, -0.81557927, -0.29516127, -1.16619946],
[-1.71795821, -1.07791279, -0.41672553, 0.190645 ]])
>>>
>>> df2.to_numpy() # df2 has columns with several different dtypes, so DataFrame.to_numpy() is relatively expensive
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: the output of DataFrame.to_numpy() does not include the index or the column labels.
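The dtype resolution described above can be observed directly. The following is a small sketch with made-up frames (`numeric`, `int_and_float`, and `mixed` are illustrative names, not part of the original) that compares the dtype of the resulting arrays.

```python
import numpy as np
import pandas as pd

# All-float columns share one dtype, so the array is float64.
numeric = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Integer and float columns can both be held by float64.
int_and_float = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# A float column next to a string column has no common numeric dtype,
# so pandas falls back to object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

numeric_arr = numeric.to_numpy()
int_and_float_arr = int_and_float.to_numpy()
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype, int_and_float_arr.dtype, mixed_arr.dtype)
```

Only the values appear in the arrays; the index and the column labels are dropped.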
4. View a quick statistical summary of the data
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.028512 -0.742156 -0.322922 -0.485628
std 0.982188 0.318212 0.438155 0.635467
min -1.717958 -1.077913 -0.845716 -1.166199
25% -0.421203 -1.007237 -0.649964 -1.009613
50% 0.276084 -0.773574 -0.355943 -0.575278
75% 0.722808 -0.531988 -0.017467 0.029997
max 0.776631 -0.291292 0.272682 0.325622
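Because `df` holds random numbers, its summary is hard to verify by eye. As a rough sketch on a fixed column (the frame `frame` and its values are illustrative), the rows of `describe()` can be read back by label:

```python
import pandas as pd

frame = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})

# describe() returns a DataFrame indexed by statistic name.
stats = frame.describe()

print(stats.loc["count", "v"])  # number of non-missing values
print(stats.loc["mean", "v"])   # arithmetic mean
print(stats.loc["50%", "v"])    # median
```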
5. Transpose
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.745164 0.776631 -0.103572 0.655741 -0.527080 -1.717958
B -1.071123 -0.465462 -0.291292 -0.731569 -0.815579 -1.077913
C 0.075098 0.272682 -0.845716 -0.727710 -0.295161 -0.416726
D -0.451948 0.325622 -0.698609 -1.113281 -1.166199 0.190645
6. Sort by axis
>>> # axis=1 sorts by column labels; the default axis=0 sorts by the index
>>> # ascending=False sorts in descending order; the default ascending=True sorts in ascending order
>>> df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 -0.451948 0.075098 -1.071123 0.745164
2013-01-02 0.325622 0.272682 -0.465462 0.776631
2013-01-03 -0.698609 -0.845716 -0.291292 -0.103572
2013-01-04 -1.113281 -0.727710 -0.731569 0.655741
2013-01-05 -1.166199 -0.295161 -0.815579 -0.527080
2013-01-06 0.190645 -0.416726 -1.077913 -1.717958
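7. Sort by values
>>> # by: the label to sort by. With axis=0, by is a column label; with axis=1, by is an index label
>>> # ascending: whether to sort in ascending or descending order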
>>> df.sort_values(by='B', axis=0, ascending=False)
A B C D
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
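The two sorting methods are easier to compare on a small frame with known values. This is a minimal sketch; `scores` and its labels are invented for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    {"b": [2, 9, 5], "a": [7, 1, 4]},
    index=["r2", "r0", "r1"],
)

by_row_label = scores.sort_index()          # rows ordered by index label
by_col_label = scores.sort_index(axis=1)    # columns ordered by column label
by_value_desc = scores.sort_values(by="b", ascending=False)  # rows ordered by column b

print(by_row_label)
print(by_col_label)
print(by_value_desc)
```

Both methods return a new DataFrame; `scores` itself keeps its original order.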
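III. Selecting Data
Note: while standard Python/NumPy expressions for selecting and setting are intuitive and handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc.
For more advanced indexing, see the documentation sections Indexing and Selecting Data and MultiIndex / Advanced Indexing.
1. Selecting with square brackets
>>> df['A'] # select a single column; equivalent to df.A; returns a Series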
2013-01-01 0.847323
2013-01-02 -0.638694
2013-01-03 -0.859098
2013-01-04 1.425489
2013-01-05 -1.398528
2013-01-06 1.047252
Freq: D, Name: A, dtype: float64
>>> df[0:3] # slice rows by position; here equivalent to df['2013-01-01':'2013-01-03']
A B C D
2013-01-01 0.847323 1.242429 -0.339945 -1.625278
2013-01-02 -0.638694 1.303991 1.299221 0.980656
2013-01-03 -0.859098 -0.858955 -0.492993 -0.236749
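One detail worth seeing in isolation is that a positional slice excludes its end point while a label slice includes it. A small sketch, assuming a four-row frame named `frame` that is not part of the original:

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=4)
frame = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=idx)

col = frame["A"]                              # a single column, returned as a Series
first_two = frame[0:2]                        # positional row slice, end excluded
by_label = frame["2013-01-01":"2013-01-03"]   # label row slice, end included

print(type(col).__name__)
print(len(first_two), len(by_label))
```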
2. Selection by label
See more in Selection by Label. Usage: df.loc[row labels, column labels]
>>> df.loc[:, ['A','C']] # select columns A and C
A C
2013-01-01 0.847323 -0.339945
2013-01-02 -0.638694 1.299221
2013-01-03 -0.859098 -0.492993
2013-01-04 1.425489 -0.460793
2013-01-05 -1.398528 0.309689
2013-01-06 1.047252 -1.635620
>>> df.loc[:, 'A':'C'] # select columns A through C (both endpoints included)
A B C
2013-01-01 0.847323 1.242429 -0.339945
2013-01-02 -0.638694 1.303991 1.299221
2013-01-03 -0.859098 -0.858955 -0.492993
2013-01-04 1.425489 0.687216 -0.460793
2013-01-05 -1.398528 1.607066 0.309689
2013-01-06 1.047252 0.131745 -1.635620
>>> df.loc[dates[0]] # select the row labeled 2013-01-01
A 0.847323
B 1.242429
C -0.339945
D -1.625278
Name: 2013-01-01 00:00:00, dtype: float64
>>> df.loc[dates[0:2],'A':'C'] # select rows '2013-01-01' to '2013-01-02', columns A through C
A B C
2013-01-01 0.847323 1.242429 -0.339945
2013-01-02 -0.638694 1.303991 1.299221
>>> df.loc[dates[0],'C'] # select the value in the first row, column C; returns a scalar
-0.3399446074291146
>>> df.at[dates[0],'C'] # equivalent to df.loc[dates[0],'C'], but df.at is faster and accesses only a single scalar
-0.3399446074291146
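The label-based calls above can be reproduced on fixed data so the results are predictable. This is a sketch with an illustrative three-row frame; `frame`, `idx`, and the values are assumptions made for the example.

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=3)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0], "C": [7.0, 8.0, 9.0]},
    index=idx,
)

row = frame.loc[idx[0]]                  # one row, returned as a Series
block = frame.loc[idx[0:2], "A":"B"]     # two rows, columns A through B inclusive
value_loc = frame.loc[idx[0], "C"]       # scalar via .loc
value_at = frame.at[idx[0], "C"]         # same scalar via .at

print(row)
print(block)
print(value_loc, value_at)
```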
3. Selection by position
See more in Selection by Position. Usage: df.iloc[row positions, column positions]
>>> df.iloc[1] # select the second row by position; equivalent to df.iloc[1, :]
A -0.638694
B 1.303991
C 1.299221
D 0.980656
Name: 2013-01-02 00:00:00, dtype: float64
>>> df.iloc[0:2, [0,1,3]] # select rows 1 and 2, columns 1, 2 and 4
A B D
2013-01-01 0.847323 1.242429 -1.625278
2013-01-02 -0.638694 1.303991 0.980656
>>> df.iloc[1,0] # select the value at row 2, column 1
-0.638694268332326
>>> df.iat[1,0] # equivalent to df.iloc[1,0], but df.iat is faster and accesses only a single scalar
-0.638694268332326
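Positions are zero-based, which is why `iloc[1]` returns the second row. A short sketch on fixed integers (the frame `frame` is illustrative, not from the original):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "D": [7, 8, 9]})

second_row = frame.iloc[1]            # row at position 1, as a Series
corner = frame.iloc[0:2, [0, 2]]      # rows 0-1, columns at positions 0 and 2
value_iloc = frame.iloc[1, 0]         # scalar via .iloc
value_iat = frame.iat[1, 0]           # same scalar via .iat

print(second_row.tolist())
print(corner)
print(value_iloc, value_iat)
```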
4. Boolean indexing
- Use a single column's values to select data.
>>> df[df.A > 0]
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
- Select values from a DataFrame where a boolean condition is met.
>>> df[df > 0]
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
- Use the isin() method for filtering:
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
>>> df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
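The three filtering styles differ in what they return: a row mask drops rows, an element-wise mask keeps the shape and fills NaN, and `isin()` builds a row mask from membership. A minimal sketch with invented data (`frame` and its values are assumptions):

```python
import pandas as pd

frame = pd.DataFrame({
    "A": [1.5, -2.0, 0.5],
    "B": [-1.0, 3.0, -4.0],
    "E": ["one", "two", "three"],
})

positive_a = frame[frame["A"] > 0]                # keeps only rows where A > 0

numeric = frame[["A", "B"]]
masked = numeric[numeric > 0]                     # same shape, NaN where the condition fails

picked = frame[frame["E"].isin(["two", "four"])]  # rows whose E is in the given list

print(positive_a)
print(masked)
print(picked)
```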
5. Setting values
- Setting a new column automatically aligns the data by index.
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
>>> df
A B C D F
2013-01-01 0.147348 1.196578 -0.830143 -0.819528 NaN
2013-01-02 -0.323415 0.273535 0.591451 -0.455048 1.0
2013-01-03 -0.915444 1.683675 -1.005351 0.036724 2.0
2013-01-04 -0.133637 0.604580 -1.344744 0.654401 3.0
2013-01-05 1.631007 0.524544 -1.103851 -0.834705 4.0
2013-01-06 -1.668588 1.298758 -1.557362 -1.177257 5.0
- Set a value by label:
>>> df.at[dates[0], 'A'] = 0
- Set a value by position:
>>> df.iat[0, 1] = 0
- Set values by assigning a NumPy array:
>>> df.loc[:, 'D'] = np.array([5] * len(df))
- The result of the preceding set operations:
>>> df
A B C D F
2013-01-01 0.000000 0.000000 -0.830143 5 NaN
2013-01-02 -0.323415 0.273535 0.591451 5 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0
2013-01-05 1.631007 0.524544 -1.103851 5 4.0
2013-01-06 -1.668588 1.298758 -1.557362 5 5.0
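Index alignment during assignment is the part that most often surprises new users. The sketch below uses a hypothetical three-row frame and a Series whose index is shifted by one day, so one row has no match and one Series value has no row.

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=3)
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=idx)

# The Series starts one day later than the frame's index.
shifted = pd.Series([10, 20, 30], index=pd.date_range("2013-01-02", periods=3))

# Values are matched by index label: 2013-01-01 gets NaN,
# and the value for 2013-01-04 is dropped because no such row exists.
frame["F"] = shifted

frame.at[idx[0], "A"] = 0.0    # set by label
frame.iat[1, 0] = -1.0         # set by position (row 1, column 0)

print(frame)
```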
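IV. Missing Data
pandas primarily uses the value np.nan to represent missing data. By default it is not included in computations. See the Missing Data section.
- Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.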
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.iloc[0:2, -1] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 -0.830143 5 NaN 1.0
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0 NaN
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0 NaN
- Drop any rows that have missing data:
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
- Fill in missing data:
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 -0.830143 5 5.0 1.0
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0 5.0
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0 5.0
- Get the boolean mask of where values are nan:
>>> pd.isna(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
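The missing-data helpers above all return new objects and leave the original frame untouched. This is a minimal sketch on a fixed frame (`frame` and its values are illustrative) that also shows NaN being skipped in a computation by default.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 5.0, 6.0],
})

complete = frame.dropna(how="any")   # keeps only rows with no missing values
filled = frame.fillna(value=0)       # replaces every NaN with 0
mask = pd.isna(frame)                # True where a value is missing

# NaN is excluded from the mean by default: (1.0 + 3.0) / 2
x_mean = frame["x"].mean()

print(complete)
print(filled)
print(mask)
print(x_mean)
```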