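Original: http://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
This is a short introduction to pandas, aimed mainly at new users. You can find more complex recipes in the Cookbook. Customarily, we import as follows:
- Import the pandas and numpy modules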
import pandas as pd
import numpy as np
I. Object creation
- Create a Series by passing a list of values, letting pandas create a default integer index:
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
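The float64 dtype above comes from the missing value: np.nan is a float, so the integers are upcast. A minimal sketch of that behaviour (the variable names are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

# Without missing values, a list of ints produces an integer Series.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so including it upcasts the whole Series to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)            # int64
print(with_nan.dtype)        # float64

# The default index simply counts from 0 to len - 1.
print(list(with_nan.index))  # [0, 1, 2, 3, 4, 5]
```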
- Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
- Create a DataFrame by passing a dict of objects that can be converted to a Series-like structure.
>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes.
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
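As a smaller sketch of the dict constructor, the example below uses just three of the entries above. It assumes that at least one value is array-like so pandas can infer the number of rows; scalars are then repeated for every row. The name frame is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": 1.0,                                # scalar: repeated on every row
    "D": np.array([3] * 4, dtype="int32"),   # array: fixes the length at 4
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

print(frame.shape)   # (4, 3)
print(frame.dtypes)  # one dtype per column: float64, int32, category
```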
II. Viewing data
See the Basics section.
1. Viewing the first and last n rows
>>> df.head() # df.head(n) shows the first n rows of df; n defaults to 5
A B C D
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
>>> df.tail(3) # df.tail(n) shows the last n rows of df; n defaults to 5
A B C D
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
2. Viewing the index and column labels
>>> df.index # view the index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns # view the column labels
Index(['A', 'B', 'C', 'D'], dtype='object')
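3. DataFrame.to_numpy
DataFrame.to_numpy() returns a NumPy representation of the underlying data.
Note: this can be an expensive operation when the DataFrame has columns with different data types. It comes down to a fundamental difference between pandas and NumPy: a NumPy array has a single dtype for the whole array, while a pandas DataFrame has one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. That may end up being object, which requires converting every value to a Python object and is therefore slow.
>>> df.to_numpy() # every column of df is floating point, so DataFrame.to_numpy() is fast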
array([[ 0.74516421, -1.07112293, 0.07509773, -0.45194753],
[ 0.77663096, -0.46546163, 0.27268217, 0.32562184],
[-0.10357239, -0.29129193, -0.84571597, -0.69860938],
[ 0.65574123, -0.73156903, -0.72771026, -1.11328066],
[-0.52707953, -0.81557927, -0.29516127, -1.16619946],
[-1.71795821, -1.07791279, -0.41672553, 0.190645 ]])
>>>
>>> df2.to_numpy() # df2 has several different dtypes, so DataFrame.to_numpy() is comparatively expensive
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: the output of DataFrame.to_numpy() does not include the index or column labels.
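The dtype rule described above can be checked on two tiny frames. This is a sketch with made-up data; the names homogeneous and mixed are illustrative.

```python
import pandas as pd

# All columns are float64, so the result is a plain float64 array.
homogeneous = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# A float column next to a string column has no common numeric dtype,
# so pandas falls back to an array of Python objects.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(homogeneous.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)        # object

# Only the values are returned; index and column labels are dropped.
print(homogeneous.to_numpy().shape)  # (2, 2)
```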
4. Viewing a statistical summary of the data
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.028512 -0.742156 -0.322922 -0.485628
std 0.982188 0.318212 0.438155 0.635467
min -1.717958 -1.077913 -0.845716 -1.166199
25% -0.421203 -1.007237 -0.649964 -1.009613
50% 0.276084 -0.773574 -0.355943 -0.575278
75% 0.722808 -0.531988 -0.017467 0.029997
max 0.776631 -0.291292 0.272682 0.325622
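To make the rows of the summary easier to read, here is a sketch on a column whose statistics can be worked out by hand. The data and names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

print(summary.loc["count", "x"])  # 4.0
print(summary.loc["mean", "x"])   # 2.5
print(summary.loc["50%", "x"])    # 2.5 (the median)

# describe() reports the sample standard deviation (ddof=1), like Series.std().
sample_std = np.std([1.0, 2.0, 3.0, 4.0], ddof=1)
print(np.isclose(summary.loc["std", "x"], sample_std))  # True
```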
5. Transposing
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.745164 0.776631 -0.103572 0.655741 -0.527080 -1.717958
B -1.071123 -0.465462 -0.291292 -0.731569 -0.815579 -1.077913
C 0.075098 0.272682 -0.845716 -0.727710 -0.295161 -0.416726
D -0.451948 0.325622 -0.698609 -1.113281 -1.166199 0.190645
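6. Sorting by an axis
>>> # axis=1 sorts by column labels; the default, axis=0, sorts by the index
>>> # ascending=False sorts in descending order; the default, ascending=True, sorts in ascending order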
>>> df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 -0.451948 0.075098 -1.071123 0.745164
2013-01-02 0.325622 0.272682 -0.465462 0.776631
2013-01-03 -0.698609 -0.845716 -0.291292 -0.103572
2013-01-04 -1.113281 -0.727710 -0.731569 0.655741
2013-01-05 -1.166199 -0.295161 -0.815579 -0.527080
2013-01-06 0.190645 -0.416726 -1.077913 -1.717958
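7. Sorting by values
>>> # by: the label(s) to sort by. With axis=0, by is a column label; with axis=1, by is an index label
>>> # ascending: True (the default) sorts in ascending order, False in descending order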
>>> df.sort_values(by='B', axis=0, ascending=False)
A B C D
2013-01-03 -0.103572 -0.291292 -0.845716 -0.698609
2013-01-02 0.776631 -0.465462 0.272682 0.325622
2013-01-04 0.655741 -0.731569 -0.727710 -1.113281
2013-01-05 -0.527080 -0.815579 -0.295161 -1.166199
2013-01-01 0.745164 -1.071123 0.075098 -0.451948
2013-01-06 -1.717958 -1.077913 -0.416726 0.190645
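Because the frame above holds random numbers, the two kinds of sorting are easier to compare on a small deterministic example. This is a sketch with made-up labels and values.

```python
import pandas as pd

frame = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=["r2", "r0", "r1"])

by_row_label = frame.sort_index()                      # sort rows by index label
by_col_label = frame.sort_index(axis=1)                # sort columns by label
by_value = frame.sort_values(by="B", ascending=False)  # sort rows by the values in B

print(list(by_row_label.index))    # ['r0', 'r1', 'r2']
print(list(by_col_label.columns))  # ['A', 'B']
print(list(by_value["B"]))         # [3, 2, 1]

# Both methods return a new object; the original order is untouched.
print(list(frame.index))           # ['r2', 'r0', 'r1']
```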
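III. Selecting data
Note: while standard Python/NumPy expressions for selecting and setting are intuitive and convenient for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc.
For more advanced indexing, see the documentation on Indexing and Selecting Data and MultiIndex / Advanced Indexing.
1. Selecting data with square brackets
>>> df['A'] # select a single column, equivalent to df.A; returns a Series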
2013-01-01 0.847323
2013-01-02 -0.638694
2013-01-03 -0.859098
2013-01-04 1.425489
2013-01-05 -1.398528
2013-01-06 1.047252
Freq: D, Name: A, dtype: float64
>>> df[0:3] # slice rows by position (here the first three rows); equivalent to df['2013-01-01':'2013-01-03']
A B C D
2013-01-01 0.847323 1.242429 -0.339945 -1.625278
2013-01-02 -0.638694 1.303991 1.299221 0.980656
2013-01-03 -0.859098 -0.858955 -0.492993 -0.236749
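The two row slices are equivalent here only because a positional slice excludes its stop while a label slice includes both endpoints. A sketch on deterministic data (np.arange is used in place of random numbers so the result can be checked):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = frame["A"]                             # a single column comes back as a Series
by_position = frame[0:3]                     # rows 0, 1, 2: the stop (3) is excluded
by_label = frame["2013-01-01":"2013-01-03"]  # both endpoint labels are included

print(type(col).__name__)            # Series
print(len(by_position))              # 3
print(by_position.equals(by_label))  # True
```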
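2. Selecting data by label
For more information, see the pandas documentation on selection by label. Usage: df.loc[row labels, column labels]
>>> df.loc[:, ['A','C']] # select columns A and C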
A C
2013-01-01 0.847323 -0.339945
2013-01-02 -0.638694 1.299221
2013-01-03 -0.859098 -0.492993
2013-01-04 1.425489 -0.460793
2013-01-05 -1.398528 0.309689
2013-01-06 1.047252 -1.635620
>>> df.loc[:, 'A':'C'] # select columns A through C (both endpoints included)
A B C
2013-01-01 0.847323 1.242429 -0.339945
2013-01-02 -0.638694 1.303991 1.299221
2013-01-03 -0.859098 -0.858955 -0.492993
2013-01-04 1.425489 0.687216 -0.460793
2013-01-05 -1.398528 1.607066 0.309689
2013-01-06 1.047252 0.131745 -1.635620
>>> df.loc[dates[0],] # select the row labeled 2013-01-01
A 0.847323
B 1.242429
C -0.339945
D -1.625278
Name: 2013-01-01 00:00:00, dtype: float64
>>> df.loc[dates[0:2],'A':'C'] # select rows '2013-01-01' and '2013-01-02', columns A through C
A B C
2013-01-01 0.847323 1.242429 -0.339945
2013-01-02 -0.638694 1.303991 1.299221
>>> df.loc[dates[0],'C'] # select the value in the first row, column C; returns a single scalar
-0.3399446074291146
>>> df.at[dates[0],'C'] # equivalent to df.loc[dates[0],'C'], but df.at is faster and can only access a single value
-0.3399446074291146
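The label-based calls above can be summarised in one sketch. It assumes a frame filled with np.arange values instead of random numbers, so each selected value is known in advance.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = frame.loc[dates[0]]                # one row label -> Series indexed by column
block = frame.loc[dates[0:2], "A":"C"]   # label slice 'A':'C' includes column C
scalar_loc = frame.loc[dates[0], "C"]    # row label + column label -> scalar
scalar_at = frame.at[dates[0], "C"]      # same scalar through the scalar-only accessor

print(list(row))     # [0, 1, 2, 3]
print(block.shape)   # (2, 3)
print(scalar_loc)    # 2
print(scalar_at)     # 2
```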
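3. Selecting data by position
For more information, see the pandas documentation on selection by position. Usage: df.iloc[row positions, column positions]
>>> df.iloc[1] # select the 2nd row by integer position; equivalent to df.iloc[1, :]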
A -0.638694
B 1.303991
C 1.299221
D 0.980656
Name: 2013-01-02 00:00:00, dtype: float64
>>> df.iloc[0:2, [0,1,3]] # select the 1st and 2nd rows and the 1st, 2nd and 4th columns
A B D
2013-01-01 0.847323 1.242429 -1.625278
2013-01-02 -0.638694 1.303991 0.980656
>>> df.iloc[1,0] # select the value in the 2nd row, 1st column
-0.638694268332326
>>> df.iat[1,0] # equivalent to df.iloc[1,0], but df.iat is faster and can only access a single value
-0.638694268332326
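Positions are zero-based and slices follow Python rules, so the stop is excluded. A sketch on a frame of np.arange values (illustrative data, not the random frame above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

second_row = frame.iloc[1]           # position 1 is the 2nd row
block = frame.iloc[0:2, [0, 1, 3]]   # rows 0-1 (stop excluded), columns A, B, D
scalar_iloc = frame.iloc[1, 0]       # 2nd row, 1st column
scalar_iat = frame.iat[1, 0]         # same scalar through the scalar-only accessor

print(list(second_row))     # [4, 5, 6, 7]
print(list(block.columns))  # ['A', 'B', 'D']
print(scalar_iloc)          # 4
print(scalar_iat)           # 4
```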
4. Boolean indexing
- Use the values of a single column to select data.
>>> df[df.A > 0]
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
- Select values from a DataFrame where a boolean condition is met.
>>> df[df > 0]
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
- Use the isin() method for filtering:
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
>>> df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
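The three forms of boolean indexing behave differently: a boolean Series keeps or drops whole rows, a boolean DataFrame keeps the shape and blanks out individual cells, and isin() builds a row mask from a list of allowed values. A sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({
    "A": [1.0, -2.0, 3.0],
    "B": [-1.0, 5.0, -6.0],
    "E": ["one", "two", "four"],
})

# Boolean Series: rows where the condition is False are dropped.
rows = frame[frame["A"] > 0]

# Boolean DataFrame: the shape is kept, cells failing the test become NaN.
numeric = frame[["A", "B"]]
masked = numeric[numeric > 0]

# isin(): keep rows whose value in E is in the given list.
picked = frame[frame["E"].isin(["two", "four"])]

print(list(rows.index))           # [0, 2]
print(masked.shape)               # (3, 2)
print(int(masked.isna().sum().sum()))  # 3
print(list(picked.index))         # [1, 2]
```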
5. Setting values
- Setting a new column automatically aligns the data by index.
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
>>> df
A B C D F
2013-01-01 0.147348 1.196578 -0.830143 -0.819528 NaN
2013-01-02 -0.323415 0.273535 0.591451 -0.455048 1.0
2013-01-03 -0.915444 1.683675 -1.005351 0.036724 2.0
2013-01-04 -0.133637 0.604580 -1.344744 0.654401 3.0
2013-01-05 1.631007 0.524544 -1.103851 -0.834705 4.0
2013-01-06 -1.668588 1.298758 -1.557362 -1.177257 5.0
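Alignment explains both the NaN in the first row and the missing 2013-01-07 value above: labels present only in the frame get NaN, and labels present only in the Series are dropped. A smaller sketch of the same behaviour, with illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30]}, index=pd.date_range("20130101", periods=3))

# s1 starts one day later than the frame and ends one day after it.
s1 = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))

frame["F"] = s1

print(frame["F"].tolist())  # [nan, 1.0, 2.0]
print(len(frame))           # 3: the extra 2013-01-04 label in s1 is not added
print(frame["F"].dtype)     # float64, because NaN forces a float column
```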
- Set a value by label:
>>> df.at[dates[0], 'A'] = 0
- Set a value by position:
>>> df.iat[0, 1] = 0
- Set a column by assigning a NumPy array:
>>> df.loc[:, 'D'] = np.array([5] * len(df))
- The result of the preceding set operations:
>>> df
A B C D F
2013-01-01 0.000000 0.000000 -0.830143 5 NaN
2013-01-02 -0.323415 0.273535 0.591451 5 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0
2013-01-05 1.631007 0.524544 -1.103851 5 4.0
2013-01-06 -1.668588 1.298758 -1.557362 5 5.0
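The three assignments can be replayed on a small frame to see which cells each one touches. This is a sketch with illustrative data; whether column D is afterwards displayed as 5 or 5.0 may depend on the pandas version, so only the values are checked.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
frame = pd.DataFrame(np.full((3, 2), 1.5), index=dates, columns=["A", "D"])

frame.at[dates[0], "A"] = 0                     # by label: first row, column A
frame.iat[1, 0] = 0                             # by position: second row, first column
frame.loc[:, "D"] = np.array([5] * len(frame))  # whole column from a NumPy array

print(frame["A"].tolist())          # [0.0, 0.0, 1.5]
print(bool((frame["D"] == 5).all()))  # True
```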
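IV. Handling missing data
pandas primarily uses the value np.nan to represent missing data. By default, it is not included in computations. See the Missing Data section.
- Reindexing lets you change, add, or delete the index on a specified axis. This returns a copy of the data.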
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.iloc[0:2, -1] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 -0.830143 5 NaN 1.0
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0 NaN
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0 NaN
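Since reindex returns a copy, the assignment to df1 above leaves df unchanged. A sketch of that, with made-up labels (the row "z" and column "E" do not exist in the original frame and are filled with NaN):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Keep rows a and b, drop c, add a new row z and a new column E.
sub = frame.reindex(index=["a", "b", "z"], columns=["A", "E"])
sub.iloc[0, -1] = 1

print(sub.shape)                     # (3, 2)
print(sub.loc["a", "E"])             # 1.0
print(bool(pd.isna(sub.loc["z", "A"])))  # True: new labels are filled with NaN

# The original frame is untouched by the reindex and the assignment.
print(list(frame.columns))           # ['A']
print(list(frame.index))             # ['a', 'b', 'c']
```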
- Drop any rows that have missing data:
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
- Fill in missing data:
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 -0.830143 5 5.0 1.0
2013-01-02 -0.323415 0.273535 0.591451 5 1.0 1.0
2013-01-03 -0.915444 1.683675 -1.005351 5 2.0 5.0
2013-01-04 -0.133637 0.604580 -1.344744 5 3.0 5.0
- Get the boolean mask of where values are nan:
>>> pd.isna(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
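The statement that missing values are left out of computations by default, and the three operations above, can be checked on a small frame. This sketch uses illustrative data; note that dropna and fillna return new objects rather than modifying the frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"F": [np.nan, 1.0, 2.0], "E": [1.0, 1.0, np.nan]})

# Reductions skip NaN by default; skipna=False propagates it instead.
print(frame["F"].mean())              # 1.5
print(frame["F"].mean(skipna=False))  # nan

dropped = frame.dropna(how="any")  # keep only rows with no missing value
filled = frame.fillna(value=5)     # replace every NaN with 5
mask = pd.isna(frame)              # True where a value is missing

print(list(dropped.index))           # [1]
print(filled["F"].tolist())          # [5.0, 1.0, 2.0]
print(int(mask.sum().sum()))         # 2
print(int(frame.isna().sum().sum())) # 2: the original frame still has its NaNs
```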