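pandas provides a number of basic operations for rearranging tabular data. These are referred to as reshape or pivot operations.
Reshaping with hierarchical indexing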
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
stack: rotate the columns into the rows (the DataFrame collapses into a Series)
result = data.stack()
result
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
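To make the effect of stack concrete, here is a minimal sketch that rebuilds the same frame and inspects the result. The variable names is_series, level_names and ohio_three are illustrative only and do not appear in the notes.

```python
import numpy as np
import pandas as pd

# Same frame as in the notes above.
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

result = data.stack()

# The 2-D frame has become a 1-D Series ...
is_series = isinstance(result, pd.Series)

# ... whose index has two levels: the old row index, then the old column index.
level_names = list(result.index.names)

# A single cell is now addressed by a (row label, column label) pair.
ohio_three = int(result.loc[('Ohio', 'three')])

print(is_series, level_names, ohio_three)
```

The column labels have not been lost: they simply moved into the inner level of the row index.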
unstack: rotate the rows back into the columns (the Series expands into a DataFrame)
result.unstack()
By default, unstack operates on the innermost index level. Pass a level number or a level name to unstack a different level
result.unstack(0)
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
Or, equivalently, by level name
result.unstack('state')
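The following sketch checks two claims made above, under the assumption that the frame is built exactly as in the notes: selecting the level by number or by name gives the same result, and unstack with no argument undoes stack. The names by_position, by_name, round_trip and restored are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

# Level 0 is the outer level, which is named 'state'.
by_position = result.unstack(0)
by_name = result.unstack('state')
same = by_position.equals(by_name)

# With no argument the innermost level ('number') goes back to the columns.
round_trip = result.unstack()

# Reindex by the original labels before comparing, so the check does not
# depend on the label order that unstack happens to produce.
restored = round_trip.loc[data.index, data.columns].equals(data)

print(same, restored)
```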
unstack can introduce missing values
If not every value of a level is present in each subgroup, unstack fills the gaps with missing data (NaN)
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
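stack filters out missing data by default; pass dropna=False to keep it. (This describes the original stack implementation. The new implementation introduced in pandas 2.1 via future_stack=True keeps missing values and does not accept dropna.)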
data2.unstack().stack()
data2.unstack().stack(dropna=False)
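Because the default handling of missing values in stack differs between pandas versions, the sketch below drops the holes explicitly with dropna() after stacking. This is an alternative that should behave the same regardless of which stack implementation is active; it is not the call used in the notes. The names wide, n_missing and long_again are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# 'one' has no 'e', and 'two' has no 'a' or 'b', so unstack creates 3 holes.
wide = data2.unstack()
n_missing = int(wide.isna().sum().sum())

# Stack back to long form and remove the holes explicitly.
long_again = wide.stack().dropna()

print(n_missing, len(long_again))
```

Note that the values come back as floats, since the intermediate wide frame had to hold NaN.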
When you unstack a DataFrame, the level being unstacked becomes the lowest level of the column index in the result
df = DataFrame({'left': result, 'right': result + 5},
               columns=pd.Index(['left', 'right'], name='side'))
df
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
df.unstack('state')
side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
df.unstack('state').stack('side')
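As a sketch of where each level ends up, the code below rebuilds df and inspects the level names after each step. The names wide, back, column_levels, back_index_levels and back_columns_name are illustrative and not part of the notes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

# 'state' leaves the row index and becomes the lowest column level.
wide = df.unstack('state')
column_levels = list(wide.columns.names)

# Stacking 'side' moves that column level into the rows instead,
# leaving 'state' as the only column level.
back = wide.stack('side')
back_index_levels = list(back.index.names)
back_columns_name = back.columns.name

print(column_levels, back_index_levels, back_columns_name)
```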
Pivoting "long" format to "wide" format
Time series data is often stored in databases or CSV files in so-called long (or stacked) format
ldata[:10]
         date     item     value
0  1959-03-31  realgdp  2710.349
1  1959-03-31     infl     0.000
2  1959-03-31    unemp     5.800
3  1959-06-30  realgdp  2778.801
4  1959-06-30     infl     2.340
5  1959-06-30    unemp     5.100
6  1959-09-30  realgdp  2775.488
7  1959-09-30     infl     2.740
8  1959-09-30    unemp     5.300
9  1959-12-31  realgdp  2785.204
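Turn each distinct item value into its own column, using the date column as the index
# pivot takes keyword arguments; positional arguments are rejected in pandas 2.0+
pivoted = ldata.pivot(index='date', columns='item', values='value')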
pivoted.head()
item        infl   realgdp  unemp
date
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2
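The first two arguments (index and columns) name the columns to use as the row index and the column index, respectively; the last one (values) names the column whose data fills the DataFrame.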
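The notes do not show how ldata is loaded, so the sketch below uses a hypothetical stand-in, ldata_small, built by hand from the first six rows of the listing above. It only illustrates the role of the three pivot arguments.

```python
import pandas as pd

# Hypothetical stand-in for ldata: the first two quarters from the listing above.
ldata_small = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100],
})

# index   -> labels for the rows
# columns -> labels for the columns (one column per distinct item)
# values  -> the data that fills the cells
pivoted_small = ldata_small.pivot(index='date', columns='item', values='value')

print(pivoted_small)
```

Each (date, item) pair must be unique for pivot to work; with duplicates, pandas raises a ValueError because it cannot decide which value belongs in the cell.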
Suppose there are two value columns that need to be reshaped at the same time
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
         date     item     value    value2
0  1959-03-31  realgdp  2710.349 -0.204708
1  1959-03-31     infl     0.000  0.478943
2  1959-03-31    unemp     5.800 -0.519439
3  1959-06-30  realgdp  2778.801 -0.555730
4  1959-06-30     infl     2.340  1.965781
5  1959-06-30    unemp     5.100  1.393406
6  1959-09-30  realgdp  2775.488  0.092908
7  1959-09-30     infl     2.740  0.281746
8  1959-09-30    unemp     5.300  0.769023
9  1959-12-31  realgdp  2785.204  1.246435
If you omit the last argument (values), the resulting DataFrame has hierarchical columns
pivoted = ldata.pivot(index='date', columns='item')
pivoted[:5]
           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-03-31  0.00  2710.349   5.8  0.478943 -0.204708 -0.519439
1959-06-30  2.34  2778.801   5.1  1.965781 -0.555730  1.393406
1959-09-30  2.74  2775.488   5.3  0.281746  0.092908  0.769023
1959-12-31  0.27  2785.204   5.6  1.007189  1.246435 -1.296221
1960-03-31  2.31  2847.699   5.2  0.228913  0.274992  1.352917
pivot is just a shortcut: it creates a hierarchical index with set_index and then reshapes with unstack
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:7]
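The equivalence can be checked directly. The sketch below again uses a hypothetical hand-built frame, ldata_small, because the notes do not show how ldata is created. The value column is taken from the listing above, while the value2 numbers are arbitrary placeholders rather than the random draws shown in the notes.

```python
import pandas as pd

# Hypothetical stand-in for ldata with two value columns.
ldata_small = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100],
    'value2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # placeholder numbers
})

# Shortcut: no values argument, so every remaining column is pivoted.
via_pivot = ldata_small.pivot(index='date', columns='item')

# Long way: build the (date, item) hierarchical index, then unstack 'item'.
via_unstack = ldata_small.set_index(['date', 'item']).unstack('item')

same = via_pivot.equals(via_unstack)
n_column_levels = via_pivot.columns.nlevels

print(same, n_column_levels)
```

The outer column level holds the original column names (value, value2) and the inner level holds the item labels, so a single cell is selected with a tuple such as ('value', 'infl').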