pandas provides a number of basic operations for rearranging tabular data. These are known as reshape or pivot operations.

Reshaping with hierarchical indexing

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))

stack: pivots the columns into the rows, reducing the DataFrame to a Series

result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

unstack: pivots the rows into the columns, turning the Series back into a DataFrame

result.unstack()
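
As a quick check of the round trip described above, the following minimal sketch rebuilds the same small frame, stacks it, and unstacks it again. The names `stacked` and `restored` are only illustrative and do not appear in the original notes.

```python
import numpy as np
import pandas as pd

# Same small frame as in the text above.
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

stacked = data.stack()        # Series indexed by (state, number)
restored = stacked.unstack()  # DataFrame again, 'number' back in the columns

print(type(stacked).__name__, list(stacked.index.names))
print(restored.equals(data))  # compares values, labels and dtypes
```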

By default, unstack operates on the innermost index level. To unstack a different level, pass its number or name.

result.unstack(0)

state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

Or, equivalently, by level name:

result.unstack('state')
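
The sketch below illustrates that a level can be selected either by position or by name, and that the default corresponds to the innermost level. It rebuilds `result` so that it runs on its own; `by_number`, `by_name` and `innermost` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

by_number = result.unstack(0)       # level 0 is 'state'
by_name = result.unstack('state')   # same level, selected by name
innermost = result.unstack(-1)      # -1 is the innermost level, the default

print(by_number.equals(by_name))
print(innermost.equals(result.unstack()))
```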

unstack can introduce missing values

If not every value of a level is present in each subgroup, unstack introduces missing data.

s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

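By default, stack filters out missing data, so the operation is easy to invert. Passing dropna=False keeps the missing values. Newer pandas releases are moving to a different stack implementation in which the dropna argument is no longer accepted, so check the behaviour of your installed version.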

data2.unstack().stack()
data2.unstack().stack(dropna=False)
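
The following sketch counts the missing cells that unstack introduces and then recovers the original seven values. It drops the NaN entries explicitly with `dropna()` instead of relying on the default of stack, which is an assumption made here so that the result does not depend on the installed pandas version. The names `wide`, `n_missing` and `restacked` are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

wide = data2.unstack()
# Cells of the 2 x 5 grid that had no matching entry in data2.
n_missing = int(wide.isna().sum().sum())

# Stack back, then drop NaN explicitly so the outcome is the same whether
# or not the installed stack implementation filters missing values itself.
restacked = wide.stack().dropna()

print(n_missing, len(restacked))
```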

When you unstack a DataFrame, the level being unstacked becomes the lowest level of the resulting columns.

df = DataFrame({'left': result, 'right': result + 5},
               columns=pd.Index(['left', 'right'], name='side'))
df

side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df.unstack('state')


side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

df.unstack('state').stack('side')
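
To make the level ordering concrete, the sketch below inspects the axis names after each step. It names the columns with `rename_axis` instead of passing a named Index, which is a substitution made here for the example and is intended to produce the same frame. `wide` and `back` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

# Two-column frame whose column axis is named 'side'.
df = pd.DataFrame({'left': result, 'right': result + 5}).rename_axis(columns='side')

wide = df.unstack('state')
print(list(wide.columns.names))   # 'state' sits below 'side'

back = wide.stack('side')         # move 'side' from the columns into the rows
print(list(back.index.names), back.columns.name)
```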

Pivoting "long" format to "wide" format

Time series data is often stored in databases or CSV files in long (stacked) format.

ldata[:10]

         date     item     value
0  1959-03-31  realgdp  2710.349
1  1959-03-31     infl     0.000
2  1959-03-31    unemp     5.800
3  1959-06-30  realgdp  2778.801
4  1959-06-30     infl     2.340
5  1959-06-30    unemp     5.100
6  1959-09-30  realgdp  2775.488
7  1959-09-30     infl     2.740
8  1959-09-30    unemp     5.300
9  1959-12-31  realgdp  2785.204

Turn each distinct item value into its own column, using the date column as the index:

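# Keyword arguments are required by current pandas; positional use was removed.
pivoted = ldata.pivot(index='date', columns='item', values='value')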
pivoted.head()

item        infl   realgdp  unemp
date                             
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2

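The first two arguments (index and columns) name the columns to use as the row index and the column index, respectively. The last argument (values) names the column whose data fills the DataFrame.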
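
Since `ldata` is not constructed in these notes, here is a minimal self-contained sketch of the same call. The frame `ldata_small` is a hypothetical miniature built by hand from the first six rows printed above, with dates kept as plain strings.

```python
import pandas as pd

# Hypothetical miniature of ldata, copied from the first six rows shown above.
ldata_small = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100],
})

# index -> row labels, columns -> new column labels, values -> cell contents
pivoted_small = ldata_small.pivot(index='date', columns='item', values='value')
print(pivoted_small)
```

Each (date, item) pair occurs exactly once in this frame, so every cell of the wide table is filled.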

Suppose there are two value columns that you want to reshape at the same time:

ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]


         date     item     value    value2
0  1959-03-31  realgdp  2710.349 -0.204708
1  1959-03-31     infl     0.000  0.478943
2  1959-03-31    unemp     5.800 -0.519439
3  1959-06-30  realgdp  2778.801 -0.555730
4  1959-06-30     infl     2.340  1.965781
5  1959-06-30    unemp     5.100  1.393406
6  1959-09-30  realgdp  2775.488  0.092908
7  1959-09-30     infl     2.740  0.281746
8  1959-09-30    unemp     5.300  0.769023
9  1959-12-31  realgdp  2785.204  1.246435

If the last argument is omitted, the resulting DataFrame has hierarchical columns:

pivoted = ldata.pivot(index='date', columns='item')
pivoted[:5]

           value                    value2                    
item        infl   realgdp unemp      infl   realgdp     unemp
date                                                          
1959-03-31  0.00  2710.349   5.8  0.478943 -0.204708 -0.519439
1959-06-30  2.34  2778.801   5.1  1.965781 -0.555730  1.393406
1959-09-30  2.74  2775.488   5.3  0.281746  0.092908  0.769023
1959-12-31  0.27  2785.204   5.6  1.007189  1.246435 -1.296221
1960-03-31  2.31  2847.699   5.2  0.228913  0.274992  1.352917

pivot is just a shortcut: it is equivalent to creating a hierarchical index with set_index and then reshaping with unstack.

unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:7]
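
The equivalence can be checked on a small frame. In the sketch below, `small` is a hypothetical stand-in for `ldata`, and its `value2` column holds made-up numbers chosen only to be easy to trace.

```python
import pandas as pd

# Hypothetical stand-in for ldata with two value columns.
small = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.000, 5.800, 2778.801, 2.340, 5.100],
    'value2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # made-up numbers
})

# Omitting values keeps every remaining column, giving two column levels.
via_pivot = small.pivot(index='date', columns='item')

# The same reshape spelled out: build the MultiIndex, then unstack 'item'.
via_unstack = small.set_index(['date', 'item']).unstack('item')

print(via_pivot.columns.nlevels)
print(via_pivot.equals(via_unstack))
```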
