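Introduction
This article introduces the two basic pandas data structures, Series and DataFrame, and walks through how to create and index them.
Using pandas requires the following imports:

    In [1]: import numpy as np
    In [2]: import pandas as pd
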
Series
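A Series is a one-dimensional array whose elements carry labels; the labels are collectively called the index. A Series is created like this:
>>> s = pd.Series(data, index=index)
Here data can be a Python dict, a NumPy ndarray, or a scalar value.
index is a list of axis labels. The sections below cover each way of creating a Series in turn.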
Creating from an ndarray

    s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    s
    Out[67]:
    a   -1.300797
    b   -2.044172
    c   -1.170739
    d   -0.445290
    e    1.208784
    dtype: float64

Use the index attribute to get the index:

    s.index
    Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

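Creating from a dict

    d = {'b': 1, 'a': 0, 'c': 2}
    pd.Series(d)
    Out[70]:
    b    1
    a    0
    c    2
    dtype: int64

When no index is passed, the labels follow the dict's insertion order (Python 3.6+ with pandas 0.23+); older versions sorted the keys instead.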
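When an index is passed together with a dict, the values are looked up by label, and labels that are absent from the dict come back as NaN. A minimal sketch of that behaviour (the variable name s_dict is only illustrative):

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# The index both selects and orders the values by label.
# 'd' is not a key of the dict, so its value is NaN and the
# dtype is upcast from int64 to float64 to hold the NaN.
s_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_dict)
```

NaN is the missing-data marker pandas uses here, which is why an integer dict can yield a float Series.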
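Creating from a scalar
When data is a scalar, an index must be supplied; the value is repeated to match the length of the index:

    pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
    Out[71]:
    a    5.0
    b    5.0
    c    5.0
    d    5.0
    e    5.0
    dtype: float64
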
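Series and ndarray
A Series behaves much like an ndarray: it supports positional indexing, slicing, boolean masks, and lists of positions. For a Series with a non-integer index, positional access should go through .iloc, because treating s[0] as a positional lookup is deprecated in recent pandas versions:

    s.iloc[0]
    Out[72]: -1.3007972194268396
    s.iloc[:3]
    Out[73]:
    a   -1.300797
    b   -2.044172
    c   -1.170739
    dtype: float64
    s[s > s.median()]
    Out[74]:
    d   -0.445290
    e    1.208784
    dtype: float64
    s.iloc[[4, 3, 1]]
    Out[75]:
    e    1.208784
    d   -0.445290
    b   -2.044172
    dtype: float64
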
Series and dict
When accessed by label, a Series behaves much like a dict:

    s['a']
    Out[80]: -1.3007972194268396
    s['e'] = 12.
    s
    Out[82]:
    a    -1.300797
    b    -2.044172
    c    -1.170739
    d    -0.445290
    e    12.000000
    dtype: float64

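The dict analogy extends to membership tests and to get. The sketch below uses a small stand-in Series (s_demo is a hypothetical name, not the s from the session above) to show that `in` checks labels rather than values, and that get returns a default instead of raising for a missing label:

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Membership is tested against the index labels, as with dict keys.
has_a = 'a' in s_demo
has_e = 'e' in s_demo

# get() returns None (or the supplied default) for a missing label,
# whereas s_demo['f'] would raise KeyError.
missing = s_demo.get('f')
fallback = s_demo.get('f', 0.0)

print(has_a, has_e, missing, fallback)
```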
Vectorized operations and label alignment
Operations on a Series are vectorized, so there is usually no need to loop over the values one by one:

    s + s
    Out[83]:
    a    -2.601594
    b    -4.088344
    c    -2.341477
    d    -0.890581
    e    24.000000
    dtype: float64
    s * 2
    Out[84]:
    a    -2.601594
    b    -4.088344
    c    -2.341477
    d    -0.890581
    e    24.000000
    dtype: float64
    np.exp(s)
    Out[85]:
    a         0.272315
    b         0.129487
    c         0.310138
    d         0.640638
    e    162754.791419
    dtype: float64

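The examples above all combine Series that share the same index, so label alignment is not visible in them. A minimal sketch of alignment, using a hypothetical Series s_align: when two Series with different labels are added, the result is indexed by the union of the labels, and labels present on only one side yield NaN.

```python
import numpy as np
import pandas as pd

s_align = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                    index=['a', 'b', 'c', 'd', 'e'])

# The left operand lacks 'a' and the right operand lacks 'e'.
# Values are matched by label, not by position.
total = s_align.iloc[1:] + s_align.iloc[:-1]
print(total)
```

Labels 'b', 'c', and 'd' exist on both sides and are summed; 'a' and 'e' exist on one side only and become NaN.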
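The name attribute
A Series also has a name attribute, which can be set when the Series is created:

    s = pd.Series(np.random.randn(5), name='something')
    s
    Out[88]:
    0    0.192272
    1    0.110410
    2    1.442358
    3   -0.375792
    4    1.228111
    Name: something, dtype: float64

A Series also has a rename method, which returns a new Series with a different name:

    s2 = s.rename("different")
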
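A short sketch showing that rename leaves the original Series untouched, and that the name becomes the column label when the Series is converted to a DataFrame (s_named and s_renamed are illustrative names):

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name='something')

# rename() with a scalar returns a new Series carrying the new name.
s_renamed = s_named.rename('different')

print(s_named.name)     # the original keeps its name
print(s_renamed.name)

# The name is used as the column label in to_frame().
frame = s_renamed.to_frame()
print(list(frame.columns))
```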
DataFrame
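A DataFrame is a two-dimensional labeled data structure whose columns are Series; you can think of it as an Excel spreadsheet. A DataFrame can be created from the following kinds of data:
- A dict of one-dimensional ndarrays, lists, dicts, or Series
- A structured array
- A two-dimensional numpy.ndarray
- Another DataFrame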
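Creating from Series
A DataFrame can be created from a dict of Series. The resulting index is the union of the indexes of the individual Series, and missing entries are filled with NaN:

    d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    df
    Out[92]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0

Pass index to select and reorder the rows:

    pd.DataFrame(d, index=['d', 'b', 'a'])
    Out[93]:
       one  two
    d  NaN  4.0
    b  2.0  2.0
    a  1.0  1.0

Pass columns to select and reorder the columns; a column name that is not a key of the dict ('three' here) is filled with NaN:

    pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
    Out[94]:
       two three
    d  4.0   NaN
    b  2.0   NaN
    a  1.0   NaN
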
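The row and column labels of a DataFrame are available through its index and columns attributes. The sketch below rebuilds the same dict of Series and inspects the labels (df_full and df_sub are illustrative names):

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Without index/columns, rows are the union of the Series indexes
# and columns are the dict keys.
df_full = pd.DataFrame(d)
print(list(df_full.index), list(df_full.columns))

# With index/columns, only the requested labels are kept, in the
# requested order; 'three' is not in d, so that column is all NaN.
df_sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(list(df_sub.index), list(df_sub.columns))
```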
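Creating from ndarrays and lists
The arrays or lists must all have the same length. If an index is passed, it must have that length as well; otherwise the index defaults to 0 through n - 1:

    d = {'one': [1., 2., 3., 4.],
         'two': [4., 3., 2., 1.]}
    pd.DataFrame(d)
    Out[96]:
       one  two
    0  1.0  4.0
    1  2.0  3.0
    2  3.0  2.0
    3  4.0  1.0
    pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
    Out[97]:
       one  two
    a  1.0  4.0
    b  2.0  3.0
    c  3.0  2.0
    d  4.0  1.0
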
Creating from a structured array
A DataFrame can be created from a NumPy structured array. The byte-string field is declared as 'S10' here, because the older 'a10' spelling is no longer accepted by NumPy 2.0:

    In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
    In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
    In [49]: pd.DataFrame(data)
    Out[49]:
       A    B         C
    0  1  2.0  b'Hello'
    1  2  3.0  b'World'
    In [50]: pd.DataFrame(data, index=['first', 'second'])
    Out[50]:
            A    B         C
    first   1  2.0  b'Hello'
    second  2  3.0  b'World'
    In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
    Out[51]:
              C  A    B
    0  b'Hello'  1  2.0
    1  b'World'  2  3.0

Creating from a list of dicts
Each dict becomes one row. Keys missing from a dict are filled with NaN:

    In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    In [53]: pd.DataFrame(data2)
    Out[53]:
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    In [54]: pd.DataFrame(data2, index=['first', 'second'])
    Out[54]:
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
    In [55]: pd.DataFrame(data2, columns=['a', 'b'])
    Out[55]:
       a   b
    0  1   2
    1  5  10

Creating from a dict of tuples
Passing a dict whose keys are tuples creates a more complex DataFrame with a MultiIndex on both rows and columns:

    In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
       ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
       ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
       ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
       ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       ....:
    Out[56]:
           a              b
           b    a    c    a     b
    A B  1.0  4.0  5.0  8.0  10.0
      C  2.0  3.0  6.0  7.0   NaN
      D  NaN  NaN  NaN  NaN   9.0

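To make the tuple-keyed structure more concrete, the sketch below builds a smaller frame of the same shape and reads values back through the two-level labels. The name df_multi and the reduced data are assumptions made for illustration:

```python
import pandas as pd

df_multi = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# Both axes carry a two-level MultiIndex built from the tuples.
print(df_multi.columns.nlevels, df_multi.index.nlevels)

# Selecting the outer column label returns the sub-frame beneath it.
under_a = df_multi['a']
print(under_a)

# A single cell is addressed by a (row tuple, column tuple) pair.
cell = df_multi.loc[('A', 'B'), ('a', 'b')]
print(cell)
```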
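Column selection, addition, and deletion
A DataFrame can be treated much like a dict of Series: getting, setting, and deleting columns use the same syntax as the corresponding dict operations:

    In [64]: df['one']
    Out[64]:
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
    In [65]: df['three'] = df['one'] * df['two']
    In [66]: df['flag'] = df['one'] > 2
    In [67]: df
    Out[67]:
       one  two  three   flag
    a  1.0  1.0    1.0  False
    b  2.0  2.0    4.0  False
    c  3.0  3.0    9.0   True
    d  NaN  4.0    NaN  False

Columns can be deleted with del or removed with pop:

    In [68]: del df['two']
    In [69]: three = df.pop('three')
    In [70]: df
    Out[70]:
       one   flag
    a  1.0  False
    b  2.0  False
    c  3.0   True
    d  NaN  False
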
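The difference between the two is that del simply discards the column, while pop removes it and hands it back as a Series. A minimal sketch on a small hypothetical frame (df_cols is an illustrative name):

```python
import pandas as pd

df_cols = pd.DataFrame({'one': [1., 2.],
                        'two': [3., 4.],
                        'three': [5., 6.]})

# del removes the column in place and returns nothing.
del df_cols['two']

# pop removes the column in place and returns it as a Series.
three = df_cols.pop('three')

print(list(df_cols.columns))
print(three)
```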
Assigning a scalar fills the entire column with that value:

    In [71]: df['foo'] = 'bar'
    In [72]: df
    Out[72]:
       one   flag  foo
    a  1.0  False  bar
    b  2.0  False  bar
    c  3.0   True  bar
    d  NaN  False  bar

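By default, a new column is appended as the last column. Use insert to place a column at a specific position. The one_trunc column in the output below comes from assigning the first two rows of 'one'; because the assigned Series is aligned to the DataFrame's index, the remaining rows are NaN:

    In [73]: df['one_trunc'] = df['one'][:2]
    In [75]: df.insert(1, 'bar', df['one'])
    In [76]: df
    Out[76]:
       one  bar   flag  foo  one_trunc
    a  1.0  1.0  False  bar        1.0
    b  2.0  2.0  False  bar        2.0
    c  3.0  3.0   True  bar        NaN
    d  NaN  NaN  False  bar        NaN
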
Use assign to derive new columns from existing ones:

    In [77]: iris = pd.read_csv('data/iris.data')
    In [78]: iris.head()
    Out[78]:
       SepalLength  SepalWidth  PetalLength  PetalWidth         Name
    0          5.1         3.5          1.4         0.2  Iris-setosa
    1          4.9         3.0          1.4         0.2  Iris-setosa
    2          4.7         3.2          1.3         0.2  Iris-setosa
    3          4.6         3.1          1.5         0.2  Iris-setosa
    4          5.0         3.6          1.4         0.2  Iris-setosa
    In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
       ....:      .head())
       ....:
    Out[79]:
       SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
    0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
    1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
    2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
    3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
    4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Note that assign returns a new DataFrame; the original DataFrame is left unchanged.
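The session above depends on a local data/iris.data file. The sketch below reproduces the idea on a two-row stand-in frame (df_iris and its values are assumptions for illustration) and also shows the callable form of assign, where the function receives the DataFrame being assigned to:

```python
import pandas as pd

# Hypothetical two-row stand-in for the iris data.
df_iris = pd.DataFrame({'SepalLength': [5.0, 4.0],
                        'SepalWidth': [2.5, 3.0]})

# The lambda is evaluated against the frame assign is called on.
with_ratio = df_iris.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

print(list(with_ratio.columns))
print(list(df_iris.columns))   # the original has no new column
```

The callable form is convenient in method chains, where the intermediate frame has no variable name to refer to.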
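The following table summarizes indexing and selection on a DataFrame:
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer position | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |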
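Each row of the table can be checked on a small frame. The sketch below uses a hypothetical three-row DataFrame (df_sel) and records the type returned by each kind of selection:

```python
import pandas as pd

df_sel = pd.DataFrame({'one': [1., 2., 3.],
                       'two': [4., 5., 6.]},
                      index=['a', 'b', 'c'])

col = df_sel['one']                      # select column
row_by_label = df_sel.loc['b']           # select row by label
row_by_pos = df_sel.iloc[1]              # select row by integer position
rows_slice = df_sel[0:2]                 # slice rows
rows_bool = df_sel[df_sel['one'] > 1]    # select rows by boolean vector

for name, obj in [('col', col), ('row_by_label', row_by_label),
                  ('row_by_pos', row_by_pos), ('rows_slice', rows_slice),
                  ('rows_bool', rows_bool)]:
    print(name, type(obj).__name__)
```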
This article is also published at http://www.flydean.com/03-python-pandas-data-structures/