[toc]

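Introduction

This article covers the two fundamental pandas data types, Series and DataFrame, and walks through how to create and index them.

To use pandas, import the following libraries: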

In [1]: import numpy as np
In [2]: import pandas as pd

Series

A Series is a one-dimensional labeled array; its axis labels are collectively called the index. A Series is created as follows:

>>> s = pd.Series(data, index=index)

Here data can be a Python dict, a NumPy ndarray, or a scalar.

index is a list of axis labels. The following sections show each way of creating a Series.

Creating from an ndarray

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
Out[67]:
a   -1.300797
b   -2.044172
c   -1.170739
d   -0.445290
e    1.208784
dtype: float64

Use the index attribute to get the index:

s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Creating from a dict

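d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)
Out[70]:
b    1
a    0
c    2
dtype: int64

With Python 3.6+ and pandas 0.23+, the index follows the dict's insertion order. Earlier versions sorted the keys, which would give a, b, c here.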
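When an index is passed together with a dict, the values are looked up by label, and labels missing from the dict become NaN. A minimal sketch of that behavior (the label 'd' is an illustrative addition that is not a key in the dict):

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are pulled out of the dict by label, in the order given by index.
# 'd' is not a key in the dict, so its value is NaN, which also forces
# the dtype to float64.
s_from_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_from_dict)
```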

Creating from a scalar

pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[71]:
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

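Series and ndarray

A Series behaves much like an ndarray: positional indexing, slicing, and boolean masks work the way they do on an ndarray. Note that recent pandas versions deprecate integer-position access through s[0] and s[[4, 3, 1]] on a label-indexed Series; s.iloc[0] and s.iloc[[4, 3, 1]] are the explicit positional forms.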

s[0]
Out[72]: -1.3007972194268396
s[:3]
Out[73]:
a   -1.300797
b   -2.044172
c   -1.170739
dtype: float64
s[s > s.median()]
Out[74]:
d   -0.445290
e    1.208784
dtype: float64
s[[4, 3, 1]]
Out[75]:
e    1.208784
d   -0.445290
b   -2.044172
dtype: float64
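The same positional operations can be written with the explicit .iloc indexer. The sketch below uses fixed values instead of random ones so that the results are reproducible; the values and variable names are illustrative assumptions, not taken from the session above.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for reproducibility; the article uses random data).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]             # positional access to the first element
head = s.iloc[:3]             # positional slice; the labels are kept
picked = s.iloc[[4, 3, 1]]    # positional selection with a list of integers
above = s[s > s.median()]     # boolean mask, as with an ndarray

# to_numpy() returns the values as a plain ndarray, without the labels.
arr = s.to_numpy()
print(first, list(head.index), list(picked.index), list(above.index), arr)
```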

Series and dict

When accessed by label, a Series behaves much like a dict:

s['a']
Out[80]: -1.3007972194268396
s['e'] = 12.
s
Out[82]:
a    -1.300797
b    -2.044172
c    -1.170739
d    -0.445290
e    12.000000
dtype: float64
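The dict analogy extends to membership tests and to get with a default value. A small sketch, with illustrative values and a label 'z' that is deliberately absent:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in s    # membership tests check the index labels, not the values
has_z = 'z' in s

# get() returns the supplied default instead of raising for a missing label.
missing = s.get('z', np.nan)

# Plain bracket access with a missing label raises KeyError, as a dict does.
try:
    s['z']
    raised = False
except KeyError:
    raised = True

print(has_a, has_z, missing, raised)
```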

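Vectorized operations and label alignment

A Series supports concise vectorized operations, so there is no need to loop over the values: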

s + s
Out[83]:
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64
s * 2
Out[84]:
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64
np.exp(s)
Out[85]:
a         0.272315
b         0.129487
c         0.310138
d         0.640638
e    162754.791419
dtype: float64
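The examples above operate on a Series with itself, so the labels always match. Label alignment becomes visible when the two operands have different labels. A minimal sketch, with illustrative values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# s.iloc[1:] has labels b, c, d; s.iloc[:-1] has labels a, b, c.
# Values are matched by label, not by position. The result is indexed by
# the union of the labels, and a label present on only one side gives NaN.
total = s.iloc[1:] + s.iloc[:-1]
print(total)
```

Only 'b' and 'c' appear on both sides, so only those two entries hold a sum.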

The name attribute

A Series also has a name attribute, which can be set at creation time:

s = pd.Series(np.random.randn(5), name='something')
s
Out[88]:
0    0.192272
1    0.110410
2    1.442358
3   -0.375792
4    1.228111
Name: something, dtype: float64

A Series also has a rename method, which returns a renamed Series:

s2 = s.rename("different")
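When given a scalar, rename returns a new Series and leaves the original name untouched. A short check of that, using illustrative values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='something')
s2 = s.rename('different')   # a new Series object carrying the new name

print(s.name, s2.name)
```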

DataFrame

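A DataFrame is a two-dimensional labeled data structure made up of Series; you can think of it as a spreadsheet, such as an Excel sheet. A DataFrame can be created from the following kinds of data:

  • A dict of one-dimensional ndarrays, lists, dicts, or Series
  • A structured array
  • A two-dimensional numpy.ndarray
  • Another DataFrame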

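Creating from Series

A DataFrame can be created from a dict of Series. The resulting index is the union of the indexes of the individual Series: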

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
Out[92]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Pass index to select and reorder the rows:

pd.DataFrame(d, index=['d', 'b', 'a'])
Out[93]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

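Pass columns to select and reorder the columns. A column name that is not in the dict, such as 'three' here, is filled with NaN: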

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[94]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

Creating from ndarrays and lists

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
pd.DataFrame(d)
Out[96]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[97]:
   one  twoa
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
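The list of accepted inputs also includes a two-dimensional numpy.ndarray, which the examples above do not show. A minimal sketch of that case, with an illustrative array and labels:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]

# Without index/columns, both axes get default integer labels.
df_default = pd.DataFrame(arr)

# Labels can be supplied explicitly; their lengths must match the array shape.
df_labeled = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['one', 'two'])
print(df_labeled)
```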

Creating from a structured array

A DataFrame can be created from a structured array. The field type 'S10' below is a 10-byte string field; it replaces the original 'a10', an alias for the same type that NumPy 2.0 deprecates.

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
In [49]: pd.DataFrame(data)
Out[49]:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]:
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]:
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Creating from a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [53]: pd.DataFrame(data2)
Out[53]:
   a   b     c
0  1   2   NaN
1  5  10  20.0
In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]:
        a   b     c
first   1   2   NaN
second  5  10  20.0
In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]:
   a   b
0  1   2
1  5  10

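Creating from a dict of tuples

A more complex DataFrame, with a MultiIndex on both axes, can be created from a dict whose keys are tuples: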

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....:
Out[56]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
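Once the frame is built, the tuple keys drive selection as well. The sketch below uses a two-column subset of the data above; the name df_multi and the choice of lookups are illustrative assumptions.

```python
import pandas as pd

df_multi = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
})

# The tuple keys become the levels of a MultiIndex on the columns.
is_multi = isinstance(df_multi.columns, pd.MultiIndex)

# Selecting the outer column level 'a' returns the inner columns beneath it.
sub = df_multi['a']

# A full (row tuple, column tuple) pair addresses a single cell.
value = df_multi.loc[('A', 'B'), ('a', 'a')]
print(is_multi, list(sub.columns), value)
```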

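Column selection, addition, and deletion

The columns of a DataFrame can be read, computed on, and assigned much like a Series, using dict-like syntax: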

In [64]: df['one']
Out[64]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
In [65]: df['three'] = df['one'] * df['two']
In [66]: df['flag'] = df['one'] > 2
In [67]: df
Out[67]:
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

A column can be deleted with del, or removed and returned with pop:

In [68]: del df['two']
In [69]: three = df.pop('three')
In [70]: df
Out[70]:
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
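Both del and pop modify the DataFrame in place. When the original should stay intact, drop is a non-mutating alternative. A sketch contrasting the three, using illustrative data and names:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0], 'three': [5.0, 6.0]})

three = df.pop('three')   # removes the column in place and returns it as a Series
del df['two']             # removes the column in place and returns nothing

# drop() leaves its input untouched and returns a new DataFrame.
df_all = pd.DataFrame({'one': [1.0], 'two': [2.0]})
df_dropped = df_all.drop(columns=['two'])

print(list(df.columns), three.name, list(df_all.columns), list(df_dropped.columns))
```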

Assigning a scalar fills the entire column with that value:

In [71]: df['foo'] = 'bar'
In [72]: df
Out[72]:
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

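By default, a new column is appended as the last column; use insert to place it at a specific position. The one_trunc column in the output below comes from an intermediate step, df['one_trunc'] = df['one'][:2], which assigns a Series that covers only the first two rows, so the remaining rows are filled with NaN.

In [73]: df['one_trunc'] = df['one'][:2]
In [75]: df.insert(1, 'bar', df['one'])
In [76]: df
Out[76]:
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN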
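The three ways of adding a column can be put side by side in one small, reproducible sketch. The data and the column names copy_of_one and partial are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'flag': [False, False, True]},
                  index=['a', 'b', 'c'])

# Plain assignment appends the column at the end; a scalar is broadcast.
df['foo'] = 'bar'

# insert(loc, column, value) places the column at 0-based position loc, in place.
df.insert(1, 'copy_of_one', df['one'])

# Assigning a Series that covers only some rows aligns on the index;
# rows without a matching label get NaN.
df['partial'] = df['one'].iloc[:2]

print(df)
```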

Use assign to derive new columns from existing ones:

In [77]: iris = pd.read_csv('data/iris.data')
In [78]: iris.head()
Out[78]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....:
Out[79]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Note that assign returns a new DataFrame; the original DataFrame is left unchanged.
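assign also accepts a callable, which receives the DataFrame being assigned to; this is handy in method chains where no variable holds the intermediate result. The sketch below uses a tiny stand-in for the iris data (the values are made up for illustration, not read from the real file):

```python
import pandas as pd

# Illustrative stand-in for the iris data; not the real file contents.
df = pd.DataFrame({'SepalLength': [5.0, 4.0], 'SepalWidth': [2.5, 3.0]})

# The lambda is called with the DataFrame and returns the new column.
result = df.assign(sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

print(result)
print(list(df.columns))   # the original DataFrame has no sepal_ratio column
```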

The following table summarizes indexing and selection on a DataFrame:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
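Each row of the table can be checked on a small frame. A minimal sketch, with illustrative data and a shorter slice than the df[5:10] in the table:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

col = df['one']               # select a column            -> Series
row_by_label = df.loc['b']    # select a row by label      -> Series
row_by_pos = df.iloc[1]       # select a row by position   -> Series
sliced = df[0:2]              # slice rows                 -> DataFrame
masked = df[df['one'] > 1]    # select by boolean vector   -> DataFrame

for name, obj in [('col', col), ('row_by_label', row_by_label),
                  ('row_by_pos', row_by_pos), ('sliced', sliced),
                  ('masked', masked)]:
    print(name, type(obj).__name__)
```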

This article is also available at http://www.flydean.com/03-python-pandas-data-structures/
