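Introduction

Before pandas 1.0, there was only one way to store text data: the object dtype. Version 1.0 added a new data type called StringDtype. This article walks through how to work with text in pandas.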

Creating text data

Let's first look at a common example of building a Series from text:

In [1]: pd.Series(['a', 'b', 'c'])
Out[1]:
0    a
1    b
2    c
dtype: object

To use the new StringDtype instead, you can write:

In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string

Or convert an existing Series with astype:

In [4]: s = pd.Series(['a', 'b', 'c'])

In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string
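As a minimal sketch of what the conversion does to missing values, the snippet below converts an object Series that contains None. The sample data and the variable names raw and text are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
raw = pd.Series(["a", None, "c"], dtype=object)
text = raw.astype("string")

# The converted Series uses the StringDtype extension type.
print(isinstance(text.dtype, pd.StringDtype))

# Missing entries are represented by pd.NA after the conversion.
print(text.isna().tolist())
```

After the conversion, missing entries show up as <NA> when the Series is printed, which matches the outputs in the sections that follow.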

String methods

Strings can be converted to lower case or upper case, and their lengths can be computed:

In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....:

In [25]: s.str.lower()
Out[25]:
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]:
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]:
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64
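The Int64 result of len() above is one practical difference from the object dtype. The following sketch compares the two on the same data; the list words and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
words = ["Aaba", None, "dog"]
as_string = pd.Series(words, dtype="string")
as_object = pd.Series(words, dtype=object)

# With StringDtype the lengths stay integers and the gap is <NA>.
string_lengths = as_string.str.len()
# With object dtype the gap is NaN, which forces a float result.
object_lengths = as_object.str.len()

print(string_lengths.dtype)
print(object_lengths.dtype)
```

With StringDtype the lengths keep a nullable integer dtype, while the object version falls back to float64 because NaN cannot be stored in a plain integer column.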

You can also strip whitespace:

In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

String operations on columns

Because column labels are strings, you can manipulate them with the usual string methods:

In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=[' Column A ', ' Column B '], index=range(3))
   ....:

In [33]: df
Out[33]:
    Column A    Column B
0    0.469112   -0.282863
1   -1.509059   -1.135632
2    1.212112   -0.173215

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
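These calls return a new Index, so they can be chained and assigned back to df.columns. The sketch below shows one possible way to clean up column names; the underscore naming convention is an assumption, not something the original prescribes.

```python
import numpy as np
import pandas as pd

# Same column labels as above; zeros replace the random data for brevity.
df = pd.DataFrame(np.zeros((3, 2)), columns=[" Column A ", " Column B "])

# Strip outer whitespace, lower-case, then replace inner spaces literally.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))  # ['column_a', 'column_b']
```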

Splitting and replacing strings

split breaks each string into a list of substrings.

In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

In [39]: s2.str.split('_')
Out[39]:
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

To access an element of the lists produced by split, use get or [] on the .str accessor:

In [40]: s2.str.split('_').str.get(1)
Out[40]:
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split('_').str[1]
Out[41]:
0       b
1       d
2    <NA>
3       g
dtype: object

Passing expand=True expands the split result into multiple columns:

In [42]: s2.str.split('_', expand=True)
Out[42]:
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

You can also limit the number of splits with n:

In [43]: s2.str.split('_', expand=True, n=1)
Out[43]:
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
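rsplit, listed in the summary table at the end, works like split but counts splits from the right. A small sketch of the difference when n=1, reusing the s2 data from above (the names left and right are assumptions):

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# split with n=1 cuts at the first underscore from the left.
left = s2.str.split("_", n=1, expand=True)
# rsplit with n=1 cuts at the first underscore from the right.
right = s2.str.rsplit("_", n=1, expand=True)

print(left.iloc[0].tolist())   # ['a', 'b_c']
print(right.iloc[0].tolist())  # ['a_b', 'c']
```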

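replace substitutes text in each string, and the pattern can be a regular expression. From pandas 2.0 on, the regex parameter defaults to False, so pass regex=True when the pattern should be treated as a regular expression:

# s3 is not defined in the original snippet; this sample Series is an assumption.
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', '', np.nan, 'CABA', 'dog', 'cat'], dtype="string")
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)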
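Two further uses of replace are sketched below: a purely literal replacement, and a replacement computed by a callable that receives each regex match object. The sample Series prices and words are assumptions for illustration.

```python
import pandas as pd

# Literal replacement: with regex=False the dot is an ordinary character.
prices = pd.Series(["1.5", "2.0", None], dtype="string")
with_commas = prices.str.replace(".", ",", regex=False)

# Callable replacement: each lower-case run is replaced by its reverse.
words = pd.Series(["foo 123", "bar baz", None], dtype="string")
reversed_words = words.str.replace(
    r"[a-z]+", lambda m: m.group(0)[::-1], regex=True
)

print(with_commas.tolist()[:2])     # ['1,5', '2,0']
print(reversed_words.tolist()[:2])  # ['oof 123', 'rab zab']
```

Stating regex explicitly in every call avoids depending on the default, which differs between pandas versions.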

Concatenating strings

Use cat to concatenate strings:

In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")

In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'
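cat also has to deal with missing values, and it can join two Series element by element. The sketch below illustrates both; the Series letters and numbers are assumptions for illustration.

```python
import pandas as pd

letters = pd.Series(["a", "b", None, "d"], dtype="string")
numbers = pd.Series(["1", "2", "3", "4"], dtype="string")

# By default, missing values are skipped when joining into one string.
joined = letters.str.cat(sep=",")
# na_rep supplies a placeholder for missing values instead.
joined_with_placeholder = letters.str.cat(sep=",", na_rep="-")
# Passing another Series concatenates element by element.
paired = letters.str.cat(numbers, sep="_")

print(joined)                   # a,b,d
print(joined_with_placeholder)  # a,b,-,d
print(paired.tolist()[:2])      # ['a_1', 'b_2']
```

In the element-wise case, a missing value on either side gives a missing result unless na_rep is set.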

Indexing with .str

Indexing the .str accessor returns a Series. If the Series holds strings, [] gives the character at that position in each element, for example:

In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....:

In [100]: s.str[0]
Out[100]:
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [101]: s.str[1]
Out[101]:
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
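The same bracket syntax also accepts slices and negative positions. A short sketch on assumed sample data:

```python
import pandas as pd

s = pd.Series(["Aaba", "dog", None, "C"], dtype="string")

# A slice keeps the first two characters of each element.
first_two = s.str[:2]
# A negative position counts from the end of each string.
last_char = s.str[-1]

print(first_two.tolist()[:2])  # ['Aa', 'do']
print(last_char.tolist()[:2])  # ['a', 'g']
```

Slicing a string that is shorter than the slice returns whatever is available, so the single-character element 'C' stays 'C' in first_two.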

extract

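extract pulls data out of strings. It takes an expand parameter, which defaulted to False before version 0.23. With expand=False, extract returns a Series, an Index, or a DataFrame, depending on the input and the pattern. With expand=True it always returns a DataFrame. Since version 0.23 the default is True.

extract is normally used together with regular expressions.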

In [102]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'([ab])(\d)', expand=False)
   .....:
Out[102]:
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

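The example above splits each string in the Series according to the regular expression: the first part is a letter and the second part is a digit.

Note that only the data inside the regex capture groups is extracted.

The following extracts only the digit: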

In [106]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'[ab](\d)', expand=False)
   .....:
Out[106]:
0       1
1       2
2    <NA>
dtype: string

You can also name the resulting columns by using named groups:

In [103]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
   .....:                                       expand=False)
   .....:
Out[103]:
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>
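The extracted columns are still text, even when they hold digits. The sketch below adds a conversion step that is not part of the original: it casts the digit column to the nullable Int64 dtype so it can be used in arithmetic.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# Named groups become the column names of the resulting DataFrame.
parts = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# The digits are extracted as text; cast them to a nullable integer dtype.
parts["digit"] = parts["digit"].astype("Int64")

print(list(parts.columns))          # ['letter', 'digit']
print(parts["digit"].tolist()[:2])  # [1, 2]
```

The row for 'c3' does not match the pattern, so both of its columns stay missing after the cast.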

extractall

extractall is similar to extract. The difference is that extract only returns the first match, while extractall returns every match. For example:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....:

In [113]: s
Out[113]:
A    a1a2
B      b1
C      c1
dtype: string

In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [115]: s.str.extract(two_groups, expand=True)
Out[115]:
  letter digit
A      a     1
B      b     1
C      c     1

extract stops after matching a1.

In [116]: s.str.extractall(two_groups)
Out[116]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

extractall matches a1 and then goes on to match a2.
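The result of extractall has a two-level index: the original label plus a match counter. As a rough sketch of how to work with it, the snippet below counts the matches per original row; the groupby step is an illustration, not part of the original.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# One row per match; the second index level is named "match".
matches = s.str.extractall(two_groups)

# Group on the first index level to count matches per original row.
matches_per_row = matches.groupby(level=0).size()

print(matches.shape)              # (4, 2)
print(matches_per_row.to_dict())  # {'A': 2, 'B': 1, 'C': 1}
```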

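contains and match

contains, match, and fullmatch test whether each element of a Series contains a pattern, matches it from the start, or matches it in full: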

In [126]: pattern = r'[0-9][a-z]'

In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.contains(pattern)
   .....:
Out[127]:
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.match(pattern)
   .....:
Out[128]:
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.fullmatch(pattern)
   .....:
Out[129]:
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
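One way to remember the difference is that the three methods line up with re.search, re.match, and re.fullmatch from the standard library. The sketch below checks that correspondence on the same values. The pattern is the one consistent with the outputs above, and fullmatch needs pandas 1.1 or later.

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Plain-Python equivalents, computed one string at a time.
by_search = [bool(re.search(pattern, v)) for v in values]
by_match = [bool(re.match(pattern, v)) for v in values]
by_fullmatch = [bool(re.fullmatch(pattern, v)) for v in values]

# The vectorized pandas methods give the same answers.
print(s.str.contains(pattern).tolist() == by_search)
print(s.str.match(pattern).tolist() == by_match)
print(s.str.fullmatch(pattern).tolist() == by_fullmatch)
```

'03c' is the interesting case: the pattern occurs inside it, so contains is True, but it does not start at position 0, so match and fullmatch are False.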

Summary of string methods

Finally, here is a summary of the string methods:

Method	Description
cat()	Concatenate strings
split()	Split strings on delimiter
rsplit()	Split strings on delimiter working from the end of the string
get()	Index into each element (retrieve i-th element)
join()	Join strings in each element of the Series with passed separator
get_dummies()	Split strings on the delimiter returning DataFrame of dummy variables
contains()	Return boolean array if each string contains pattern/regex
replace()	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat()	Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()	Add whitespace to left, right, or both sides of strings
center()	Equivalent to str.center
ljust()	Equivalent to str.ljust
rjust()	Equivalent to str.rjust
zfill()	Equivalent to str.zfill
wrap()	Split long strings into lines with length less than a given width
slice()	Slice each string in the Series
slice_replace()	Replace slice in each string with passed value
count()	Count occurrences of pattern
startswith()	Equivalent to str.startswith(pat) for each element
endswith()	Equivalent to str.endswith(pat) for each element
findall()	Compute list of all occurrences of pattern/regex for each string
match()	Call re.match on each element, returning a boolean
extract()	Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()	Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()	Compute string lengths
strip()	Equivalent to str.strip
rstrip()	Equivalent to str.rstrip
lstrip()	Equivalent to str.lstrip
partition()	Equivalent to str.partition
rpartition()	Equivalent to str.rpartition
lower()	Equivalent to str.lower
casefold()	Equivalent to str.casefold
upper()	Equivalent to str.upper
find()	Equivalent to str.find
rfind()	Equivalent to str.rfind
index()	Equivalent to str.index
rindex()	Equivalent to str.rindex
capitalize()	Equivalent to str.capitalize
swapcase()	Equivalent to str.swapcase
normalize()	Return Unicode normal form. Equivalent to unicodedata.normalize
translate()	Equivalent to str.translate
isalnum()	Equivalent to str.isalnum
isalpha()	Equivalent to str.isalpha
isdigit()	Equivalent to str.isdigit
isspace()	Equivalent to str.isspace
islower()	Equivalent to str.islower
isupper()	Equivalent to str.isupper
istitle()	Equivalent to str.istitle
isnumeric()	Equivalent to str.isnumeric
isdecimal()	Equivalent to str.isdecimal
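Most entries in the table mirror Python's built-in str methods. get_dummies has no direct built-in counterpart, so here is a minimal sketch of what it does, using assumed sample data with "|" as the delimiter:

```python
import pandas as pd

# Hypothetical tags, several per element, separated by "|".
tags = pd.Series(["a|b", "a", "a|c"], dtype="string")

# One indicator column per distinct tag.
dummies = tags.str.get_dummies(sep="|")

print(list(dummies.columns))  # ['a', 'b', 'c']
print(dummies.astype(int).values.tolist())
```

Each row has a 1 in the columns whose tags appear in that element and a 0 elsewhere.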

This article is also available at http://www.flydean.com/06-python-pandas-text/