Introduction
Before pandas 1.0, the only way to store text data was the object dtype. pandas 1.0 added a dedicated data type for text, StringDtype. Today we walk through how pandas handles text.
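Creating text data
Let's start with a common example of building a Series from text. The sessions below assume import pandas as pd and import numpy as np: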
In [1]: pd.Series(['a', 'b', 'c'])
Out[1]:
0 a
1 b
2 c
dtype: object
To use the new StringDtype, request it explicitly:
In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]:
0 a
1 b
2 c
dtype: string
In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]:
0 a
1 b
2 c
dtype: string
Or convert an existing Series with astype:
In [4]: s = pd.Series(['a', 'b', 'c'])
In [5]: s
Out[5]:
0 a
1 b
2 c
dtype: object
In [6]: s.astype("string")
Out[6]:
0 a
1 b
2 c
dtype: string
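The three routes above can be compared side by side. This is a minimal sketch; the variable names are illustrative, and the dtype that pandas infers for the plain Series depends on the installed version (it is object in the 1.x releases described here).

```python
import pandas as pd

# Default inference: the resulting dtype depends on the pandas version.
plain = pd.Series(["a", "b", "c"])

# Three ways to end up with StringDtype.
by_alias = pd.Series(["a", "b", "c"], dtype="string")
by_class = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
converted = plain.astype("string")

for name, series in [("alias", by_alias), ("class", by_class), ("astype", converted)]:
    print(name, series.dtype)
```

Note that astype returns a new Series; the original plain Series keeps its dtype.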
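String methods
Through the .str accessor, strings can be converted to lower case or upper case, and their lengths can be computed. Missing values stay missing: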
In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
....: dtype="string")
....:
In [25]: s.str.lower()
Out[25]:
0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8 cat
dtype: string
In [26]: s.str.upper()
Out[26]:
0 A
1 B
2 C
3 AABA
4 BACA
5 <NA>
6 CABA
7 DOG
8 CAT
dtype: string
In [27]: s.str.len()
Out[27]:
0 1
1 1
2 1
3 4
4 4
5 <NA>
6 4
7 3
8 3
dtype: Int64
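One visible difference between the two text dtypes is the result type of numeric methods such as len. The sketch below uses a shortened, made-up word list to show it; with the string dtype the lengths come back as nullable Int64, while with object dtype the missing value forces float64.

```python
import numpy as np
import pandas as pd

words = ["A", "Aaba", np.nan, "dog"]
as_string = pd.Series(words, dtype="string")
as_object = pd.Series(words, dtype=object)

lowered = as_string.str.lower()          # the missing value stays <NA>
string_lengths = as_string.str.len()     # nullable integer result
object_lengths = as_object.str.len()     # float result, because of NaN

print(lowered.tolist())
print(string_lengths.dtype, object_lengths.dtype)
```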
You can also strip whitespace. The same methods are available on an Index:
In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
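To make the effect of each strip variant easy to see, the following sketch prints the labels with repr so the surrounding spaces are visible. The padded names are sample data chosen for illustration.

```python
import pandas as pd

idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

both = idx.str.strip()     # remove whitespace on both sides
left = idx.str.lstrip()    # remove leading whitespace only
right = idx.str.rstrip()   # remove trailing whitespace only

for label, result in [("strip", both), ("lstrip", left), ("rstrip", right)]:
    print(label, [repr(value) for value in result])
```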
String operations on columns
Because column labels are strings, you can work on df.columns with the same string methods:
In [32]: df = pd.DataFrame(np.random.randn(3, 2),
....: columns=[' Column A ', ' Column B '], index=range(3))
....:
In [33]: df
Out[33]:
Column A Column B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
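These methods return a new Index rather than changing the DataFrame, so a typical clean-up assigns the result back to df.columns. The sketch below chains several steps; the frame and its untidy labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with untidy column labels (extra spaces, mixed case).
df = pd.DataFrame(np.zeros((2, 2)), columns=[" Column A ", " Column B "])

# Chain .str methods on the column Index, then assign the result back.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)  # literal replacement, no regex
)

print(list(df.columns))  # ['column_a', 'column_b']
```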
Splitting and replacing strings
split breaks each string into a list of pieces:
In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
In [39]: s2.str.split('_')
Out[39]:
0 [a, b, c]
1 [c, d, e]
2 <NA>
3 [f, g, h]
dtype: object
To access an element of the lists produced by split, use get or [] indexing:
In [40]: s2.str.split('_').str.get(1)
Out[40]:
0 b
1 d
2 <NA>
3 g
dtype: object
In [41]: s2.str.split('_').str[1]
Out[41]:
0 b
1 d
2 <NA>
3 g
dtype: object
With expand=True, the lists produced by split are expanded into multiple columns:
In [42]: s2.str.split('_', expand=True)
Out[42]:
0 1 2
0 a b c
1 c d e
2 <NA> <NA> <NA>
3 f g h
You can also limit the number of splits with n:
In [43]: s2.str.split('_', expand=True, n=1)
Out[43]:
0 1
0 a b_c
1 c d_e
2 <NA> <NA>
3 f g_h
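The split variants shown above can be collected in one short script. It is a sketch on the same sample strings; rsplit, which splits from the right, is added for comparison and does not appear in the sessions above.

```python
import pandas as pd

s = pd.Series(["a_b_c", "c_d_e", pd.NA, "f_g_h"], dtype="string")

lists = s.str.split("_")                          # one list per element
second = s.str.split("_").str[1]                  # second piece of each element
columns = s.str.split("_", expand=True)           # one column per piece
left_once = s.str.split("_", n=1, expand=True)    # split at the first "_" only
right_once = s.str.rsplit("_", n=1, expand=True)  # split at the last "_" only

print(left_once)
print(right_once)
```

With n=1, split keeps everything after the first delimiter together, while rsplit keeps everything before the last delimiter together.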
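replace substitutes text, and the pattern can be a regular expression. Pass regex=True explicitly, because since pandas 2.0 the pattern is treated as a literal string by default (here s3 is any Series of strings):
s3.str.replace('^.a|dog', 'XX-XX', case=False, regex=True)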
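Since the replace call above relies on a Series that is not defined in this article, here is a self-contained sketch on made-up data. It sets regex explicitly in both calls so the behaviour does not depend on the pandas version's default.

```python
import pandas as pd

animals = pd.Series(["Aaba", "dog", "cat"], dtype="string")

# Regex replacement: "^.a" matches any character followed by "a" at the
# start of the string; case=False makes the match case-insensitive.
masked = animals.str.replace(r"^.a|dog", "XX-XX", case=False, regex=True)
print(masked.tolist())  # ['XX-XXba', 'XX-XX', 'XX-XXt']

# Literal replacement: "$" is a regex metacharacter, so turn regex off.
prices = pd.Series(["12$", "$5"], dtype="string")
cleaned = prices.str.replace("$", "", regex=False)
print(cleaned.tolist())  # ['12', '5']
```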
Concatenating strings
Use cat to concatenate the strings of a Series into one string:
In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'
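cat has two further behaviours worth knowing: missing values are skipped unless na_rep is given, and passing a second list-like joins element by element instead of collapsing to one string. A small sketch, with sample values chosen for illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", pd.NA, "d"], dtype="string")

# Collapse the whole Series into one string; missing values are skipped.
joined = s.str.cat(sep=",")
print(joined)  # a,b,d

# Give missing values a placeholder instead of dropping them.
joined_with_placeholder = s.str.cat(sep=",", na_rep="-")
print(joined_with_placeholder)  # a,b,-,d

# Element-wise concatenation with a second list of the same length.
paired = s.str.cat(["A", "B", "C", "D"], sep="-", na_rep="?")
print(paired.tolist())  # ['a-A', 'b-B', '?-C', 'd-D']
```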
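Indexing with .str
If a Series holds strings, indexing with .str[] returns a Series containing the character at that position in each element. Elements that are too short, or missing, yield <NA>. For example: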
In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
....: 'CABA', 'dog', 'cat'],
....: dtype="string")
....:
In [100]: s.str[0]
Out[100]:
0 A
1 B
2 C
3 A
4 B
5 <NA>
6 C
7 d
8 c
dtype: string
In [101]: s.str[1]
Out[101]:
0 <NA>
1 <NA>
2 <NA>
3 a
4 a
5 <NA>
6 A
7 o
8 a
dtype: string
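The .str[] accessor also accepts negative positions and slices, much like indexing a Python string. The following sketch uses a shortened sample Series:

```python
import pandas as pd

s = pd.Series(["Aaba", "B", pd.NA, "dog"], dtype="string")

first = s.str[0]      # first character of each element
second = s.str[1]     # <NA> where the string has fewer than two characters
last = s.str[-1]      # negative positions count from the end
prefix = s.str[:2]    # slices work too; short strings are returned as-is

print(pd.DataFrame({"first": first, "second": second, "last": last, "prefix": prefix}))
```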
extract
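extract pulls data out of strings. It takes an expand parameter, which defaulted to False before version 0.23. With expand=False, extract returns a Series, an Index, or a DataFrame, depending on the input and the pattern. With expand=True it always returns a DataFrame. Since version 0.23 the default is True.
extract works with a regular expression that contains capture groups.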
In [102]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'([ab])(\d)', expand=False)
.....:
Out[102]:
0 1
0 a 1
1 b 2
2 <NA> <NA>
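The example above breaks each string in the Series apart according to the regular expression: the first group is a letter and the second group is a digit. Elements that do not match produce <NA>.
Note that only the data inside regex groups is extracted.
The following pattern extracts only the digit: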
In [106]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'[ab](\d)', expand=False)
.....:
Out[106]:
0 1
1 2
2 <NA>
dtype: string
Groups can also be named, and the names become the column names:
In [103]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
.....: expand=False)
.....:
Out[103]:
letter digit
0 a 1
1 b 2
2 <NA> <NA>
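How expand and the number of groups decide the return type can be checked directly. This sketch reuses the sample strings from the sessions above; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# Two named groups: a DataFrame with one column per group.
both = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

# One group with expand=False: a Series.
digit_series = s.str.extract(r"[ab](\d)", expand=False)

# One group with expand=True: a one-column DataFrame.
digit_frame = s.str.extract(r"[ab](\d)", expand=True)

print(type(both).__name__, type(digit_series).__name__, type(digit_frame).__name__)
```

The third element, c3, does not match [ab], so it produces missing values in every result.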
extractall
extractall is similar to extract. The difference is that extract returns only the first match in each string, while extractall returns every match. For example:
In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
.....: dtype="string")
.....:
In [113]: s
Out[113]:
A a1a2
B b1
C c1
dtype: string
In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
In [115]: s.str.extract(two_groups, expand=True)
Out[115]:
letter digit
A a 1
B b 1
C c 1
extract stops after matching a1.
In [116]: s.str.extractall(two_groups)
Out[116]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1
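After matching a1, extractall goes on to match a2. The result has a MultiIndex whose second level, match, numbers the matches within each string.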
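Because extractall numbers the matches in a match index level, the usual MultiIndex tools apply to its result. The sketch below selects the first match per string, which lines up with what extract returns, and counts matches per string.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
pattern = r"(?P<letter>[a-z])(?P<digit>[0-9])"

all_matches = s.str.extractall(pattern)   # one row per match
first_only = s.str.extract(pattern)       # one row per string

# Selecting match number 0 gives the first match of every string.
first_matches = all_matches.xs(0, level="match")

# Count how many matches each string produced.
counts = all_matches.groupby(level=0).size()

print(first_matches)
print(counts)
```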
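contains, match, and fullmatch
These methods test whether each element matches a pattern. contains looks for the pattern anywhere in the string, match requires it at the start of the string, and fullmatch requires the whole string to match:
In [126]: pattern = r'[0-9][a-z]'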
In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.contains(pattern)
.....:
Out[127]:
0 False
1 False
2 True
3 True
4 True
5 True
dtype: boolean
In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.match(pattern)
.....:
Out[128]:
0 False
1 False
2 True
3 True
4 False
5 True
dtype: boolean
In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.fullmatch(pattern)
.....:
Out[129]:
0 False
1 False
2 True
3 True
4 False
5 False
dtype: boolean
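Putting the three tests next to each other makes the differences easier to read. This sketch assumes the pattern is a digit followed by a lowercase letter, which is consistent with the outputs above; the na=False call at the end shows one way to treat missing values as non-matches.

```python
import pandas as pd

s = pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string")
pattern = r"[0-9][a-z]"  # assumed pattern: one digit, then one lowercase letter

anywhere = s.str.contains(pattern)   # pattern found anywhere in the string
at_start = s.str.match(pattern)      # pattern found at the start
whole = s.str.fullmatch(pattern)     # the entire string matches

summary = pd.DataFrame(
    {"value": s, "contains": anywhere, "match": at_start, "fullmatch": whole}
)
print(summary)

# Missing values give <NA> by default; na=False turns them into False.
with_missing = pd.Series(["3a", pd.NA], dtype="string")
filled = with_missing.str.contains(pattern, na=False)
```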
Summary of String methods
Finally, here is a summary of the String methods:
Method | Description |
---|---|
cat() | Concatenate strings |
split() | Split strings on delimiter |
rsplit() | Split strings on delimiter working from the end of the string |
get() | Index into each element (retrieve i-th element) |
join() | Join strings in each element of the Series with passed separator |
get_dummies() | Split strings on the delimiter returning DataFrame of dummy variables |
contains() | Return boolean array if each string contains pattern/regex |
replace() | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence |
repeat() | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
pad() | Add whitespace to left, right, or both sides of strings |
center() | Equivalent to str.center |
ljust() | Equivalent to str.ljust |
rjust() | Equivalent to str.rjust |
zfill() | Equivalent to str.zfill |
wrap() | Split long strings into lines with length less than a given width |
slice() | Slice each string in the Series |
slice_replace() | Replace slice in each string with passed value |
count() | Count occurrences of pattern |
startswith() | Equivalent to str.startswith(pat) for each element |
endswith() | Equivalent to str.endswith(pat) for each element |
findall() | Compute list of all occurrences of pattern/regex for each string |
match() | Call re.match on each element, returning a boolean |
extract() | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group |
extractall() | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group |
len() | Compute string lengths |
strip() | Equivalent to str.strip |
rstrip() | Equivalent to str.rstrip |
lstrip() | Equivalent to str.lstrip |
partition() | Equivalent to str.partition |
rpartition() | Equivalent to str.rpartition |
lower() | Equivalent to str.lower |
casefold() | Equivalent to str.casefold |
upper() | Equivalent to str.upper |
find() | Equivalent to str.find |
rfind() | Equivalent to str.rfind |
index() | Equivalent to str.index |
rindex() | Equivalent to str.rindex |
capitalize() | Equivalent to str.capitalize |
swapcase() | Equivalent to str.swapcase |
normalize() | Return Unicode normal form. Equivalent to unicodedata.normalize |
translate() | Equivalent to str.translate |
isalnum() | Equivalent to str.isalnum |
isalpha() | Equivalent to str.isalpha |
isdigit() | Equivalent to str.isdigit |
isspace() | Equivalent to str.isspace |
islower() | Equivalent to str.islower |
isupper() | Equivalent to str.isupper |
istitle() | Equivalent to str.istitle |
isnumeric() | Equivalent to str.isnumeric |
isdecimal() | Equivalent to str.isdecimal |
This article is also available at http://www.flydean.com/06-python-pandas-text/