关于python:翻译实用的Python编程0104Strings

目录 | 上一节 (1.3 数字) | 下一节 (1.5 列表)

本节介绍解决文本的办法。

在程序中字符串字面量应用引号来书写。

# 单引号（Single quote）a = 'Yeah but no but yeah but...'

# 双引号（Double quote）b = "computer says no"

# 三引号（Triple quotes）c = '''
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes,
don't look around the eyes,
look into my eyes, you're under.'''

通常，字符串只能占一行。三引号捕捉在引号完结之前的所有文本，蕴含所有的格局。

应用单引号（’）和双引号（“）没有区别。然而，以什么样的引号开始字符串，必须以什么样的引号完结字符串。

本义码被用于示意控制字符和不能轻易在键盘上输出的字符。以下是一些常见的本义码：

'\n'      换行符（Line feed）'\r'      回车符（Carriage return）'\t'      制表符（Tab）'\''      字面量单引号（Literal single quote）'\"'      字面量双引号（Literal double quote）'\\'      字面量反斜杠（Literal backslash）

字符串中的每个字符在外部被存储为所谓的 Unicode“代码点（code-point）”，代码点是一个整数。能够应用下列转移序列指定确切的代码点。

a = '\xf1'          # a = 'ñ'
b = '\u2200'        # b = '∀'
c = '\U0001D122'    # c = '????'
d = '\N{FOR ALL}'   # d = '∀'

所有可用的字符码请参考 Unicode 字符数据库。

能够像拜访数组那样拜访字符串的单个字符。你能够应用从 0 开始的整数索引，负索引指定绝对于字符串尾部的地位。

a = 'Hello world'
b = a[0]          # 'H'
c = a[4]          # 'o'
d = a[-1]         # 'd' (end of string)

你也能够指定一个索引范畴来切割或者抉择子串：

d = a[:5]     # 'Hello'
e = a[6:]     # 'world'
f = a[3:8]    # 'lo wo'
g = a[-5:]    # 'world'

不包含结尾索引处的字符。缺失的索引假设为字符串的开始或者结尾。

字符串的操作包含：拼接，长度计算，成员判断和复制。

# Concatenation (+)
a = 'Hello' + 'World'   # 'HelloWorld'
b = 'Say' + a          # 'Say HelloWorld'

# Length (len)
s = 'Hello'
len(s)                  # 5

# Membership test (`in`, `not in`)
t = 'e' in s            # True
f = 'x' in s            # False
g = 'hi' not in s       # True

# Replication (s * n)
rep = s * 5             # 'HelloHelloHelloHelloHello'

字符串具备对数据执行各种操作的办法。

示例：删除结尾或者结尾处的任何空白。

s = 'Hello'
t = s.strip()     # 'Hello'

示例：大小写转换。

s = 'Hello'
l = s.lower()     # 'hello'
u = s.upper()     # 'HELLO'

示例：文本替换。

s = 'Hello world'
t = s.replace('Hello' , 'Hallo')   # 'Hallo world'

更多字符串办法:

字符串具备各种各样的办法用于测试和解决文本数据。

上面是字符串办法的一小部分示例：

s.endswith(suffix)     # Check if string ends with suffix
s.find(t)              # First occurrence of t in s
s.index(t)             # First occurrence of t in s
s.isalpha()            # Check if characters are alphabetic
s.isdigit()            # Check if characters are numeric
s.islower()            # Check if characters are lower-case
s.isupper()            # Check if characters are upper-case
s.join(slist)          # Join a list of strings using s as delimiter
s.lower()              # Convert to lower case
s.replace(old,new)     # Replace text
s.rfind(t)             # Search for t from end of string
s.rindex(t)            # Search for t from end of string
s.split([delim])       # Split string into list of substrings
s.startswith(prefix)   # Check if string starts with prefix
s.strip()              # Strip leading/trailing space
s.upper()              # Convert to upper case

字符串是“不可变的”或者说是只读的。一旦创立，字符串的值就无奈批改。

>>> s = 'Hello World'
>>> s[1] = 'a'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

所有解决字符串数据的操作和办法始终会创立一个新的字符串。

应用 str() 函数能够将任何值转换为字符串。str() 函数失去的后果是一个字符串，该字符串蕴含的文本与 print() 语句产生的文本雷同。

>>> x = 42
>>> str(x)
'42'
>>>

通常，在底层 I/O 中会遇到 8 位字节的字符串（译注：字节字符串），它是这样写的：

data = b'Hello World\r\n'

通过把小写的 b 放到第一个引号之前来指定一个字节字符串而不是文本字符串（ 译注：在字符串后面加 b 示意这是应用 ASCII 编码的字节字符串 ）。大部分罕用的文本字符串操作可利用于字节字符串。

len(data)                         # 13
data[0:5]                         # b'Hello'
data.replace(b'Hello', b'Cruel')  # b'Cruel World\r\n'

字节字符串索引有点不同，因为它返回的是整数模式的字节值：

data[0]   # 72 (ASCII code for 'H')

字节字符串与文本字符串之间的转换：

text = data.decode('utf-8') # bytes -> text
data = text.encode('utf-8') # text -> bytes

'utf-8' 这个参数指定了字符的编码方式。其它常见的编码方式有 'ascii' 和 'latin1'。

原始字符串是未解释的带有反斜杠的字符串字面量。通过在原始引号之前增加“r”前缀来指定。

>>> rs = r'c:\newdata\test' # Raw (uninterpreted backslash)
>>> rs
'c:\\newdata\\test'

输入的字符串是蕴含在引号外面的字面量文本，与输出的文本完全相同。这在反斜杠有非凡意义的状况下很有用。例如：文件名、正则表达式等。

具备格式化表达式替换的字符串。

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> a = f'{name:>10s} {shares:10d} {price:10.2f}'
>>> a
'IBM        100      91.10'
>>> b = f'Cost = ${shares*price:0.2f}'
>>> b
'Cost = $9110.00'
>>>

留神: 这要求 Python 3.6 或者更新的版本. 格式化代码的含意稍后介绍。

在这些习题中，你将尝试对 Python 字符串类型进行操作。你应该在 Python 交互提示符下操作，在该提示符下能够轻松地查看到后果。重要提醒：

在应该与解释器进行交互的习题中，
>>> 当 Python 心愿你输出一个新的语句，你将取得一个解释器提示符。习题中某些语句会逾越多行——要使这些语句执行，你可能须要多按几次回车键。揭示你，在做这些示例时，请勿输出 >>> 提示符。

通过定义一个蕴含一系列股票代号的字符串开始吧。字符串如下所示：

>>> symbols = 'AAPL,IBM,MSFT,YHOO,SCO'
>>>

字符串是字符数组。尝试提取一些字符：

>>> symbols[0]
?
>>> symbols[1]
?
>>> symbols[2]
?
>>> symbols[-1]        # Last character
?
>>> symbols[-2]        # Negative indices are from end of string
?
>>>

在 Python 语言中，字符串是只读的。

尝试通过将 symbols 字符串的第一个字符变为小写字母‘a’来验证这一点。

>>> symbols[0] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

只管字符串数据是只读的，然而你始终能够将变量重新分配给新创建的字符串。

尝试上面的语句，该语句将一个新的股票代码“GOOG”拼接到 symbols 字符串的开端。

>>> symbols = symbols + 'GOOG'
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCOGOOG'
>>>

蹩脚！这不是咱们想要的。批改它使得变量 symbols 保留的值为 'AAPL,IBM,MSFT,YHOO,SCO,GOOG'。

>>> symbols = ?
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

把 'HPQ' 增加到 symbols 字符串的后面：

>>> symbols = ?
>>> symbols
'HPQ,AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

在这些示例中，外表上看起来原始字符串像正在被批改，显著违反了字符串是只读的。实际上不是这样的。每次，这些操作都会创立一个全新的字符串。当变量名 symbols 被重新分配，它会指向一个新创建的字符串。而后，旧的字符串被销毁，因为它不再被应用了。

尝试应用 in 操作符查看子串。请在交互提示符下尝试这些操作。

>>> 'IBM' in symbols
?
>>> 'AA' in symbols
True
>>> 'CAT' in symbols
?
>>>

为什么查看 AA 的时候返回 True ?

在 Python 交互提示符下，尝试一些新的字符串办法。

>>> symbols.lower()
?
>>> symbols
?
>>>

请记住，字符串始终是只读的。如果你想要保留操作的后果，你须要把它搁置到一个变量中。

>>> lowersyms = symbols.lower()
>>>

请尝试更多的操作：

>>> symbols.find('MSFT')
?
>>> symbols[13:17]
?
>>> symbols = symbols.replace('SCO','DOA')
>>> symbols
?
>>> name = 'IBM   \n'
>>> name = name.strip()    # Remove surrounding whitespace
>>> name
?
>>>

有时你想创立一个字符串并把其它变量的值嵌入到其中。

要做到这点，能够应用 f-strings。示例：

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> f'{shares} shares of {name} at ${price:0.2f}'
'100 shares of IBM at $91.10'
>>>

从练习 1.10 中批改 mortgage.py 程序来应用 f-strings 创立它的输入。

尝试实现它，使得输入可能很好地对齐。

根本字符串操作的一个局限性在于它们不反对任何类型的高级模式匹配。为此，你须要应用 Python 的 re 模块和正则表达式。正则表达式解决是一个大的主题，这里只是一个简短的示例：

>>> text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
>>> # Find all occurrences of a date
>>> import re
>>> re.findall(r'\d+/\d+/\d+', text)
['3/27/2018', '3/28/2018']
>>> # Replace all occurrences of a date with replacement text
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2018-3-27. Tomorrow is 2018-3-28.'
>>>

无关 re 模块的更多信息，请查看官网文档：https://docs.python.org/library/re.html。

当你开始尝试应用解释器时，你总是心愿理解更多无关不同对象所反对的操作。例如，如何找出哪些操作是对字符串是无效的？

依据你的 Python 环境，你可能能够通过 tab 键补全来查看可用办法的列表。例如，尝试输出上面的代码：

>>> s = 'hello world'
>>> s.<tab key>
>>>

如果单击 tab 键没有任何作用，你能够应用 Python 的内建函数 dir()。示例：

>>> s = 'hello'
>>> dir(s)
['__add__', '__class__', '__contains__', ..., 'find', 'format',
'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit',
'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']
>>>

dir() 函数生成一个在 (.) 后呈现的所有操作的列表。

应用 help() 函数能够获取无关特定操作的更多信息。

>>> help(s.upper)
Help on built-in function upper:

upper(...)
    S.upper() -> string

    Return a copy of the string S converted to uppercase.
>>>

目录 | 上一节 (1.3 数字) | 下一节 (1.5 列表)

注：残缺翻译见 https://github.com/codists/practical-python-zh

关于python:翻译实用的Python编程0104Strings

1.4 字符串

示意字面量文本

字符串本义码

字符串示意

字符串索引

字符串操作

字符串的办法

字符串的可变性

字符串转换

字节字符串

原始字符串

f-Strings

练习

练习 1.13：提取单个字符和子串

练习 1.14：字符串拼接

练习 1.15：成员测试（子串测试）

练习 1.16：字符串办法

练习 1.17：f-strings

练习 1.18：正则表达式

阐明