On Artificial Intelligence: Python Files and Operating System Basics

The articles and code have been archived in the GitHub repository https://github.com/timerring/dive-into-AI. They can also be obtained by replying "python 数据分析" to AIShareLab.

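Most of the code examples use high-level tools such as pandas.read_csv to read data files from disk into Python data structures. It is still important, however, to understand the basics of working with files in Python.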

To open a file for reading or writing, use the built-in open function with either an absolute or a relative file path:

In [207]: path = 'examples/segismundo.txt'

In [208]: f = open(path)

By default, the file is opened in read-only mode ('r'). We can then treat the file handle f much like a list, for example by iterating over its lines:

for line in f:
    pass

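Lines read from the file keep their end-of-line (EOL) markers intact, so you will often see code like the following, which produces a list of lines without EOL markers: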

In [209]: lines = [x.rstrip() for x in open(path)]

In [210]: lines
Out[210]: 
['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']
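To make the effect of rstrip concrete, here is a minimal sketch that uses an in-memory io.StringIO object in place of the file on disk (the sample text is made up for illustration). It also shows a detail worth knowing: rstrip() with no argument removes all trailing whitespace, not only the EOL marker, whereas rstrip("\n") removes only the newline.

```python
import io

# In-memory stand-in for a text file; the contents are illustrative only.
sample = io.StringIO("first line  \nsecond line\n\nlast line")

raw_lines = list(sample)                          # EOL markers are kept
sample.seek(0)
stripped_all = [x.rstrip() for x in sample]       # strips all trailing whitespace
sample.seek(0)
stripped_eol = [x.rstrip("\n") for x in sample]   # strips only the newline

print(raw_lines)
print(stripped_all)
print(stripped_eol)
```

If trailing spaces carry meaning in your data, the narrower rstrip("\n") is probably the safer choice.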

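When you use open to create a file object, be sure to close it with close when you are finished. Closing the file releases its resources back to the operating system: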

In [211]: f.close()

The with statement makes it easier to clean up open files:

In [212]: with open(path) as f:
   .....:     lines = [x.rstrip() for x in f]

This automatically closes the file when the with block exits.
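The following sketch checks that behavior through the file object's closed attribute. It writes to a temporary directory so it does not depend on examples/segismundo.txt; the file name demo.txt and its contents are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")  # hypothetical scratch file
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("uno\ndos\n")

    with open(demo_path, encoding="utf-8") as f:
        lines = [x.rstrip() for x in f]
        closed_inside = f.closed      # still open inside the block
    closed_after = f.closed           # closed once the block exits

    # The file is also closed when the block exits because of an exception.
    try:
        with open(demo_path, encoding="utf-8") as g:
            raise ValueError("simulated failure")
    except ValueError:
        closed_after_error = g.closed

print(lines, closed_inside, closed_after, closed_after_error)
```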

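If we had typed f = open(path, 'w'), a new file would have been created at examples/segismundo.txt, overwriting any data already at that location. There is also the 'x' file mode, which creates a writable file but fails if the file path already exists. Table 3-3 lists all of the read/write modes.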
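The difference between the 'w' and 'x' modes can be sketched as follows. The example works in a temporary directory, and the file name notes.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "notes.txt")  # hypothetical file name

    # 'x' succeeds because nothing exists at this path yet.
    with open(target, "x", encoding="utf-8") as f:
        f.write("original\n")

    # A second 'x' open is refused because the path now exists.
    try:
        open(target, "x", encoding="utf-8")
    except FileExistsError:
        x_refused = True
    else:
        x_refused = False

    # 'w' truncates the existing file without complaint.
    with open(target, "w", encoding="utf-8") as f:
        f.write("replaced\n")

    with open(target, encoding="utf-8") as f:
        final_text = f.read()

print(x_refused, repr(final_text))
```

Using 'x' is a reasonable default when overwriting existing data would be a mistake.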

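For readable files, some of the most commonly used methods are read, seek, and tell. read returns a given number of characters from the file. What counts as a character is determined by the file's encoding (for example, UTF-8); if the file is opened in binary mode, read returns raw bytes instead: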

In [213]: f = open(path)

In [214]: f.read(10)
Out[214]: 'Sueña el r'

In [215]: f2 = open(path, 'rb')  # Binary mode

In [216]: f2.read(10)
Out[216]: b'Sue\xc3\xb1a el '

The read method advances the file handle's position by the number of bytes read. tell gives you the current position:

In [217]: f.tell()
Out[217]: 11

In [218]: f2.tell()
Out[218]: 10

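Even though we read 10 characters from the file, the position is 11, because it took that many bytes to decode 10 characters using the default encoding. You can check the default encoding with the sys module: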

In [219]: import sys

In [220]: sys.getdefaultencoding()
Out[220]: 'utf-8'
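One caveat, offered as a hedged aside: sys.getdefaultencoding() reports the default used for str and bytes conversions, while the encoding that open picks when none is given is platform dependent. The sketch below compares the two by inspecting the encoding attribute of an open file; the temporary file demo.txt is an assumption made for the example.

```python
import locale
import os
import sys
import tempfile

str_default = sys.getdefaultencoding()               # default for str/bytes conversions
locale_default = locale.getpreferredencoding(False)  # locale-based preference

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")  # hypothetical scratch file

    with open(demo_path, "w") as f:               # no encoding given
        implicit_encoding = f.encoding            # whatever the platform chose

    with open(demo_path, "w", encoding="utf-8") as f:
        explicit_encoding = f.encoding            # what we asked for

print(str_default, locale_default, implicit_encoding, explicit_encoding)
```

Passing encoding explicitly, as in the second open call, keeps the result the same across machines.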

seek changes the file position to the indicated byte in the file:

In [221]: f.seek(3)
Out[221]: 3

In [222]: f.read(1)
Out[222]: 'ñ'

Lastly, close the files:

In [223]: f.close()

In [224]: f2.close()
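The byte arithmetic behind these positions can be reproduced with a short sketch. It writes the first line of the poem to a temporary file (demo.txt is a hypothetical name) and uses binary mode for tell and seek, where positions are plain byte offsets.

```python
import os
import tempfile

text = "Sueña el rico en su riqueza,\n"

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")  # hypothetical scratch file
    with open(demo_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(demo_path, "rb") as fb:
        first_bytes = fb.read(10)    # exactly 10 bytes
        pos_after_read = fb.tell()   # byte offset after the read
        fb.seek(3)                   # 'S', 'u', 'e' occupy bytes 0-2
        enye_bytes = fb.read(2)      # the two bytes that encode 'ñ'

    with open(demo_path, encoding="utf-8") as ft:
        first_chars = ft.read(10)    # 10 characters

# Number of bytes needed to hold those 10 characters in UTF-8.
bytes_needed = len(first_chars.encode("utf-8"))

print(first_bytes, pos_after_read, enye_bytes, first_chars, bytes_needed)
```

Ten characters need 11 bytes here because 'ñ' is encoded as two bytes in UTF-8, which matches the position of 11 reported above.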

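To write text to a file, you can use the file's write or writelines methods. For example, we could create a version of examples/segismundo.txt with no blank lines like so: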

In [225]: with open('tmp.txt', 'w') as handle:
   .....:     handle.writelines(x for x in open(path) if len(x) > 1)

In [226]: with open('tmp.txt') as f:
   .....:     lines = f.readlines()

In [227]: lines
Out[227]: 
['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']
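A variant of the same idea, sketched under a few assumptions: the input lines are made up, the file names are hypothetical, and both files are opened in with blocks so that the reader is closed as well as the writer. Note that writelines writes each string as is and does not add newlines.

```python
import os
import tempfile

source_lines = ["uno\n", "\n", "dos\n", "\n", "tres\n"]  # illustrative data

with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "source.txt")     # hypothetical input file
    dst = os.path.join(tmpdir, "no_blanks.txt")  # hypothetical output file

    with open(src, "w", encoding="utf-8") as f:
        f.writelines(source_lines)   # each string is written unchanged

    with open(src, encoding="utf-8") as reader, \
         open(dst, "w", encoding="utf-8") as writer:
        for line in reader:
            if len(line) > 1:        # a blank line is just "\n"
                writer.write(line)

    with open(dst, encoding="utf-8") as f:
        result = f.readlines()

print(result)
```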

Table 3-4 lists some of the most commonly used file methods.

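The default behavior for Python files is text mode, which means that you work with Python strings (that is, Unicode). This contrasts with binary mode, which you get by appending b to the file mode. Let's look at the file from the previous section, which is UTF-8 encoded and contains non-ASCII characters: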

In [230]: with open(path) as f:
   .....:     chars = f.read(10)

In [231]: chars
Out[231]: 'Sueña el r'

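UTF-8 is a variable-length Unicode encoding, so when I request some number of characters from the file, Python reads enough bytes from the file (possibly as few as 10 or as many as 40) to decode that many characters. If I open the file in 'rb' mode instead, read returns exactly the requested number of bytes: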

In [232]: with open(path, 'rb') as f:
   .....:     data = f.read(10)

In [233]: data
Out[233]: b'Sue\xc3\xb1a el '

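Depending on the text encoding, you may be able to decode the bytes to a str object yourself, but only if each of the encoded Unicode characters is fully formed: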

In [234]: data.decode('utf8')
Out[234]: 'Sueña el '

In [235]: data[:4].decode('utf8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-235-300e0af10bb7> in <module>()
----> 1 data[:4].decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data
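When bytes arrive in chunks that may split a character, one option is the incremental decoder from the standard codecs module, which holds back an incomplete byte sequence until the rest arrives. This is a sketch of that alternative, not something the text above uses; the 10-byte sample is built in memory rather than read from the file.

```python
import codecs

data = "Sueña el ".encode("utf-8")   # 10 bytes; 'ñ' occupies bytes 3 and 4

whole = data.decode("utf-8")         # every character is fully formed

# Cutting after byte 3 leaves only the first half of 'ñ'.
try:
    data[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason

# An incremental decoder buffers the dangling byte instead of failing.
decoder = codecs.getincrementaldecoder("utf-8")()
part1 = decoder.decode(data[:4])               # dangling 0xc3 is held back
part2 = decoder.decode(data[4:], final=True)   # completes 'ñ' and the rest

print(repr(whole), error_reason, repr(part1), repr(part2))
```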

Text mode, combined with the encoding option of open, provides a convenient way to convert from one Unicode encoding to another:

In [236]: sink_path = 'sink.txt'

In [237]: with open(path) as source:
   .....:     with open(sink_path, 'xt', encoding='iso-8859-1') as sink:
   .....:         sink.write(source.read())

In [238]: with open(sink_path, encoding='iso-8859-1') as f:
   .....:     print(f.read(10))
Sueña el r
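To see that the bytes on disk really change, the sketch below repeats the conversion in a temporary directory and compares file sizes. The file names are hypothetical, the encodings are passed explicitly on both sides, and newline="" is used so that line endings are not translated and the sizes are comparable across platforms.

```python
import os
import tempfile

text = "Sueña el rico en su riqueza,\n"

with tempfile.TemporaryDirectory() as tmpdir:
    utf8_path = os.path.join(tmpdir, "source_utf8.txt")    # hypothetical name
    latin1_path = os.path.join(tmpdir, "sink_latin1.txt")  # hypothetical name

    with open(utf8_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Decode as UTF-8 on the way in, encode as ISO-8859-1 on the way out.
    with open(utf8_path, encoding="utf-8", newline="") as source:
        with open(latin1_path, "x", encoding="iso-8859-1", newline="") as sink:
            sink.write(source.read())

    utf8_size = os.path.getsize(utf8_path)
    latin1_size = os.path.getsize(latin1_path)

    with open(latin1_path, encoding="iso-8859-1") as f:
        round_trip = f.read(10)

print(utf8_size, latin1_size, round_trip)
```

The ISO-8859-1 file is one byte smaller because 'ñ' takes two bytes in UTF-8 and one byte in ISO-8859-1.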

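Be careful when using seek on files opened in any mode other than binary. If the file position falls in the middle of the bytes that define a Unicode character, subsequent reads will raise an error: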

In [240]: f = open(path)

In [241]: f.read(5)
Out[241]: 'Sueña'

In [242]: f.seek(4)
Out[242]: 4

In [243]: f.read(1)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-243-7841103e33f5> in <module>()
----> 1 f.read(1)
/miniconda/envs/book-env/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte

In [244]: f.close()
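The sketch below contrasts three cases: seeking to an arbitrary byte offset in text mode, doing the same in binary mode, and seeking in text mode to a position previously returned by tell. It uses a temporary file with a hypothetical name and a shortened line of the poem.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")  # hypothetical scratch file
    with open(demo_path, "w", encoding="utf-8", newline="") as f:
        f.write("Sueña el rico\n")

    # Text mode: byte offset 4 is the second byte of 'ñ', so decoding fails.
    with open(demo_path, encoding="utf-8") as f:
        f.seek(4)
        try:
            f.read(1)
            text_seek_failed = False
        except UnicodeDecodeError:
            text_seek_failed = True

    # Binary mode: any byte offset is valid because nothing is decoded.
    with open(demo_path, "rb") as f:
        f.seek(4)
        tail = f.read()

    # Text mode again, but seeking to a position obtained from tell().
    with open(demo_path, encoding="utf-8") as f:
        f.read(3)          # consume 'Sue'
        mark = f.tell()    # position at a character boundary
        f.read()           # read on to the end
        f.seek(mark)       # return to the saved position
        resumed = f.read(2)

print(text_seek_failed, tail, resumed)
```

In text mode, positions taken from tell are the ones that can be passed back to seek safely; hand-computed byte offsets belong in binary mode.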

If you regularly do data analysis on non-ASCII text data, mastering Python's Unicode functionality is very important. For more, see the official Python documentation.
