• Unicode 字符表:https://en.wikibooks.org/wiki...
  • \xa0 是 NO-Break Space,不间断空格
  • \xad 是 Soft Hyphen,软连接符,常被显示为短横或者空格

可打印字符

>>> '你好'.isprintable()True>>> '\x41'.isprintable()True>>> '\xa0'.isprintable()False>>> '\xad'.isprintable()False

UTF8

>>> import codecs>>> utf8_decoder = codecs.getincrementaldecoder('utf8')()>>> utf8_decoder.decode(b'hello')'hello'>>> utf8_decoder.decode(b'\x41')'A'>>> utf8_decoder.decode(b'\xa0')UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

regex

>>> regex.findall(r"[\p{L}]", '水_A_\x41_\xa0_\xad_0 地')['水', 'A', 'A', '地']>>> regex.findall(r"[\p{Z}]", '水_A_\x41_\xa0_\xad_0 地')['\xa0', ' ']>>>regex.findall(r"[\p{C}]", '水_A_\x41_\xa0_\xad_0 地')['\xad']

pandahouse

  • pandahouse 解决 \xad 之类的非常规字符会有问题
本文出自 qbit snap