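Today I ran into a requirement: clean up some tabular data in which some of the characters are French accented letters, such as àéêö, which look a lot like the tone marks in our own phonetic notation. The goal is to convert these accented letters into plain English (ASCII) letters.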
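For this we need unicodedata, a module in the Python standard library (not to be confused with the third-party unidecode package), as in the code below: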
```python
import unicodedata

def strip_accents(text):
    try:
        # Python 2 only: decode a byte string into a unicode string
        text = unicode(text, 'utf-8')
    except NameError:  # unicode is a default on python 3
        pass

    # NFD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops everything that is not ASCII
    text = unicodedata.normalize('NFD', text)\
        .encode('ascii', 'ignore')\
        .decode('utf-8')

    return str(text)

s = strip_accents('àéêöhello')

print(s)  # aeeohello
```
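One thing to watch out for: encoding to ASCII with 'ignore' drops every non-ASCII character, not just the accents, so a cell containing Chinese text would come back empty and a letter like ß (which has no decomposition) would simply vanish. If the table mixes languages, a gentler variant is to remove only the combining marks. The sketch below is one possible way to do that with the standard library; the function name `strip_marks` and the sample `rows` are made up for illustration and are not part of the original code.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (accents) but keep every other character."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for ordinary characters and a
    # non-zero combining class for marks such as U+0301 (acute accent)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that scripts we did not touch return to their usual form
    return unicodedata.normalize("NFC", kept)

# NFD turns "é" into a base letter followed by a combining accent
decomposed_e = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", "é")]
print(decomposed_e)  # ['U+0065', 'U+0301']

# Hypothetical table rows standing in for the data being cleaned
rows = [["Café crème", "巴黎"], ["Noël", "straße"]]
cleaned = [[strip_marks(cell) for cell in row] for row in rows]
print(cleaned)  # [['Cafe creme', '巴黎'], ['Noel', 'straße']]
```

Here the Chinese cell and the ß survive untouched, while the accented letters are reduced to their base letters. If the output really must be pure ASCII, the encode/decode version above is still the simpler choice.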