Python Encoding Conversion and Chinese Text Handling

python

Published: 2017-08-23


Character handling

Encoding of Python source files

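In Python 2, script files are assumed to be ASCII-encoded by default (Python 3 defaults to UTF-8). When a file contains characters outside the ASCII range, it needs an encoding declaration on the first or second line so the interpreter knows how to decode the module source, for example # -*- coding: utf-8 -*- or # coding=utf-8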
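To check which encoding the interpreter would pick for a given source file, the standard library's tokenize.detect_encoding applies the same declaration rules. A minimal sketch, assuming Python 3; the helper name declared_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python would use to decode this source code."""
    # detect_encoding wants a readline callable that yields bytes
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# A file with an explicit declaration (hypothetical content)
with_cookie = b"# -*- coding: gbk -*-\nx = 1\n"
# A file without any declaration falls back to the Python 3 default
without_cookie = b"x = 1\n"

print(declared_encoding(with_cookie))
print(declared_encoding(without_cookie))
```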

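Python 2 has two string types, str and unicode. A str is a sequence of bytes, while unicode holds decoded text. In Python 3 there is a single text type: str is always Unicode, and raw bytes live in the separate bytes type.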

The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file.

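In other words, when you read the contents of a file in Python 2, the object you get is a str, that is, raw bytes (in Python 3, a file opened in binary mode gives bytes). To convert those bytes to another encoding, first decode them to unicode and then encode the result to the target encoding.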
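The decode-then-encode step can be sketched as follows. This assumes Python 3, and the helper name transcode is hypothetical, not something from a library.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one encoding to another via Unicode text."""
    text = data.decode(src)   # bytes -> str (Unicode)
    return text.encode(dst)   # str -> bytes in the target encoding


gbk_bytes = '中国'.encode('gbk')
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

# Both byte strings represent the same text, only the byte layout differs
print(gbk_bytes, utf8_bytes)
```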

s = u'中国'
s_gb = s.encode('gb2312')
s_gb

b'\xd6\xd0\xb9\xfa'

s = u'中国'
s_utf8 = s.encode('utf-8')
assert s_utf8.decode('utf-8') == s  # assert is "raise if not": it raises AssertionError when the expression is false

# coding=utf-8
s = '中国'
su = u'中国'
s_utf8 = s.encode('utf-8')  # encode() returns bytes, not Unicode text
assert s == su  # holds in Python 3, where both literals are str

s = '中国'
s.encode('gb2312')

b'\xd6\xd0\xb9\xfa'

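# coding=gbk
# The declaration above covers only this source file; test.txt is assumed to be GBK-encoded as well
print(open('test.txt', encoding='gbk').read())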

abc中文
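Relying on the platform's default encoding makes file reads behave differently across machines, so it is usually safer to name the encoding explicitly. A minimal sketch, assuming Python 3; it writes its own temporary test.txt so it can run anywhere, and the GBK choice simply mirrors the example above.

```python
import os
import tempfile

text = 'abc中文'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')

    # Write the text as GBK bytes
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)

    # Binary mode returns the raw bytes exactly as stored on disk
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with an explicit encoding returns decoded str
    with open(path, encoding='gbk') as f:
        decoded = f.read()

print(raw)
print(decoded)
```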

The chardet package makes it easy to detect the encoding of a byte string

import urllib.request
import chardet
url = urllib.request.urlopen('http://www.google.cn/').read()
chardet.detect(url)

{'confidence': 0.99, 'encoding': 'utf-8', 'language': ''}
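When chardet is not installed, a rough stand-in is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is a different and much cruder technique than chardet's statistical detection, and it can guess wrong when bytes happen to be valid in several encodings. The function name guess_encoding and the candidate list are assumptions for illustration.

```python
def guess_encoding(data: bytes, candidates=('utf-8', 'gbk')):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None  # none of the candidates fit


utf8_guess = guess_encoding('中国'.encode('utf-8'))
gbk_guess = guess_encoding('中国'.encode('gbk'))
print(utf8_guess, gbk_guess)
```

UTF-8 goes first in the candidate list because it is strict: most byte strings in other encodings fail to decode as UTF-8, whereas the reverse is not true.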

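Invalid bytes often turn up when converting between encodings. One fix is to pass the 'ignore' error handler, which drops any bytes that cannot be decoded (s must be a byte string here):

s.decode('gbk', 'ignore').encode('utf-8')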
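The choice of error handler decides what happens to the bad bytes. The sketch below, assuming Python 3, compares 'ignore' with 'replace' on GBK data that has one stray byte in the middle; the sample bytes are constructed for illustration.

```python
# GBK bytes for '中国' with an invalid byte (0xff) inserted between the characters
dirty = '中'.encode('gbk') + b'\xff' + '国'.encode('gbk')

# 'ignore' silently drops the bytes that cannot be decoded
ignored = dirty.decode('gbk', 'ignore')

# 'replace' substitutes U+FFFD so the data loss stays visible
replaced = dirty.decode('gbk', 'replace')

# The cleaned text can then be re-encoded as UTF-8
clean_utf8 = ignored.encode('utf-8')

print(ignored, replaced, clean_utf8)
```

'ignore' is convenient but hides data loss, so 'replace' is often a better pick while debugging.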

lovelyfrog

Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit lovelyfrog when reposting!
