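Solutions for UnicodeEncodeError in Python

Technical Background

When processing text scraped from different web pages in Python, you will often run into UnicodeEncodeError. In Python 2 this usually happens because implicit conversions use the ASCII codec by default, so any character outside the ASCII range (0 to 127) raises the error. In Python 3 the same error appears when an output stream or file uses a codec that cannot represent the character. For example, when parsing pages with BeautifulSoup, different pages may use different encodings, and some contain non-ASCII characters such as u'\xa0' (a non-breaking space), which triggers the encoding error.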

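Implementation Steps

1. Avoid using str() for encoding conversion

In Python 2, do not call str() directly to convert a unicode string to a byte string: str() encodes with the default ASCII codec and raises UnicodeEncodeError on any non-ASCII character. Use the .encode() method and name the encoding explicitly.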

# Incorrect example
# agent_contact = u'contact'
# agent_telno = u'\xa0123456'
# p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

# Correct example
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
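
The difference between the implicit ASCII codec and an explicit UTF-8 encode can be checked directly. The following is a minimal sketch that assumes Python 3 (where every str is already Unicode, so the u prefix is optional) and reuses the variable names from the example above; the helper names ascii_failed and encoded are illustrative only.

```python
# Assumes Python 3; values mirror the example above.
agent_contact = 'contact'
agent_telno = '\xa0123456'

combined = ' '.join((agent_contact, agent_telno))

# Encoding with ASCII fails because U+00A0 lies outside the 0-127 range.
try:
    combined.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming UTF-8 explicitly succeeds; U+00A0 becomes the two bytes C2 A0.
encoded = combined.encode('utf-8')

print(ascii_failed)
print(encoded)
```

The result of .encode() is a bytes object, so it should only be produced at the point where the data leaves the program.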

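2. Set an environment variable

Set the PYTHONIOENCODING environment variable to specify the encoding Python uses for standard input and output (stdin, stdout, stderr). The variable name is case-sensitive and must be uppercase. Set it in the shell before running the script:

export PYTHONIOENCODING=utf8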
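
To see the effect of the variable without changing the current shell, one option is to launch a child interpreter with it set. This is a sketch assuming Python 3.7 or later; child_env and reported_encoding are illustrative names, not part of any API.

```python
import os
import subprocess
import sys

# Copy the current environment and add PYTHONIOENCODING for the child only.
child_env = dict(os.environ, PYTHONIOENCODING='utf-8')

# The child prints the encoding its sys.stdout was configured with.
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
reported_encoding = result.stdout.strip()
print(reported_encoding)
```

The variable is read once at interpreter startup, so changing os.environ inside an already running script does not alter that script's own sys.stdout.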

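3. Ignore or replace non-ASCII characters

If you do not need the non-ASCII characters, you can drop them. Calling encode('ascii', 'ignore') discards every character that ASCII cannot represent: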

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

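If you want to keep a placeholder for each character that could not be encoded, use the 'replace' option, which substitutes a question mark (?):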

yourstring = yourstring.encode('ascii', 'replace').decode('ascii')
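
The error handlers can be compared side by side. The sketch below assumes Python 3 and, beyond the 'ignore' and 'replace' handlers named above, also shows two other standard handlers, 'xmlcharrefreplace' and 'backslashreplace', which keep the original code point recoverable.

```python
text = 'city: malmö'

# Drop characters that ASCII cannot represent.
ignored = text.encode('ascii', 'ignore').decode('ascii')

# Substitute '?' for each unencodable character.
replaced = text.encode('ascii', 'replace').decode('ascii')

# Substitute an XML/HTML numeric character reference.
xml_refs = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# Substitute a Python-style backslash escape.
escaped = text.encode('ascii', 'backslashreplace').decode('ascii')

for label, value in [('ignore', ignored), ('replace', replaced),
                     ('xmlcharrefreplace', xml_refs),
                     ('backslashreplace', escaped)]:
    print(label, '->', value)
```

'ignore' and 'replace' lose information permanently, so they fit best where the output is only for display or logging.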

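4. Check and set the system locale variables

Make sure the system's locale environment variables are set correctly, especially LANG and LC_ALL. These names are case-sensitive. Assuming the en_US UTF-8 locale is installed on the system, you can set it with the following command:

export LC_ALL='en_US.utf8'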
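
Before changing the locale, it can help to check which encodings Python has actually picked up from the environment. A minimal sketch, assuming Python 3; the reported values depend entirely on the machine it runs on.

```python
import locale
import sys

# Encoding Python derives from the locale settings (LANG, LC_ALL, ...).
# Passing False avoids temporarily changing the process locale.
preferred = locale.getpreferredencoding(False)

# Encoding currently attached to standard output; it can differ from the
# locale encoding, for example when PYTHONIOENCODING is set.
stdout_encoding = sys.stdout.encoding

print('preferred encoding:', preferred)
print('stdout encoding:', stdout_encoding)
```

If either value is an ASCII-only codec, printing non-ASCII text is likely to fail until the locale or PYTHONIOENCODING is adjusted.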

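5. Specify the file encoding

Specify the encoding when opening a file (in Python 2, use io.open, which accepts the same encoding argument):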

open(foo, encoding='utf-8')
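
A round trip through a temporary file shows what the encoding argument does on both the write and the read side. This sketch assumes Python 3; the file name sample.txt and the sample text are made up for illustration.

```python
import os
import tempfile

text = 'city: malmö\xa0123456'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')

    # Write: str is encoded to UTF-8 bytes on the way out.
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

    # Read: UTF-8 bytes are decoded back to str on the way in.
    with open(path, encoding='utf-8') as handle:
        round_trip = handle.read()

    # Binary mode shows the bytes that were actually stored.
    with open(path, 'rb') as handle:
        raw = handle.read()

print(round_trip == text)
print(len(text), len(raw))
```

Without the encoding argument, open() falls back to a platform-dependent default, which is why the same script can succeed on one machine and fail on another.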

Core Code

Example 1: Using the .encode() method

agent_contact = u'contact'
agent_telno = u'\xa0123456'
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

Example 2: Ignoring non-ASCII characters

yourstring = u'city: malmö'
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
print(yourstring)  # Output: city: malm

Example 3: Setting the environment variable

export PYTHONIOENCODING=utf8
python your_script.py

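Best Practices

Follow the "Unicode sandwich" principle: decode incoming data to Unicode strings as early as possible, work only with Unicode strings inside the program, and encode to the required byte strings only at output.
Specify encodings explicitly: when opening files, making network requests, and so on, name the encoding instead of relying on a default such as ASCII.
Handle exceptions: wrap encoding conversions in a try-except statement that catches UnicodeEncodeError and handles it appropriately.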
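
The practices above can be combined in a small sketch. It assumes Python 3; normalize and encode_for_output are hypothetical helper names introduced here for illustration, and falling back to the 'replace' handler is only one possible way to handle the exception.

```python
def normalize(raw_bytes, source_encoding='utf-8'):
    """Decode at the input boundary, then work only with str."""
    text = raw_bytes.decode(source_encoding)
    # Example processing step: collapse runs of whitespace.
    return ' '.join(text.split())


def encode_for_output(text, target_encoding):
    """Encode at the output boundary, with a lossy fallback."""
    try:
        return text.encode(target_encoding)
    except UnicodeEncodeError:
        # Assumed policy: substitute '?' rather than abort.
        return text.encode(target_encoding, errors='replace')


raw = b'city:   malm\xc3\xb6\n'   # UTF-8 bytes, e.g. from a web page
clean = normalize(raw)

utf8_out = encode_for_output(clean, 'utf-8')
ascii_out = encode_for_output(clean, 'ascii')

print(clean)
print(utf8_out)
print(ascii_out)
```

Keeping the decode and encode calls at the edges means the processing code in the middle never has to know which encoding the data arrived in.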