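Automatically Detecting the Encoding of HTML Documents Fetched by requests in Python

Using the chardet library to automatically detect the encoding of HTML documents fetched by requests

Garbled text when fetching a page with requests and BeautifulSoup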

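Combining requests with the BeautifulSoup library makes it easy to extract data from web pages. However, response.text is decoded with the encoding that requests assumes for the response: the charset declared in the Content-Type header, or ISO-8859-1 for text responses that declare none. When the page's actual encoding differs from that assumption, the result is garbled text.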
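The mismatch described above can be reproduced offline. The following is a minimal sketch, not the actual requests implementation: the helper `charset_from_content_type` is a hypothetical name that only approximates how an HTTP client might derive an encoding from the Content-Type header, and the sample string and the choice of GBK are assumptions made for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Approximate a header-based encoding guess (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        # The server declared a charset explicitly
        return charset
    if msg.get_content_maintype() == "text":
        # Text response without a declared charset: fall back to ISO-8859-1
        return "ISO-8859-1"
    return None


# Assumed sample: a page whose bytes are really GBK, served without a charset
raw = "股票行情".encode("gbk")

assumed = charset_from_content_type("text/html")
garbled = raw.decode(assumed)   # decoded with the wrong encoding
correct = raw.decode("gbk")     # decoded with the real encoding

print(repr(garbled))
print(correct)
```

Decoding with the assumed encoding does not raise an error here, which is why the problem shows up as garbled characters rather than as an exception.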

Take the following code as an example; it retrieves garbled HTML:

import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://finance.sina.com.cn/realstock/company/sh600050/nc.shtml'

# Send an HTTP GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:

    # Get the page content (decoded with the encoding requests assumed)
    html_content = response.text

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # The id to look for
    target_id = 'hqdetails'

    # Find the tag with that id
    element = soup.find(id=target_id)

    if element:
        # Get the HTML content of that tag
        element_html = str(element)
        print(f"HTML content of id {target_id}:\n{element_html}\n")

        # Find all table elements under that tag
        tables = element.find_all('table')

        if tables:
            for i, table in enumerate(tables):
                print(f"HTML content of table {i+1}:\n{table}\n")
        else:
            print(f"No table elements under the tag with id {target_id}")
    else:
        print(f"No tag found with id {target_id}")
else:
    print(f"Request failed, status code: {response.status_code}")

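We can solve this problem by specifying the encoding manually. For example, after the response.status_code == 200 check, set response.encoding = 'utf-8' before reading response.text. Alternatively, pass the raw bytes to the parser together with the encoding: soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8'). Note that from_encoding only takes effect when BeautifulSoup receives bytes; it is ignored for an already decoded string such as response.text.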
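Specifying the encoding by hand requires knowing which value to use. One option, which is a different technique from the library-based detection this article focuses on, is to read the charset declared in the page's own meta tag. The sketch below assumes the declaration appears within the first 1024 bytes; `declared_charset` is a hypothetical helper and the sample page is made up for illustration.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the http-equiv form
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in a meta tag near the start of the page, if any."""
    match = _META_CHARSET.search(raw[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Assumed sample page, stored as GBK bytes
page = '<html><head><meta charset="GBK"><title>行情</title></head></html>'.encode("gbk")

encoding = declared_charset(page)
html = page.decode(encoding)
print(encoding)
print(html)
```

With requests, the returned value could then be assigned to response.encoding before response.text is read, with a default chosen by the caller when the helper returns None.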

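However, when the encoding of the pages we fetch is not known in advance, is there a better way to have the encoding detected automatically? This is where the chardet encoding-detection library comes in handy.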

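Using the chardet library to detect the encoding automatically

chardet is a library for automatically detecting character encodings. It guesses the encoding from the raw bytes of the response body, so it does not depend on the charset declared in the HTTP headers.

Installing chardet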

pip install chardet

Example code

import requests
from bs4 import BeautifulSoup
import chardet

# Target URL
url = 'https://finance.sina.com.cn/realstock/company/sh600050/nc.shtml'

# Send an HTTP GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Detect the character encoding from the raw response bytes
    detected_encoding = chardet.detect(response.content)['encoding']

    # Set the encoding used to decode the response
    response.encoding = detected_encoding

    # Get the page content, now decoded with the detected encoding
    html_content = response.text

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # The id to look for
    target_id = 'hqdetails'

    # Find the tag with that id
    element = soup.find(id=target_id)

    if element:
        # Get the HTML content of that tag
        element_html = str(element)
        print(f"HTML content of id {target_id}:\n{element_html}\n")

        # Find all table elements under that tag
        tables = element.find_all('table')

        if tables:
            for i, table in enumerate(tables):
                print(f"HTML content of table {i+1}:\n{table}\n")
        else:
            print(f"No table elements under the tag with id {target_id}")
    else:
        print(f"No tag found with id {target_id}")
else:
    print(f"Request failed, status code: {response.status_code}")

As shown, the chardet library makes it possible to detect the encoding automatically.
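chardet.detect returns a dictionary that includes a 'confidence' value alongside 'encoding', and 'encoding' can be None when nothing is recognized. One possible refinement is to fall back to a default when the guess is missing or weak. The sketch below is only an illustration: `pick_encoding`, `detect_encoding`, the 0.5 threshold, and the 'utf-8' fallback are all assumptions, not part of chardet.

```python
from typing import Optional


def pick_encoding(detection: dict, fallback: str = "utf-8",
                  min_confidence: float = 0.5) -> str:
    """Return the detected encoding, or the fallback when the guess is missing or weak."""
    encoding: Optional[str] = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback


def detect_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Run chardet on raw bytes and apply the confidence check above."""
    # Third-party dependency (pip install chardet), imported lazily so that
    # pick_encoding stays usable on its own
    import chardet
    return pick_encoding(chardet.detect(raw), fallback=fallback)
```

In the example code, this would replace the direct dictionary lookup: response.encoding = detect_encoding(response.content).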



November 19, 2024 • Python
