A Guide to Handling Network Requests and URLs with Python's urllib Module

July 16, 2025 • Python

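Introduction

In Python, urllib is a standard-library package for working with URLs (Uniform Resource Locators): sending HTTP requests, parsing URLs, handling query parameters, and URL-encoding strings. It is made up of several submodules that cover basic to moderately advanced networking needs, and it suits crawlers, API calls, file downloads, and similar tasks. urllib is capable, but for complex work many developers prefer a third-party library such as requests.

The rest of this article covers the urllib submodules, their features and usage, examples, application scenarios, best practices, and caveats.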

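1. Overview of the urllib Module

urllib is part of the Python standard library (no separate installation) and is used mainly for network requests and URL manipulation. It consists of four submodules:

urllib.request: sends HTTP/HTTPS requests and retrieves network resources.
urllib.error: defines the exceptions raised by urllib.request (such as HTTP errors and URL errors).
urllib.parse: parses and manipulates URLs (splitting them, encoding query parameters).
urllib.robotparser: parses robots.txt files to check what a crawler may access.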

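1.1 Key Features

Standard library: nothing to install; well suited to lightweight networking tasks.
Broad coverage: HTTP/HTTPS requests, URL parsing, query-parameter encoding, and robots.txt checks.
Cross-platform: behaves consistently on Linux, macOS, and Windows.
Foundational: a good fit for simple cases; for complex tasks it can be combined with requests or aiohttp.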

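1.2 Installation

urllib ships with Python, so there is nothing to install. The four-submodule layout described here is the Python 3.x one; Python 2.7 spread the same functionality across separate modules such as urllib and urllib2. This article uses Python 3.9+.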

1.3 Importing

import urllib.request
import urllib.error
import urllib.parse
import urllib.robotparser

2. urllib Submodules and Their Features

This section covers the four urllib submodules and their core functionality.

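2.1 urllib.request

Sends HTTP/HTTPS requests to fetch web pages, download files, and so on.

Core functions

urllib.request.urlopen(url, data=None, timeout=...): opens a URL and returns a response object.
- url: a URL string or a Request object.
- data: the request body for a POST request (must be bytes).
- timeout: timeout in seconds; if omitted, the global default socket timeout applies.
urllib.request.Request(url, data=None, headers={}, method=None): builds a customized request object, with support for extra headers.
urllib.request.urlretrieve(url, filename=None): downloads the content at a URL to a local file.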
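
The parameters above can be explored without sending anything. The following is a minimal sketch that only constructs Request objects and inspects them; the URL and the User-Agent value are placeholders chosen for illustration. It relies on the documented rule that the method defaults to POST when data is supplied and GET otherwise, and on CPython storing header names in capitalized form (so "User-Agent" is stored as "User-agent").

```python
import urllib.request

# Placeholder URL for illustration; nothing is sent over the network here.
URL = "https://example.com/api"

# No data: the request defaults to GET.
get_req = urllib.request.Request(URL, headers={"User-Agent": "demo-client/0.1"})

# With data (bytes): the request defaults to POST.
post_req = urllib.request.Request(URL, data=b"name=alice")

print(get_req.get_method())              # GET
print(post_req.get_method())             # POST
print(get_req.get_header("User-agent"))  # demo-client/0.1
print(get_req.full_url)                  # https://example.com/api
```

Either object can then be passed to urlopen in place of a URL string.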

Example (simple GET request)

import urllib.request

# Send a GET request
with urllib.request.urlopen("https://api.github.com") as response:
    content = response.read().decode("utf-8")
    print(content[:100])  # Prints the first 100 characters of the GitHub API response (JSON)

Example (POST request)

import urllib.request
import urllib.parse

# Prepare the POST data: URL-encode the form fields, then encode to bytes
data = urllib.parse.urlencode({"name": "alice", "age": 30}).encode("utf-8")
req = urllib.request.Request("https://httpbin.org/post", data=data, method="POST")

with urllib.request.urlopen(req) as response:
    print(response.read().decode("utf-8"))  # Prints the server's echo of the POST data
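
Many APIs expect a JSON body rather than form-encoded fields. As a sketch under that assumption, the hypothetical helper json_request below builds (but does not send) such a request. The helper name and the payload are illustrative and not part of urllib.

```python
import json
import urllib.request


def json_request(url, payload):
    """Build a POST request whose body is payload serialized as JSON."""
    body = json.dumps(payload).encode("utf-8")  # the body must be bytes
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = json_request("https://httpbin.org/post", {"name": "alice", "age": 30})
print(req.get_method())  # POST
print(req.data)          # b'{"name": "alice", "age": 30}'
```

The resulting object is sent the same way as the form-encoded request, by passing it to urllib.request.urlopen.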

Example (downloading a file)

import urllib.request

urllib.request.urlretrieve("https://example.com/image.jpg", "image.jpg")
print("file downloaded")

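2.2 urllib.error

Defines the exceptions raised during network requests.

Common exceptions

URLError: a URL-level failure (for example, the connection failed or the domain name could not be resolved).
HTTPError: the server returned an HTTP error status (such as 404 or 500); it is a subclass of URLError.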

Example (exception handling)

import urllib.request
import urllib.error

try:
    with urllib.request.urlopen("https://example.com/nonexistent") as response:
        print(response.read().decode("utf-8"))
except urllib.error.HTTPError as e:
    print(f"http error: {e.code} - {e.reason}")  # e.g. http error: 404 - Not Found
except urllib.error.URLError as e:
    print(f"url error: {e.reason}")  # Prints the reason for the URL-level failure
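
The order of the except clauses matters because of the subclass relationship. The sketch below shows this without network access by constructing the exceptions by hand; describe is a hypothetical helper, and the URL and messages are made up for illustration.

```python
import io
import urllib.error


def describe(exc):
    """Return a short label for an exception raised by urlopen."""
    # Check HTTPError first: it is a subclass of URLError, so testing
    # URLError first would also match HTTP errors.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http {exc.code}"
    if isinstance(exc, urllib.error.URLError):
        return f"url error: {exc.reason}"
    return "other"


# Hand-built exceptions, so the sketch runs offline.
not_found = urllib.error.HTTPError(
    "https://example.com/nonexistent", 404, "Not Found", None, io.BytesIO()
)
unreachable = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe(not_found))    # http 404
print(describe(unreachable))  # url error: connection refused
```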

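2.3 urllib.parse

Parses, builds, and encodes URLs.

Core functions

urllib.parse.urlparse(url): splits a URL into components (scheme, host, path, and so on).
urllib.parse.urlunparse(components): builds a URL from its components.
urllib.parse.urlencode(query): encodes a dict as a query string.
urllib.parse.quote(string): percent-encodes a string for use in a URL.
urllib.parse.unquote(string): decodes a percent-encoded string.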

Example (parsing a URL)

import urllib.parse

url = "https://example.com/path?name=alice&age=30#section"
parsed = urllib.parse.urlparse(url)
print(parsed)
# Output: ParseResult(scheme='https', netloc='example.com', path='/path', params='', query='name=alice&age=30', fragment='section')
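
Beyond printing the parsed result, individual components can be read, the query string can be decoded, and a modified URL can be reassembled. This sketch reuses the same URL; the replacement query value "name=bob" is an arbitrary illustration.

```python
import urllib.parse

url = "https://example.com/path?name=alice&age=30#section"
parsed = urllib.parse.urlparse(url)

# The result is a named tuple, so components are available as attributes.
print(parsed.scheme)    # https
print(parsed.hostname)  # example.com
print(parsed.path)      # /path

# parse_qs turns the query string into a dict of lists (a key may repeat).
params = urllib.parse.parse_qs(parsed.query)
print(params)  # {'name': ['alice'], 'age': ['30']}

# _replace returns a modified copy; urlunparse reassembles the URL.
rebuilt = urllib.parse.urlunparse(parsed._replace(query="name=bob", fragment=""))
print(rebuilt)  # https://example.com/path?name=bob
```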

Example (building a query string)

import urllib.parse

query = {"name": "alice", "age": 30}
encoded = urllib.parse.urlencode(query)
print(encoded)  # Output: name=alice&age=30

# Build the full URL
url = f"https://example.com?{encoded}"
print(url)  # Output: https://example.com?name=alice&age=30
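
Two details of urlencode are easy to miss: how special characters are escaped, and how list values are handled. The sketch below illustrates both; the parameter names and values are made up.

```python
import urllib.parse

# Spaces become "+" and reserved characters are percent-encoded.
simple = urllib.parse.urlencode({"q": "python tutorial", "lang": "c++"})
print(simple)  # q=python+tutorial&lang=c%2B%2B

# With doseq=True, each item of a sequence becomes its own key=value pair.
repeated = urllib.parse.urlencode({"tag": ["web", "http"]}, doseq=True)
print(repeated)  # tag=web&tag=http

# Without doseq, the list is converted with str() and encoded as one value.
flattened = urllib.parse.urlencode({"tag": ["web", "http"]})
print(flattened)  # tag=%5B%27web%27%2C+%27http%27%5D
```

The last form is rarely what a server expects, so doseq=True is usually the right choice when a parameter can repeat.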

Example (URL encoding)

import urllib.parse

path = "path with spaces"
encoded = urllib.parse.quote(path)
print(encoded)  # Output: path%20with%20spaces
print(urllib.parse.unquote(encoded))  # Output: path with spaces
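
quote has a few variations worth knowing: the safe parameter, the query-string-oriented quote_plus, and the handling of non-ASCII text. The inputs below are illustrative.

```python
import urllib.parse

# quote leaves "/" unescaped by default, which suits whole paths.
as_path = urllib.parse.quote("docs/my file.txt")
print(as_path)  # docs/my%20file.txt

# safe="" escapes "/" as well, for a value that must stay a single segment.
as_segment = urllib.parse.quote("docs/my file.txt", safe="")
print(as_segment)  # docs%2Fmy%20file.txt

# quote_plus targets query strings: spaces become "+".
as_query = urllib.parse.quote_plus("a b&c")
print(as_query)  # a+b%26c

# Non-ASCII text is encoded as UTF-8 before percent-encoding.
non_ascii = urllib.parse.quote("中文")
print(non_ascii)                        # %E4%B8%AD%E6%96%87
print(urllib.parse.unquote(non_ascii))  # 中文
```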

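2.4 urllib.robotparser

Parses a site's robots.txt file to check whether a crawler is allowed to access a given URL.

Core functions

RobotFileParser: the class that fetches and parses a robots.txt file.
can_fetch(useragent, url): returns whether the given user agent is allowed to fetch the URL.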

Example

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/allowed"))  # Output: True or False
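
To see how rules are evaluated without fetching anything, the parser can be fed robots.txt lines directly through its parse method. The rules and URLs below are hypothetical and only illustrate the matching behavior.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts lines directly instead of fetching a URL

blocked = rp.can_fetch("mybot", "https://example.com/private/data")
allowed = rp.can_fetch("mybot", "https://example.com/public")
print(blocked)  # False
print(allowed)  # True
```

This is also a convenient way to unit-test crawler logic against a known rule set.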

3. Practical Use Cases

3.1 Web Scraping

Fetch page content with urllib.request and build the URL with urllib.parse.

Example:

import urllib.request
import urllib.parse

base_url = "https://httpbin.org/get"
params = urllib.parse.urlencode({"q": "python"})
url = f"{base_url}?{params}"

with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # Prints the JSON response

3.2 Calling an API

Send GET or POST requests to call a REST API.

Example (calling the GitHub API):

import urllib.request
import json

req = urllib.request.Request(
    "https://api.github.com/users/octocat",
    headers={"Accept": "application/json"}
)
with urllib.request.urlopen(req) as response:
    data = json.loads(response.read().decode("utf-8"))
    print(data["login"])  # Output: octocat
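
When several endpoints are called, the request-and-decode steps above can be wrapped in a small function. The following is one possible sketch; fetch_json is a hypothetical helper name, and the 10-second default timeout is an arbitrary choice.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET url and return the decoded JSON body."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # json.loads accepts bytes and detects UTF-8/16/32 on its own.
        return json.loads(response.read())
```

Calling fetch_json("https://api.github.com/users/octocat") performs the same request as the example above and returns the parsed dict.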

3.3 Downloading Files

Use urlretrieve to download a file.

Example:

import urllib.request

urllib.request.urlretrieve("https://www.python.org/static/img/python-logo.png", "python_logo.png")
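
The Python documentation lists urlretrieve under its legacy interface, so a streaming alternative built on urlopen may be preferable in new code. The sketch below is one way to do it; download is a hypothetical helper, and writing to a temporary file before renaming is a design choice of this example rather than something urllib requires.

```python
import os
import shutil
import tempfile
import urllib.request


def download(url, dest, timeout=10):
    """Stream url to dest and return the number of bytes written."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Write to a temporary file in the same directory first, so a failed
        # transfer never leaves a truncated file under the final name.
        with tempfile.NamedTemporaryFile(dir=dest_dir, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)  # copies in chunks
            tmp_path = tmp.name
    os.replace(tmp_path, dest)
    return os.path.getsize(dest)
```

Because copyfileobj copies in chunks, large files are not held in memory all at once.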

3.4 Checking Crawler Permissions

Use urllib.robotparser to make sure a crawler respects a site's rules.

Example:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://python.org/robots.txt")
rp.read()
print(rp.can_fetch("mybot", "/dev"))  # Check whether crawling is allowed

4. Best Practices

Always handle exceptions:

Use try-except to catch HTTPError and URLError.
Example:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("https://invalid-url")
except urllib.error.URLError as e:
    print(f"failed: {e}")

Use context managers:

Use a with statement so the response object is always closed properly.
Example:

with urllib.request.urlopen("https://example.com") as response:
    content = response.read()

Set request headers:

Add a User-Agent and any other headers the server expects, to avoid having the request rejected.
Example:

req = urllib.request.Request(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0"}
)

Parameterize URLs:

Build query parameters with urllib.parse.urlencode.
Example:

params = urllib.parse.urlencode({"q": "python tutorial"})
url = f"https://example.com/search?{params}"

Test network code:

Use pytest to test request and parsing logic, with unittest.mock standing in for real responses.
Example:

import urllib.request
from unittest.mock import patch

def test_urlopen():
    with patch("urllib.request.urlopen") as mocked:
        # Configure what the mocked context manager yields
        mocked.return_value.__enter__.return_value.read.return_value = b"mocked data"
        with urllib.request.urlopen("https://example.com") as response:
            assert response.read() == b"mocked data"

Consider using requests:

For complex tasks (such as session management or JSON handling), consider the third-party requests library.
Example:

import requests
response = requests.get("https://api.github.com")
print(response.json())

5. Caveats

Version differences:

In Python 3.x, urllib is a package split into submodules; Python 2's urllib and urllib2 were merged into it.
Example (Python 2 only, shown for comparison):

# python 2
import urllib2
response = urllib2.urlopen("https://example.com")

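Encoding:

urllib.request returns bytes, which must be decoded explicitly (for example, decode("utf-8")).
urllib.parse.urlencode returns a str, so POST data must be encoded to bytes before it is sent.
Example:

data = urllib.parse.urlencode({"key": "value"}).encode("utf-8")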
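
Hard-coding "utf-8" when decoding fails for servers that declare another charset. One option is to read the charset from the Content-Type header. The sketch below does this with the standard email.message.Message class so that it runs offline; charset_of and decode_body are hypothetical helper names, and falling back to UTF-8 is an assumption of this example.

```python
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode response bytes using the charset the server declared."""
    return raw.decode(charset_of(content_type), errors="replace")


print(charset_of("text/html; charset=GBK"))  # gbk
print(charset_of("application/json"))        # utf-8
print(decode_body("中文".encode("gbk"), "text/html; charset=GBK"))  # 中文
```

With a live HTTP response, the same information is available directly as response.headers.get_content_charset().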

Timeouts:

Always set the timeout parameter so a request cannot hang indefinitely.
Example:

urllib.request.urlopen("https://example.com", timeout=5)
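
Setting a timeout is only half of the job; the caller also has to recognize when one fired. In CPython's implementation, a timeout while connecting arrives wrapped in URLError, while a timeout while reading the body is raised as socket.timeout directly. The hypothetical helper is_timeout below covers both cases under that assumption.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    # A timeout while reading the response surfaces as socket.timeout
    # (an alias of TimeoutError since Python 3.10).
    if isinstance(exc, socket.timeout):
        return True
    # A timeout while connecting is wrapped in URLError; the original
    # exception is available as .reason.
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, socket.timeout
    )
```

A caller would catch (urllib.error.URLError, socket.timeout) around urlopen and read calls, then use is_timeout to decide whether a retry makes sense.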

Performance:

urllib.request is suited to simple tasks; for more demanding scenarios (such as concurrent requests), use aiohttp or httpx.
Example (asynchronous request):

import aiohttp

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://example.com") as response:
            return await response.text()
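
If adding a dependency is not an option, a modest level of concurrency is also possible with urllib itself by running blocking calls in a thread pool. This is a sketch of that alternative and not a replacement for an async client; fetch_all is a hypothetical helper, and the pool size of 4 is arbitrary.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Return the response body of url as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, max_workers=4):
    """Fetch several URLs concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch in worker threads and yields results in input order.
        return list(pool.map(fetch, urls))
```

An exception raised by any single fetch propagates out of fetch_all, so callers that need partial results should catch errors inside fetch instead.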

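Security:

Use HTTPS to avoid sending data in plaintext.
Verify SSL certificates to guard against man-in-the-middle attacks. urlopen already does this by default for HTTPS URLs; passing a context makes the policy explicit:

import ssl
import urllib.request

context = ssl.create_default_context()
urllib.request.urlopen("https://example.com", context=context)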
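
What the default context enforces can be confirmed by inspecting it, as in this short offline check:

```python
import ssl

context = ssl.create_default_context()

# The default context requires a valid certificate chain...
requires_cert = context.verify_mode == ssl.CERT_REQUIRED
# ...and checks that the certificate matches the requested host name.
checks_host = context.check_hostname

print(requires_cert)  # True
print(checks_host)    # True
```

Code that sets verify_mode to ssl.CERT_NONE or check_hostname to False gives up this protection and should be limited to controlled test environments.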

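6. Summary

Python's urllib is the standard-library tool for handling URLs and network requests. It comprises four submodules:

urllib.request: sends HTTP/HTTPS requests and downloads files.
urllib.error: defines request exceptions.
urllib.parse: parses and encodes URLs.
urllib.robotparser: parses robots.txt.

Key points:

Simple to use: well suited to lightweight networking tasks.
Use cases: web scraping, API calls, file downloads, robots.txt checks.
Best practices: handle exceptions, use context managers, set request headers, parameterize URLs.

urllib is capable, but for complex scenarios (such as session management or asynchronous requests), requests or aiohttp is the better choice.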
