HTML structure parsing is one of the core skills in web scraping: it lets you extract the information you need from a page. Python offers several popular libraries for this, the most common being BeautifulSoup and lxml.
1. Installing the required libraries
First, install requests (for sending HTTP requests) and beautifulsoup4 (for parsing HTML). Both can be installed with pip:
pip install requests beautifulsoup4
2. Sending an HTTP request and getting the HTML
The requests library makes it easy to fetch an HTML page from a website:
import requests
url = "https://www.example.com"
response = requests.get(url)
# Check whether the request succeeded
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")
3. Parsing the HTML
Next, parse the HTML content with BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
Here 'html.parser' is the name of the parser. BeautifulSoup supports several parsers, including Python's built-in standard library parser, lxml, and html5lib.
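For example, if lxml is installed (pip install lxml), you select it simply by passing its name; per the BeautifulSoup documentation it is faster than the built-in parser:
# Requires: pip install lxml
soup = BeautifulSoup(html_content, 'lxml')       # fast C-based parser
# soup = BeautifulSoup(html_content, 'html5lib') # slowest, but parses markup the way a browser does (requires html5lib)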
4. Selecting and extracting information
Once you have a BeautifulSoup object, you can start extracting information. Here are the most common ways to select elements:
- By tag name:
titles = soup.find_all('h1')
- By class name:
articles = soup.find_all('div', class_='article')
- By id:
main_content = soup.find(id='main-content')
- By attribute:
links = soup.find_all('a', href=True)
- With combined CSS selectors (a short extension follows this list):
article_titles = soup.select('div.article h2.title')
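The select() method accepts standard CSS selector syntax, so the last bullet generalizes: you can also match on attributes, or use select_one() when you only want the first hit:
first_title = soup.select_one('div.article h2.title')   # first match, or None
external_links = soup.select('a[href^="https://"]')     # attribute prefix selector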
5. Iterating over and processing the data
Once you have extracted elements, you can iterate over them and process them:
for title in soup.find_all('h2'):
    print(title.text.strip())
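The results returned by find_all() are Tag objects, so you can read attributes as well as text. A small sketch, reusing the links = soup.find_all('a', href=True) query from the previous section:
# Tag attributes are read like dictionary keys; .get() returns None if the attribute is missing
for link in soup.find_all('a', href=True):
    print(link.get('href'), link.text.strip())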
6. Recursive parsing
For complex nested structures, you can parse with a recursive function:
def parse_section(section):
    title = section.find('h2')
    if title:
        print(title.text.strip())
    # recursive=False limits the search to direct child <section> tags,
    # so each nested section is handled exactly once by its own recursive call
    sub_sections = section.find_all('section', recursive=False)
    for sub_section in sub_sections:
        parse_section(sub_section)

# Start the recursion only from top-level sections (those not nested inside another <section>)
for section in soup.find_all('section'):
    if not section.find_parent('section'):
        parse_section(section)
7. A complete example
Let's put this together into a complete example that fetches and parses a simple page:
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"

# Send the request and parse the HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all article titles
article_titles = soup.find_all('h2', class_='article-title')

# Print every article title
for title in article_titles:
    print(title.text.strip())
This example shows how to grab every h2 element with class="article-title" from a page and print its text content.
That covers the basic workflow of HTML structure parsing with Python and BeautifulSoup. In real projects you will often need more complex logic, such as handling JavaScript-rendered content or pagination.
Building on what we have covered so far, let's extend the code to handle more complex scenarios: pagination, error handling, logging, and data persistence. We will keep using requests and BeautifulSoup, and bring in logging and sqlite3 for log output and data storage.
1. Exception handling and logging
During a crawl you may hit all kinds of problems: network errors, server errors, or parsing errors. A try...except block combined with the logging module helps us handle them cleanly:
import logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(filename='crawler.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')

def fetch_data(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raises an HTTPError for bad responses
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch {url}: {e}")
        return None

# Example usage
url = 'https://www.example.com'
soup = fetch_data(url)
if soup:
    pass  # proceed with parsing...
else:
    logging.info("No data fetched, skipping...")
2. Handling pagination
Many sites split large amounts of data across pages. Inspect the page source to find the pattern behind the pagination links, then write code that walks through every page:
def fetch_pages(base_url, page_suffix='page/'):
    current_page = 1
    while True:
        url = f"{base_url}{page_suffix}{current_page}"
        soup = fetch_data(url)
        if not soup:
            break
        # Process the page data here...
        # Check for a link to the next page
        next_page_link = soup.find('a', string='next')
        if not next_page_link:
            break
        current_page += 1
3. Data persistence with SQLite
Storing the scraped data in a database makes later analysis and retrieval straightforward. SQLite is a lightweight database that suits small projects well:
import sqlite3

def init_db():
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            author TEXT,
            published_date DATE
        )
    ''')
    conn.commit()
    return conn

def save_article(conn, title, author, published_date):
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO articles (title, author, published_date) VALUES (?, ?, ?)
    ''', (title, author, published_date))
    conn.commit()

# Initialize the database
conn = init_db()

# Save a row
save_article(conn, "Example title", "Author name", "2024-07-24")
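To confirm the data landed (a minimal sketch), you can read the rows back with a plain SELECT before closing the connection:
# Read the saved rows back for a quick sanity check
cursor = conn.cursor()
cursor.execute('SELECT id, title, author, published_date FROM articles')
for row in cursor.fetchall():
    print(row)
conn.close()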
4. Full example: scraping paginated data into SQLite
Let's combine the ideas above into a complete example that scrapes paginated data and saves it to a SQLite database:
import logging
import requests
from bs4 import BeautifulSoup
import sqlite3

logging.basicConfig(filename='crawler.log', level=logging.INFO)

def fetch_data(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch {url}: {e}")
        return None

def fetch_pages(base_url, page_suffix='page/'):
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            author TEXT,
            published_date DATE
        )
    ''')
    conn.commit()
    current_page = 1
    while True:
        url = f"{base_url}{page_suffix}{current_page}"
        soup = fetch_data(url)
        if not soup:
            break
        # Assume the structure of the site lets us find titles easily
        titles = soup.find_all('h2', class_='article-title')
        for title in titles:
            save_article(conn, title.text.strip(), None, None)
        next_page_link = soup.find('a', string='next')
        if not next_page_link:
            break
        current_page += 1
    conn.close()

def save_article(conn, title, author, published_date):
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO articles (title, author, published_date) VALUES (?, ?, ?)
    ''', (title, author, published_date))
    conn.commit()

# Example usage
base_url = 'https://www.example.com/articles/'
fetch_pages(base_url)
This example scrapes the paginated data at https://www.example.com/articles/ and saves the article titles to the SQLite database. Note that you will need to adjust the arguments to find_all and find to match the actual HTML structure of the target site; a small sketch follows.
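For example (a purely hypothetical structure, not the markup of the example site), if each article were an <article> element whose title sits in an <a class="post-link">, and the next-page link carried class="next" instead of the link text 'next', the extraction part of the loop might instead look like this:
# Hypothetical markup: <article><a class="post-link" href="...">Title</a></article>
titles = [a.text.strip() for a in soup.select('article a.post-link')]
for title in titles:
    save_article(conn, title, None, None)

# Hypothetical pagination: the next link is identified by class="next"
next_page_link = soup.find('a', class_='next')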
Now that we have a basic framework for scraping paginated data into a SQLite database, let's refine it further with more detailed error handling, richer logging, and support for dynamically loaded pages (content rendered by JavaScript).
1. More detailed error handling
In fetch_data, besides request errors we can also catch and log any other errors that may occur, for example while parsing the HTML:
def fetch_data(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    except requests.exceptions.RequestException as e:
        logging.error(f"Request error fetching {url}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    return None
2. More detailed logging
On the logging side we can record more information, such as the HTTP status code and the response time:
import time

def fetch_data(url):
    try:
        start_time = time.time()
        response = requests.get(url)
        elapsed_time = time.time() - start_time
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        logging.info(f"Fetched {url} successfully in {elapsed_time:.2f} seconds, status code: {response.status_code}")
        return soup
    except requests.exceptions.RequestException as e:
        logging.error(f"Request error fetching {url}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    return None
3. Handling dynamically loaded content
When a site loads content dynamically with JavaScript, a plain HTTP request cannot retrieve the full page. In that case you can use a library such as Selenium or pyppeteer to drive a real browser. Here is an example with Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_data_with_js(url):
    options = Options()
    options.add_argument('--headless')  # run Chrome in headless mode
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # Add a fixed wait, or wait explicitly for certain elements to load (see the sketch below)
    time.sleep(3)  # wait for dynamic content to load
    html = driver.page_source
    driver.quit()
    return BeautifulSoup(html, 'html.parser')
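Instead of a fixed time.sleep, you can wait explicitly for a specific element to appear, which is usually faster and more reliable. A minimal sketch, reusing the imports above and assuming (hypothetically) that the dynamic content is marked by h2.article-title elements:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_data_with_js_wait(url, timeout=10):
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one article title is present, or raise TimeoutException
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'h2.article-title'))
        )
        return BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()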
To run this code you first need to download ChromeDriver and make sure it is on your system PATH and executable. You also need to install the selenium library:
pip install selenium
4. Putting all the improvements together
Now we can fold all of the improvements above into the paginated scraping script:
import logging
import time
import requests
from bs4 import BeautifulSoup
import sqlite3
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

logging.basicConfig(filename='crawler.log', level=logging.INFO)

def fetch_data(url):
    try:
        start_time = time.time()
        response = requests.get(url)
        elapsed_time = time.time() - start_time
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        logging.info(f"Fetched {url} successfully in {elapsed_time:.2f} seconds, status code: {response.status_code}")
        return soup
    except requests.exceptions.RequestException as e:
        logging.error(f"Request error fetching {url}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
    return None

def fetch_data_with_js(url):
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(3)
    html = driver.page_source
    driver.quit()
    return BeautifulSoup(html, 'html.parser')

def fetch_pages(base_url, page_suffix='page/', use_js=False):
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            author TEXT,
            published_date DATE
        )
    ''')
    conn.commit()
    current_page = 1
    fetch_function = fetch_data_with_js if use_js else fetch_data
    while True:
        url = f"{base_url}{page_suffix}{current_page}"
        soup = fetch_function(url)
        if not soup:
            break
        titles = soup.find_all('h2', class_='article-title')
        for title in titles:
            save_article(conn, title.text.strip(), None, None)
        next_page_link = soup.find('a', string='next')
        if not next_page_link:
            break
        current_page += 1
    conn.close()

def save_article(conn, title, author, published_date):
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO articles (title, author, published_date) VALUES (?, ?, ?)
    ''', (title, author, published_date))
    conn.commit()

# Example usage
base_url = 'https://www.example.com/articles/'
use_js = True  # set to True if the site uses JavaScript to load content
fetch_pages(base_url, use_js=use_js)
This improved version of the script includes error handling, detailed logging, and the ability to handle dynamically loaded content, which makes it considerably more robust and practical.