A Practical Guide to Efficient HTML Parsing with C#

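1. Why Parse HTML in C#

In real projects, web data collection, page content analysis, and crawler development all depend on parsing HTML. An e-commerce platform, for example, may need to collect product prices and stock levels from competitor sites; a news aggregator may need to extract article titles, body text, and publication times from major news sites. Parsing HTML in C# lets you automate the collection of this data, which saves a great deal of manual work.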

2. Common Tools and Libraries for Parsing HTML in C#

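HtmlAgilityPack: one of the most widely used HTML parsing libraries for C#. It offers a simple API that parses an HTML document into a DOM (Document Object Model) tree, from which nodes and attributes can be extracted with XPath. CSS selectors are not built into the core package; they are usually added through an extension package such as Fizzler.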

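AngleSharp: another capable HTML parsing library. It follows the modern HTML5 specification and performs well. It also builds a DOM tree, supports CSS selectors natively, and provides a rich event-handling mechanism that helps when working with complex page structures.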

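3. Parsing HTML with HtmlAgilityPack

Installing the library: the simplest route is the NuGet package manager. In Visual Studio, right-click the project, choose “Manage NuGet Packages”, search for “HtmlAgilityPack”, and install it.

Basic parsing example: the following code uses HtmlAgilityPack to extract every link from an HTML string: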

 
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = "<html><body><a href='https://www.example.com'>example link</a></body></html>";

        // Parse the HTML string into a DOM tree.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <a> element; SelectNodes returns null when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // The second argument is the default used when href is missing.
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine($"link: {href}");
            }
        }
    }
}

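This code first creates an HtmlDocument object and loads the HTML string. It then calls SelectNodes with the XPath expression //a to select all <a> element nodes, and finally iterates over them and reads the value of each href attribute. SelectNodes returns null rather than an empty collection when nothing matches, which is why the null check is needed.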
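As a small variation on the example above, the sketch below narrows the XPath to //a[@href] so that anchors without an href are skipped, and wraps the logic in a reusable method. LinkExtractor and ExtractHrefs are hypothetical names chosen for illustration; they are not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper (not a library API): returns the href of every
    // <a> element that actually carries an href attribute.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // "//a[@href]" matches only anchors that have an href attribute.
        // SelectNodes returns null, not an empty collection, on no match.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return hrefs;
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "").Trim();
            if (href.Length > 0)
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }
}

class Program
{
    static void Main()
    {
        // Sample input: one real link and one anchor without an href.
        string html = "<p><a href='https://www.example.com'>one</a><a name='top'>no href</a></p>";
        foreach (string href in LinkExtractor.ExtractHrefs(html))
        {
            Console.WriteLine($"link: {href}");
        }
    }
}
```

Returning an empty list instead of null keeps the null handling in one place, so callers can iterate over the result directly.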

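Extracting data from more complex structures: suppose we want to extract product information (name, price, and image link) from an e-commerce page. The HTML might look like this: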

 
<div class="product">
    <img src="product1.jpg" alt="product name">
    <h2 class="product-name">product 1</h2>
    <span class="price">$19.99</span>
</div>

The code to extract the data with HtmlAgilityPack is as follows:

 
using System;
using HtmlAgilityPack;

class Product
{
    public string Name { get; set; }
    public string Price { get; set; }
    public string ImageUrl { get; set; }
}

class Program
{
    static void Main()
    {
        string html = "<div class='product'><img src='product1.jpg' alt='product name'><h2 class='product-name'>product 1</h2><span class='price'>$19.99</span></div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Locate the product container first, then search inside it.
        HtmlNode productNode = doc.DocumentNode.SelectSingleNode("//div[@class='product']");
        if (productNode != null)
        {
            Product product = new Product();

            // ".//" keeps each query relative to productNode.
            HtmlNode imgNode = productNode.SelectSingleNode(".//img");
            if (imgNode != null)
            {
                product.ImageUrl = imgNode.GetAttributeValue("src", "");
            }

            HtmlNode nameNode = productNode.SelectSingleNode(".//h2[@class='product-name']");
            if (nameNode != null)
            {
                product.Name = nameNode.InnerText;
            }

            HtmlNode priceNode = productNode.SelectSingleNode(".//span[@class='price']");
            if (priceNode != null)
            {
                product.Price = priceNode.InnerText;
            }

            Console.WriteLine($"name: {product.Name}, price: {product.Price}, imageurl: {product.ImageUrl}");
        }
    }
}

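Here SelectSingleNode is combined with XPath expressions to select exactly the nodes needed and to read their attributes and text content. The leading .// in each inner expression keeps the search relative to the product node. Note that a predicate such as [@class='product'] matches only when the class attribute is exactly that string, so an element with class="product featured" would not be selected.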
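Real pages usually contain several product blocks, often with more than one class on the container. The following is a minimal sketch of how the same approach might be extended to a list of products. ProductInfo and ProductParser are hypothetical names, and the price handling assumes prices are written with a leading "$" and a dot as the decimal separator, as in the sample HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

// Hypothetical data holder for this sketch.
class ProductInfo
{
    public string Name { get; set; } = "";
    public decimal? Price { get; set; }
    public string ImageUrl { get; set; } = "";
}

static class ProductParser
{
    // Hypothetical helper: extracts every product block found in the HTML.
    public static List<ProductInfo> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var products = new List<ProductInfo>();

        // Token-style class match, so class="product featured" is also selected.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]");
        if (nodes == null)
        {
            return products;
        }

        foreach (HtmlNode node in nodes)
        {
            var product = new ProductInfo
            {
                Name = CleanText(node.SelectSingleNode(".//h2[@class='product-name']")),
                ImageUrl = node.SelectSingleNode(".//img")?.GetAttributeValue("src", "") ?? ""
            };

            // Assumption: prices look like "$19.99".
            string priceText = CleanText(node.SelectSingleNode(".//span[@class='price']")).TrimStart('$');
            if (decimal.TryParse(priceText, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal price))
            {
                product.Price = price;
            }

            products.Add(product);
        }
        return products;
    }

    // InnerText keeps entities such as &amp; encoded, so decode and trim it.
    private static string CleanText(HtmlNode node)
    {
        return node == null ? "" : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}

class Program
{
    static void Main()
    {
        string html =
            "<div class='product featured'><img src='product1.jpg'><h2 class='product-name'>product 1</h2><span class='price'>$19.99</span></div>" +
            "<div class='product'><img src='product2.jpg'><h2 class='product-name'>product 2</h2><span class='price'>$5.00</span></div>";

        foreach (ProductInfo p in ProductParser.Parse(html))
        {
            Console.WriteLine($"name: {p.Name}, price: {p.Price}, imageurl: {p.ImageUrl}");
        }
    }
}
```

Keeping Price nullable lets a caller tell "no price found" apart from a price of zero.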

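4. Parsing HTML with AngleSharp

Installing the library: likewise, search for “AngleSharp” in the NuGet package manager and install it.

Basic parsing example: the code to extract all links with AngleSharp is as follows: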

 
using System;
using System.Threading.Tasks;
using AngleSharp;

class Program
{
    static async Task Main()
    {
        string html = "<html><body><a href='https://www.example.com'>example link</a></body></html>";

        // Create a browsing context and parse the HTML string into a document.
        var context = BrowsingContext.New();
        var document = await context.OpenAsync(req => req.Content(html));

        // Select every <a> element with a CSS selector.
        var links = document.QuerySelectorAll("a");
        foreach (var link in links)
        {
            // GetAttribute returns null when the attribute is missing.
            string href = link.GetAttribute("href");
            Console.WriteLine($"link: {href}");
        }
    }
}

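This code creates a browsing context with BrowsingContext.New(), then uses OpenAsync to load the HTML string and obtain an IDocument object. It then calls QuerySelectorAll with a CSS selector to select all <a> elements, and finally reads each href attribute.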
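For comparison with the HtmlAgilityPack product example, here is a rough sketch of the same extraction using AngleSharp's CSS selectors. AngleSharpProductParser and ParseAsync are hypothetical names introduced for illustration, and the markup is assumed to follow the product structure shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

static class AngleSharpProductParser
{
    // Hypothetical helper: returns (name, price text, image URL) per product block.
    public static async Task<List<(string Name, string Price, string ImageUrl)>> ParseAsync(string html)
    {
        IBrowsingContext context = BrowsingContext.New(Configuration.Default);
        IDocument document = await context.OpenAsync(req => req.Content(html));

        var results = new List<(string Name, string Price, string ImageUrl)>();

        // "div.product" matches any div whose class list contains "product".
        foreach (IElement product in document.QuerySelectorAll("div.product"))
        {
            // QuerySelector returns null on no match, hence the ?. and ?? fallbacks.
            string name = product.QuerySelector("h2.product-name")?.TextContent.Trim() ?? "";
            string price = product.QuerySelector("span.price")?.TextContent.Trim() ?? "";
            string imageUrl = product.QuerySelector("img")?.GetAttribute("src") ?? "";
            results.Add((name, price, imageUrl));
        }
        return results;
    }
}

class Program
{
    static async Task Main()
    {
        string html = "<div class='product'><img src='product1.jpg' alt='product name'><h2 class='product-name'>product 1</h2><span class='price'>$19.99</span></div>";

        foreach (var p in await AngleSharpProductParser.ParseAsync(html))
        {
            Console.WriteLine($"name: {p.Name}, price: {p.Price}, imageurl: {p.ImageUrl}");
        }
    }
}
```

Because a CSS class selector matches individual class tokens, div.product also selects elements such as class="product featured", which an exact XPath comparison like [@class='product'] would miss.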