A Complete Guide to Converting PDF to HTML with Python

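In many situations we need to convert a PDF document to an HTML file, for example in report generation, web publishing, and automated document processing. When a PDF contains structured content such as text, images, or tables, converting it to HTML makes the content easier to display, more interactive, and simpler to integrate into websites and web applications.

Compared with PDF, HTML offers a more flexible environment for presenting content. Once converted, the content can be styled with CSS, embedded in web pages, and made interactive with JavaScript. This article explains how to convert PDF to HTML with Python and provides practical examples. It covers: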

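Converting an entire PDF to HTML with Python
Extracting specific pages from a PDF and exporting them to HTML
Exporting a specific region of a PDF page to HTML
Converting only the tables in a PDF to HTML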

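Why convert PDF to HTML?

Converting PDF to HTML is more than a change of format: it brings several advantages for presenting the content and for processing it afterwards. The main reasons developers and content creators choose it are: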

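Web-friendly: HTML renders natively in the browser, typically loads quickly, and can adapt to different screen sizes, so it displays well on both desktop and mobile.
Better interactivity: links, tables, and images from the PDF can become dynamic web content that supports user interaction and data display.
Data extraction and processing: once the PDF is converted to HTML, developers can manipulate the data more easily for analysis or further processing.
Easier sharing: HTML files are often smaller than the PDF and can be embedded directly in a website, so readers can view them without downloading anything.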

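Preparation

To convert PDF to HTML in Python, you need a PDF processing library. The examples in this article use Spire.PDF for Python, which can export a PDF document, or parts of one, to HTML without Adobe Acrobat or other external PDF software.

Before you start, install the library from PyPI: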

pip install spire.pdf

Note: make sure your Python version is 3.7 or later.
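
Both prerequisites can be verified before running any conversion. The following is a minimal sketch, not part of the library: the helper name check_environment is hypothetical, and it assumes the installed package is importable under the top-level name spire, as the import statements in this article suggest.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 7)  # minimum version stated in the note above


def check_environment(min_version=MIN_PYTHON, package="spire"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info[:2] < min_version:
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ is required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    # find_spec returns None when a top-level package cannot be located
    if find_spec(package) is None:
        problems.append(f"package '{package}' not found; run: pip install spire.pdf")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty result means the interpreter is recent enough and the package can be located; otherwise each entry describes what is missing.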

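Converting a PDF to HTML with Python

Sometimes we need to show an entire PDF document on a web page. Converting the whole PDF keeps all of its text, images, and formatting together in one output file.

Steps:

Load the PDF with PdfDocument.LoadFromFile().
Save it as HTML with SaveToFile().
Close the PDF to release resources.

Example: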

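from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
pdf = PdfDocument()

# Load the PDF file
pdf.LoadFromFile("sample.pdf")  # replace with the path to your PDF

# Convert the PDF to HTML
pdf.SaveToFile("output.html", FileFormat.HTML)

# Close the PDF document to release resources
pdf.Close()

print("The entire PDF was successfully converted to HTML.")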
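
In a larger script, the three steps above could be wrapped in a function that validates the input path and closes the document even when the conversion fails. This is a sketch under stated assumptions: the function name convert_pdf_to_html is hypothetical, and it assumes PdfDocument and FileFormat can be imported by name from spire.pdf.

```python
from pathlib import Path


def convert_pdf_to_html(src, dst):
    """Convert src (a PDF path) to an HTML file at dst and return dst as a Path."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_file():
        raise FileNotFoundError(f"input PDF not found: {src_path}")
    if src_path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {src_path.name}")

    # Imported lazily so the checks above run even before the library is installed
    # (assumption: both names are exported by spire.pdf)
    from spire.pdf import PdfDocument, FileFormat

    pdf = PdfDocument()
    try:
        pdf.LoadFromFile(str(src_path))
        pdf.SaveToFile(str(dst_path), FileFormat.HTML)
    finally:
        pdf.Close()  # release resources even if loading or saving fails
    return dst_path
```

Calling convert_pdf_to_html("sample.pdf", "output.html") would then behave like the example above, with clearer errors for a missing or mistyped input path.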

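Extracting specific pages from a PDF and exporting them to HTML

Sometimes we only need the content of a few pages. Converting the whole PDF is then unnecessary and can produce a large HTML file, so it is better to extract just the pages we want and convert those.

Steps:

Load the original PDF with PdfDocument.LoadFromFile().
Create a new PdfDocument to hold the required pages.
Add the target pages to the new PDF.
Save the new PDF as an HTML file.

Example (converts only pages 2 and 3):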

from spire.pdf.common import *
from spire.pdf import *

# Load the original PDF
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Create a new PDF to hold the selected pages
selected_pdf = PdfDocument()

# Copy pages 2 and 3 into the new PDF (page indices start at 0)
selected_pdf.InsertPage(pdf, 1)
selected_pdf.InsertPage(pdf, 2)

# Convert the selected pages to HTML
selected_pdf.SaveToFile("selected_pages.html", FileFormat.HTML)

# Close both PDFs
pdf.Close()
selected_pdf.Close()

print("The selected pages were successfully converted to HTML.")
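
Because page indices start at 0 while readers usually think in page numbers starting at 1, off-by-one mistakes are easy to make here. The following sketch shows one way to convert and validate page numbers before copying pages; the helper name to_zero_based is hypothetical and the function is independent of the PDF library.

```python
def to_zero_based(page_numbers, page_count):
    """Map 1-based page numbers to sorted, de-duplicated 0-based indices."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indices.append(number - 1)
    return sorted(set(indices))


# Pages 2 and 3 of a 5-page document map to indices 1 and 2
print(to_zero_based([2, 3], page_count=5))  # → [1, 2]
```

The resulting indices could then drive the page-copy loop, with the document's page count passed as page_count.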

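Exporting a specific region of a PDF page to HTML

Sometimes we only need part of a page, such as a chart, an image, or a particular block of text. In that case, crop the page before exporting it.

Steps:

Open the original PDF.
Get the target page and define a rectangular region with CropBox.
Create a new PDF and insert the cropped page.
Export to HTML with SaveToFile().
Close the PDFs to release resources.

Example: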

from spire.pdf.common import *
from spire.pdf import *

# Step 1: Load the PDF
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Step 2: Crop the first page to the target region (x, y, width, height)
page = pdf.Pages[0]
page.CropBox = RectangleF(PointF(30.0, 280.0), SizeF(552.0, 220.0))

# Step 3: Create a new PDF to hold the cropped page
new_pdf = PdfDocument()
new_pdf.InsertPage(pdf, 0)

# Step 4: Save the cropped page as HTML
new_pdf.SaveToFile("page_area.html", FileFormat.HTML)

# Step 5: Close both PDFs
new_pdf.Close()
pdf.Close()

print("The selected page region was successfully exported to HTML.")
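
Choosing the rectangle is the error-prone part of this approach. The sketch below assumes the crop coordinates are expressed in PDF points (1/72 inch, the default PDF unit) and uses an A4 page of roughly 595 x 842 points as sample dimensions; both helper names are hypothetical, and a real script would take the page size from the document itself.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert a length in millimetres to PDF points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def fits_on_page(x, y, width, height, page_width, page_height):
    """Return True if the rectangle lies fully inside the page bounds."""
    return (
        width > 0 and height > 0
        and x >= 0 and y >= 0
        and x + width <= page_width
        and y + height <= page_height
    )


# The rectangle from the example above, checked against an assumed A4 page
print(fits_on_page(30.0, 280.0, 552.0, 220.0, 595.0, 842.0))  # → True
print(round(mm_to_points(25.4), 1))  # → 72.0
```

Checking the rectangle first makes it easier to tell a mis-sized region apart from a conversion problem when the exported HTML looks empty or clipped.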

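Converting only the tables in a PDF to HTML

PDFs often contain structured tables, and sometimes the tables are all we need. In that case we can extract them and build the HTML <table> elements by hand.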

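Steps:

Load the PDF.
Loop over the pages and extract the tables on each page with PdfTableExtractor.ExtractTable().
Build the HTML <table> markup by hand.
Save the HTML file.

Example:

import html

from spire.pdf.common import *
from spire.pdf import *

# Load the PDF
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# PdfTableExtractor detects tables one page at a time
extractor = PdfTableExtractor(pdf)

# Collect HTML fragments and join them once at the end
parts = ["<html><body>"]

# Loop over the pages and extract their tables
for page_index in range(pdf.Pages.Count):
    tables = extractor.ExtractTable(page_index)
    if not tables:
        continue  # no tables detected on this page

    for table in tables:
        parts.append("<table border='1'>")
        for row in range(table.GetRowCount()):
            parts.append("<tr>")
            for col in range(table.GetColumnCount()):
                # Escape cell text so characters such as < and & stay valid HTML
                cell = html.escape(table.GetText(row, col))
                parts.append(f"<td>{cell}</td>")
            parts.append("</tr>")
        parts.append("</table><br>")

parts.append("</body></html>")

# Save the HTML file
with open("tables_only.html", "w", encoding="utf-8") as f:
    f.write("".join(parts))

pdf.Close()

print("The PDF tables were successfully converted to HTML.")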

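This way, each table is written to the HTML file as its own <table> element, which suits PDFs that consist mainly of financial statements, invoices, or data tables.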
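
The markup-building step can also be separated from the PDF extraction so that it can be tested without a PDF at hand. The following sketch assumes the extracted table has already been collected into a list of rows; the helper name rows_to_html_table and the choice to render the first row as header cells are illustrative assumptions, not something the library provides.

```python
import html


def rows_to_html_table(rows, header=True):
    """Build an HTML <table> string from a list of rows (lists of cell values)."""
    parts = ["<table border='1'>"]
    for row_index, row in enumerate(rows):
        # Render the first row with <th> cells when header is True
        tag = "th" if header and row_index == 0 else "td"
        cells = "".join(f"<{tag}>{html.escape(str(cell))}</{tag}>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)


print(rows_to_html_table([["Item", "Price"], ["R&D", "<10"]]))
```

Escaping each cell with html.escape keeps characters such as & and < from breaking the generated markup, which matters for financial data that may contain them.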