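Introduction
In everyday office work and software development, we often need to convert between document formats. Conversion between Word documents (DOCX) and JSON data matters most in automated report generation, content management systems, and data migration. This article walks through an enhanced Python tool that converts between DOCX and JSON in both directions and aims to preserve complex elements such as styles, lists, tables, and images.
Unlike plain text extraction, the tool tries to keep the document's structure and formatting: paragraph formats, font styles, table layout, and even form controls such as checkboxes. That matters in business settings where documents must keep a professional appearance.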
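Core features
The enhanced converter provides the following core features:
- Two-way conversion: a DOCX document can be converted to structured JSON, and that JSON can be restored to a formatted DOCX document
- Style preservation: paragraph styles, character styles, and table styles are extracted and restored
- Complex elements: lists, tables, images, and checkboxes are handled
- Batch processing: a whole folder can be converted in one pass
- Document metadata: section information and page setup are retained
Compared with online converters, this approach offers better data security and more room for customization: all processing happens locally, so sensitive documents never have to be uploaded to a third-party server.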
Core code walkthrough
1. Document-to-JSON conversion
The docx_to_json function is the heart of the conversion. It parses each part of the DOCX document in turn and builds one structured JSON object:
from docx import Document

def docx_to_json(docx_path):
    document = Document(docx_path)
    doc_data = {
        "metadata": {"created_by": "docx_converter", "version": "2.0"},
        "styles": {"paragraph_styles": [], "character_styles": [], "table_styles": []},
        "paragraphs": [],
        "tables": [],
        "images": [],
        "sections": []
    }
    # Extract each kind of element into doc_data
    extract_styles(document, doc_data)
    extract_paragraphs(document, doc_data)
    extract_tables(document, doc_data)
    extract_images(document, doc_data)
    extract_sections(document, doc_data)
    return doc_data
This modular design keeps the code easy to maintain and extend, because each extract function handles exactly one kind of document element.
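2. Style extraction
Style extraction is the key to preserving formatting. The extract_styles function walks the style definitions in the document:
def extract_styles(document, doc_data):
    styles = document.styles
    for style in styles:
        style_info = {
            "name": style.name,
            "type": str(style.type),
            "builtin": style.builtin,
            # other attributes...
        }
        # Font properties (not every style type exposes a font)
        if hasattr(style, 'font') and style.font:
            font_info = {}
            if style.font.name: font_info["name"] = style.font.name
            if style.font.size: font_info["size"] = style.font.size.pt
            # more font attributes...
        # (the full listing then stores style_info by style type)
Because only explicitly set attributes are recorded, the captured style information stays compact while still giving the restore step what it needs.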
3. Paragraph and text handling
Paragraph handling covers more than text: it also records formatting, list properties, and inline elements:
def extract_paragraphs(document, doc_data):
    for para_idx, paragraph in enumerate(document.paragraphs):
        para_info = {}
        # Text content
        if paragraph.text.strip():
            para_info["text"] = paragraph.text
        # Paragraph style
        if paragraph.style and paragraph.style.name:
            para_info["style"] = paragraph.style.name
        # List detection
        list_info = detect_list_properties(paragraph)
        if list_info:
            para_info["list_info"] = list_info
        # Runs (spans of text that share one set of character formatting)
        runs_list = []
        for run in paragraph.runs:
            run_info = extract_run_properties(run)
            if run_info: runs_list.append(run_info)
        if runs_list: para_info["runs"] = runs_list
        doc_data["paragraphs"].append(para_info)
This fine-grained, run-level handling captures formatting changes precisely, so style differences between text spans inside a single paragraph are kept.
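To make the run-level structure concrete, here is a minimal sketch of one paragraph record in the shape that extract_paragraphs builds. The sample values are made up, and plain_text_from_runs is a hypothetical helper that is not part of the tool; it only shows how the per-run texts relate to the paragraph-level text.

```python
def plain_text_from_runs(para_info):
    """Join the run texts; fall back to the paragraph text if no runs exist."""
    runs = para_info.get("runs", [])
    if not runs:
        return para_info.get("text", "")
    return "".join(run.get("text", "") for run in runs)


# Illustrative record: one paragraph whose two runs carry different formatting.
sample_paragraph = {
    "text": "Hello world",
    "style": "Normal",
    "runs": [
        {"text": "Hello ", "bold": True},
        {"text": "world", "italic": True, "font_size": 12.0},
    ],
}

print(plain_text_from_runs(sample_paragraph))
```

Note that the extractor omits the "text" key for whitespace-only runs, so on real documents the joined run text can differ from the paragraph text in its whitespace.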
4. Table extraction
Tables are one of the harder parts of document processing. The tool extracts them layer by layer (table, row, cell) to keep the structure intact:
def extract_tables(document, doc_data):
    for table_idx, table in enumerate(document.tables):
        table_info = {"index": table_idx, "rows": []}
        # Table style
        if hasattr(table, 'style') and table.style:
            table_info["style"] = table.style.name
        # Rows and cells
        for row_idx, row in enumerate(table.rows):
            row_info = {"index": row_idx, "cells": []}
            for cell_idx, cell in enumerate(row.cells):
                cell_info = extract_cell_content(cell, row_idx, cell_idx)
                if cell_info: row_info["cells"].append(cell_info)
            table_info["rows"].append(row_info)
        doc_data["tables"].append(table_info)
Each cell is parsed further into its paragraphs and runs, so nested content is kept as well.
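The nesting can be sketched with plain dictionaries. The example below assumes the table/row/cell layout shown above and computes the grid size the same way the restore step in the full listing does (row count, plus the widest row). table_shape is a hypothetical helper name and the sample data is invented.

```python
def table_shape(table_info):
    """Return (rows, columns) for a table record; columns is the widest row."""
    rows = table_info.get("rows", [])
    num_cols = max((len(row.get("cells", [])) for row in rows), default=0)
    return len(rows), num_cols


# Illustrative table record with a ragged second row.
sample_table = {
    "index": 0,
    "style": "Table Grid",
    "rows": [
        {"index": 0, "cells": [
            {"row": 0, "column": 0, "text": "Metric"},
            {"row": 0, "column": 1, "text": "Value"},
        ]},
        {"index": 1, "cells": [
            {"row": 1, "column": 0, "text": "Revenue"},
        ]},
    ],
}

print(table_shape(sample_table))
```

Sizing the grid by the widest row means shorter rows simply leave trailing cells empty when the table is rebuilt.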
5. Images and media
Images are stored as Base64 text, which lets binary image data live inside the JSON:
import base64

def extract_images(document, doc_data):
    for rel in document.part.rels.values():
        if "image" in rel.reltype:
            image_part = rel.target_part
            image_info = {
                "content_type": image_part.content_type,
                "data": base64.b64encode(image_part.blob).decode('utf-8'),
                "filename": getattr(image_part, 'filename', 'image.png')
            }
            doc_data["images"].append(image_info)
Base64 is lossless, so the original image bytes are recovered exactly on restore. Note that this step records the image data only, not where each image sat in the document.
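The round trip can be checked with the standard library alone. The sketch below uses a few made-up bytes in place of a real image; encode_image and decode_image are hypothetical helper names that mirror the record layout used above.

```python
import base64


def encode_image(blob, content_type, filename):
    """Build an image record with the binary payload stored as Base64 text."""
    return {
        "content_type": content_type,
        "data": base64.b64encode(blob).decode("utf-8"),
        "filename": filename,
    }


def decode_image(image_info):
    """Recover the original bytes from an image record."""
    return base64.b64decode(image_info["data"])


# Stand-in bytes (the PNG signature followed by padding), not a real image.
original = b"\x89PNG\r\n\x1a\n" + bytes(range(16))
record = encode_image(original, "image/png", "example.png")
restored = decode_image(record)
print(restored == original, len(original), len(record["data"]))
```

The encoded text is roughly one third larger than the binary data, which is worth keeping in mind for image-heavy documents.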
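Use cases and practical examples
1. Automated report generation
The tool fits automated report generation well. For example, business data in JSON form can be filled into a prepared DOCX template to produce reports with consistent formatting.
# Example: turn business data into a formatted report
business_data = {
    "title": "Quarterly Sales Report",
    "period": "2023 Q1",
    "metrics": ["Revenue", "Growth rate", "Market share"],
    "values": [1500000, 0.15, 0.23]
}
# Generate the report from a template.
# NOTE: json_to_docx in the full listing takes (json_data, output_path) and expects
# the converter's own schema; this three-argument, template-based call is illustrative.
json_to_docx(business_data, "report_template.docx", "quarterly_sales_report.docx")
2. Content management system integration
For a content management system (CMS), the tool enables structured storage and flexible publishing. Editors write in Word, the content is converted to JSON and stored in a database, and at publication time it is converted to HTML, PDF, or other formats.
3. Legal and compliance documents
In the legal field, contracts and agreements require strict formatting control. The tool helps keep formatting intact across repeated conversions, which avoids disputes over validity caused by formatting errors.
4. Education and research
In academic work, researchers can use the tool to batch-process lab reports and extract structured data for analysis, or to fill analysis results into a paper template automatically.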
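Comparison with other tools
Compared with other document conversion tools, this approach has some distinct advantages:
| Feature | This tool | Online converters | Professional software |
|---|---|---|---|
| Data privacy | Local processing, fully private | Documents must be uploaded to a server | Depends on deployment |
| Customizability | High, the code can be modified freely | Low, fixed feature set | Medium, limited by the software's interfaces |
| Format support | Focused on DOCX and JSON | Many formats | Many formats |
| Cost | Free and open source | Free or paid | Usually paid |
Compared with simple text extractors, the tool keeps far more style information; compared with complex commercial software, it has the advantage of being open and transparent.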
Tutorial
Environment setup
First install the required Python dependency:
pip install python-docx
python-docx is the core library for working with Word documents and provides an API for most parts of a DOCX file.
Basic usage
Convert DOCX to JSON:
import json
from docx_converter import docx_to_json

# Convert a single document
json_data = docx_to_json("my_document.docx")
# Save the JSON result (ensure_ascii=False keeps non-ASCII text readable)
with open("document_data.json", "w", encoding="utf-8") as f:
    json.dump(json_data, f, ensure_ascii=False, indent=2)
Restore DOCX from JSON:
import json
from docx_converter import json_to_docx

# Load the JSON data
with open("document_data.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)
# Restore the Word document
json_to_docx(json_data, "restored_document.docx")
Batch conversion:
import json
import os
from docx_converter import docx_to_json

def batch_convert(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx"):
            docx_path = os.path.join(folder_path, filename)
            json_data = docx_to_json(docx_path)
            json_path = os.path.join(folder_path, filename.replace(".docx", ".json"))
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(json_data, f, ensure_ascii=False, indent=2)
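One detail of the batch loop deserves care: str.replace(".docx", ".json") rewrites every occurrence of ".docx" in the name, and Word leaves temporary lock files that start with "~$" next to open documents. The sketch below shows one way to handle both with pathlib; json_path_for and is_convertible are hypothetical helper names, not part of the tool.

```python
from pathlib import Path


def json_path_for(docx_path):
    """Swap only the final suffix, leaving the rest of the name untouched."""
    return Path(docx_path).with_suffix(".json")


def is_convertible(path):
    """Accept .docx files (any letter case) but skip Word's ~$ lock files."""
    p = Path(path)
    return p.suffix.lower() == ".docx" and not p.name.startswith("~$")


print(json_path_for("reports/q1.docx.backup.docx"))
print(is_convertible("~$draft.docx"), is_convertible("Summary.DOCX"))
```

Inside batch_convert, these two helpers could replace the endswith check and the replace call without changing anything else.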
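Advanced usage
Checkbox detection:
from docx_converter import find_all_checkboxes

# Detect the checkboxes in a document
results = find_all_checkboxes("form_document.docx")
print(f"Found {len(results['checked'])} checked checkboxes")
print(f"Found {len(results['unchecked'])} unchecked checkboxes")
Style customization:
# Custom style mapping applied during conversion
def custom_style_mapper(style_info):
    # Modify or filter specific styles
    if style_info.get('name') == 'Heading 1':  # python-docx names this built-in style "Heading 1"
        style_info['font_size'] = 16  # change the Heading 1 font size
    return style_info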
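The article does not show how a mapper like this is wired into the converter, so the following is only a sketch of one possible hook. It assumes the "styles" layout produced by docx_to_json (three lists of style records) and treats a None return value as "drop this style"; apply_style_mapper and enlarge_heading are hypothetical names.

```python
def apply_style_mapper(doc_data, mapper):
    """Run mapper over every style record; a None result drops the style."""
    styles = doc_data.get("styles", {})
    for group in list(styles):
        mapped = (mapper(dict(style)) for style in styles[group])
        styles[group] = [style for style in mapped if style is not None]
    return doc_data


def enlarge_heading(style_info):
    """Example mapper: bump the size recorded for the 'Heading 1' style."""
    if style_info.get("name") == "Heading 1":
        style_info["font_size"] = 16
    return style_info


doc_data = {
    "styles": {
        "paragraph_styles": [{"name": "Heading 1"}, {"name": "Normal"}],
        "character_styles": [],
        "table_styles": [],
    }
}
apply_style_mapper(doc_data, enlarge_heading)
print(doc_data["styles"]["paragraph_styles"])
```

Each record is copied before it is handed to the mapper, so a mapper that mutates its argument cannot corrupt the original list halfway through a pass.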
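Caveats and best practices
1. File path handling
When working with file paths, prefer absolute paths and add appropriate error handling:
import os
from docx_converter import docx_to_json

def safe_convert(docx_path):
    if not os.path.exists(docx_path):
        raise FileNotFoundError(f"Document not found: {docx_path}")
    if not docx_path.endswith('.docx'):
        raise ValueError("Only .docx files are supported")
    try:
        return docx_to_json(docx_path)
    except Exception as e:
        print(f"Conversion failed: {e}")
        return None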
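2. Large documents
When processing large documents, consider limiting memory use:
from docx import Document

def process_large_document(docx_path, chunk_size=10):
    """Process a large document in chunks of paragraphs."""
    document = Document(docx_path)
    total_paragraphs = len(document.paragraphs)
    for i in range(0, total_paragraphs, chunk_size):
        # process_paragraph_chunk and save_chunk are placeholders that are
        # not defined in this article; supply your own implementations.
        chunk_data = process_paragraph_chunk(document, i, i + chunk_size)
        save_chunk(chunk_data, i)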
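The index arithmetic behind this chunking can be isolated and tested on its own. The sketch below is a hypothetical helper (chunk_ranges is not part of the tool) that yields half-open ranges and clamps the last one, so a chunk never points past the final paragraph.

```python
def chunk_ranges(total, chunk_size):
    """Yield (start, end) pairs covering range(total) in steps of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


print(list(chunk_ranges(25, 10)))
```

Keep in mind that Document(docx_path) still parses the whole file up front, so chunking like this limits the size of the intermediate JSON rather than the memory used by the parsed document itself.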
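3. Keeping styles consistent
To keep styles consistent, build documents from a template:
def create_from_template(json_data, template_path, output_path):
    """Create a document based on a template."""
    template_data = docx_to_json(template_path)
    # Apply the data to the template's styles.
    # merge_data_with_template is a placeholder that is not defined in this article.
    merged_data = merge_data_with_template(json_data, template_data)
    json_to_docx(merged_data, output_path)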
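Since merge_data_with_template is left undefined, here is one possible reading of it, offered as an assumption rather than the author's implementation: the content (paragraphs, tables, images) comes from the data, while the style definitions and page setup come from the template.

```python
def merge_data_with_template(json_data, template_data):
    """Hypothetical merge: keep the content, take styles and sections from the template."""
    merged = dict(json_data)  # shallow copy so the caller's dict is left unchanged
    merged["styles"] = template_data.get("styles", {})
    merged["sections"] = template_data.get("sections", [])
    return merged


content = {"paragraphs": [{"text": "Q1 summary", "style": "Heading 1"}], "sections": []}
template = {
    "styles": {"paragraph_styles": [{"name": "Heading 1"}]},
    "sections": [{"index": 0, "page_width": 595.3, "page_height": 841.9}],
}
merged = merge_data_with_template(content, template)
print(sorted(merged))
```

A real merge would probably also need to reconcile style names that exist in the data but not in the template; this sketch ignores that case.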
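Extending and customizing
The tool is designed so that new features are easy to add:
1. Supporting new elements
def extract_custom_elements(document, doc_data):
    """Extract custom elements."""
    # Add extraction logic for charts, math formulas, and other special elements
    pass

def create_custom_elements(document, element_data):
    """Create custom elements."""
    pass
2. Supporting other formats
Combined with a tool such as pandoc, more formats can be supported:
def convert_via_markdown(json_data):
    """Convert through Markdown as an intermediate format."""
    # JSON -> Markdown -> target format
    # json_to_markdown is a placeholder that is not defined in this article.
    markdown_content = json_to_markdown(json_data)
    # The Markdown can then be passed to pandoc for conversion to other formats
    return markdown_content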
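json_to_markdown is not defined in the article, so the following is a minimal sketch under stated assumptions: paragraphs whose style is named "Heading N" become Markdown headings, paragraphs that carry list_info become bullet items, and everything else becomes a plain paragraph. Run-level formatting, tables, and images are ignored here.

```python
import re


def json_to_markdown(json_data):
    """Hypothetical JSON-to-Markdown step based on paragraph style names only."""
    lines = []
    for para in json_data.get("paragraphs", []):
        text = para.get("text", "").strip()
        if not text:
            continue
        heading = re.fullmatch(r"Heading (\d)", para.get("style", ""))
        if heading:
            lines.append("#" * int(heading.group(1)) + " " + text)
        elif para.get("list_info"):
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


sample = {
    "paragraphs": [
        {"text": "Title", "style": "Heading 1"},
        {"text": "Body", "style": "Normal"},
        {"text": "item", "style": "List Bullet", "list_info": {"type": "style_based"}},
    ]
}
print(json_to_markdown(sample))
```

The resulting string can be written to a .md file and handed to pandoc on the command line for conversion to the final format.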
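3. Cloud service integration
The tool can be deployed as a web service that exposes an API:
from flask import Flask, request
from docx_converter import docx_to_json

app = Flask(__name__)

@app.route('/convert/docx-to-json', methods=['POST'])
def convert_docx_to_json_api():
    file = request.files['file']
    # python-docx accepts a file-like object as well as a path
    json_data = docx_to_json(file.stream)
    # Recent Flask versions serialize a returned dict as a JSON response
    return json_data
This architecture makes it easy to integrate with other systems.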
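Summary
This article described how a feature-rich, two-way DOCX and JSON converter works and how to use it. With it, document content can be extracted into a structured form and restored again, which covers a wide range of document automation needs.
Compared with existing solutions, the main advantages are:
- Format preservation: styles, tables, images, and other complex elements are carried through the conversion
- Extensibility: the modular design makes new features easy to add
- Free and open source: released under the MIT License, free to use and modify
- Local processing: sensitive data never leaves the local environment
As digitization accelerates, demand for automated document processing will keep growing. The tool gives developers a solid base for building more complex document pipelines, for example by integrating it with AI tooling such as LangChain for intelligent document processing.
Going forward, we plan to keep improving performance, add support for more element types, and explore deeper integration with AI techniques to make document processing smarter and more automated.
Resources
- Full code: the complete code from this article is open source on GitHub
- Sample documents: several test documents that demonstrate conversion results in different scenarios
- Extension modules: community-contributed extensions such as PDF support and OCR integration
We hope this article helps you understand and apply document conversion techniques and improve your productivity and level of automation.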
Full code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
DOCX to JSON and JSON to DOCX converter.
Extracts the styles and content of a DOCX file into a JSON object,
and restores a DOCX file from that JSON object.
Enhanced version: supports more styles, lists, images, table styles, and more.
"""
import base64
import io
import json
import os

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_BREAK
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.table import WD_TABLE_ALIGNMENT
from docx.shared import RGBColor, Pt, Inches
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
def docx_to_json(docx_path):
    """
    Convert a DOCX file to a JSON-serializable dict.
    Enhanced version: supports more style attributes, lists, images, and more.
    """
    document = Document(docx_path)
    # Dictionary that holds everything extracted from the document
    doc_data = {
        "metadata": {
            "created_by": "docx_converter",
            "version": "2.0"
        },
        "styles": {
            "paragraph_styles": [],
            "character_styles": [],
            "table_styles": []
        },
        "paragraphs": [],
        "tables": [],
        "images": [],
        "sections": []
    }
    # 1. Extract all styles
    extract_styles(document, doc_data)
    # 2. Extract paragraph content
    extract_paragraphs(document, doc_data)
    # 3. Extract table content
    extract_tables(document, doc_data)
    # 4. Extract images
    extract_images(document, doc_data)
    # 5. Extract section information
    extract_sections(document, doc_data)
    return doc_data
def extract_styles(document, doc_data):
    """Extract every style defined in the document."""
    for style in document.styles:
        style_info = {
            "name": style.name,
            "type": str(style.type),
            "builtin": style.builtin,
            "hidden": style.hidden,
            "priority": getattr(style, 'priority', None)
        }
        # Font properties: only for style types that expose a font
        if hasattr(style, 'font') and style.font:
            font = style.font
            font_info = {}
            if font.name:
                font_info["name"] = font.name
            if font.size:
                font_info["size"] = font.size.pt
            if font.underline is not None:
                font_info["underline"] = str(font.underline)
            if font.color.rgb is not None:
                font_info["color"] = str(font.color.rgb)
            # Tri-state flags: None means "inherit", so only explicit values are stored
            for prop in ("bold", "italic", "all_caps", "small_caps",
                         "superscript", "subscript", "strike"):
                value = getattr(font, prop)
                if value is not None:
                    font_info[prop] = value
            if font_info:
                style_info["font"] = font_info
        # Paragraph format: paragraph styles only
        if (style.type == WD_STYLE_TYPE.PARAGRAPH
                and hasattr(style, 'paragraph_format') and style.paragraph_format):
            pf_info = extract_paragraph_format(style.paragraph_format)
            if pf_info:
                style_info["paragraph_format"] = pf_info
        # Store the record under its style type
        if style.type == WD_STYLE_TYPE.PARAGRAPH:
            doc_data["styles"]["paragraph_styles"].append(style_info)
        elif style.type == WD_STYLE_TYPE.CHARACTER:
            doc_data["styles"]["character_styles"].append(style_info)
        elif style.type == WD_STYLE_TYPE.TABLE:
            doc_data["styles"]["table_styles"].append(style_info)
def extract_paragraph_format(paragraph_format):
    """Extract paragraph format information."""
    pf_info = {}
    if paragraph_format.alignment is not None:
        pf_info["alignment"] = str(paragraph_format.alignment)
    # Length-valued properties are stored in points
    for prop in ("left_indent", "right_indent", "first_line_indent",
                 "space_before", "space_after"):
        value = getattr(paragraph_format, prop)
        if value:
            pf_info[prop] = value.pt
    # line_spacing is either a float multiple or a Length (a large integer);
    # the <= 100 check keeps only the multiple form
    if paragraph_format.line_spacing and paragraph_format.line_spacing <= 100:
        pf_info["line_spacing"] = paragraph_format.line_spacing
    for prop in ("keep_with_next", "keep_together", "page_break_before", "widow_control"):
        value = getattr(paragraph_format, prop)
        if value is not None:
            pf_info[prop] = value
    if paragraph_format.line_spacing_rule is not None:
        pf_info["line_spacing_rule"] = str(paragraph_format.line_spacing_rule)
    # Tab stops
    try:
        if paragraph_format.tab_stops:
            tab_stops_info = []
            for tab_stop in paragraph_format.tab_stops:
                # Compare with None: the first member of each enum has value 0 and is falsy
                tab_info = {
                    "position": tab_stop.position.pt if tab_stop.position is not None else None,
                    "alignment": str(tab_stop.alignment) if tab_stop.alignment is not None else None,
                    "leader": str(tab_stop.leader) if tab_stop.leader is not None else None
                }
                tab_stops_info.append(tab_info)
            if tab_stops_info:
                pf_info["tab_stops"] = tab_stops_info
    except Exception:
        pass
    return pf_info if pf_info else None
def extract_paragraphs(document, doc_data):
    """Extract the content of every body paragraph."""
    for para_idx, paragraph in enumerate(document.paragraphs):
        para_info = {}
        # Basic text and style
        if paragraph.text.strip():
            para_info["text"] = paragraph.text
        if paragraph.style and paragraph.style.name:
            para_info["style"] = paragraph.style.name
        # Paragraph format
        if paragraph.paragraph_format:
            pf_info = extract_paragraph_format(paragraph.paragraph_format)
            if pf_info:
                para_info["paragraph_format"] = pf_info
        # List properties
        list_info = detect_list_properties(paragraph)
        if list_info:
            para_info["list_info"] = list_info
        # Runs
        runs_list = []
        for run in paragraph.runs:
            run_info = extract_run_properties(run)
            if run_info:
                runs_list.append(run_info)
        # Checkboxes are detected per paragraph and stored as a pseudo-run
        checkbox_info = extract_checkboxes(paragraph)
        if checkbox_info:
            runs_list.append(checkbox_info)
        if runs_list:
            para_info["runs"] = runs_list
        # Only paragraphs with some content are stored
        if para_info:
            doc_data["paragraphs"].append(para_info)
def detect_list_properties(paragraph):
    """Detect list properties of a paragraph."""
    list_info = {}
    try:
        pf = paragraph.paragraph_format
        # The first two branches depend on attributes that python-docx's ParagraphFormat
        # does not appear to expose, so in practice the style-name branch does the work.
        # Bulleted list
        if hasattr(pf, 'bullet_char') and pf.bullet_char is not None:
            list_info['type'] = 'bullet'
            list_info['bullet_char'] = pf.bullet_char
            list_info['level'] = getattr(pf, 'level', 0)
        # Numbered list
        elif hasattr(pf, 'number_format') and pf.number_format is not None:
            list_info['type'] = 'number'
            list_info['number_format'] = str(pf.number_format)
            list_info['level'] = getattr(pf, 'level', 0)
            list_info['start_value'] = getattr(pf, 'start_value', 1)
        # Detect lists by style name
        elif paragraph.style and paragraph.style.name:
            style_name = paragraph.style.name.lower()
            if 'list' in style_name or 'bullet' in style_name:
                list_info['type'] = 'style_based'
                list_info['style_name'] = paragraph.style.name
    except Exception:
        # If detection fails, ignore list properties
        pass
    return list_info if list_info else None
def extract_run_properties(run):
    """Extract the formatting properties of a run."""
    run_info = {}
    if run.text.strip():
        run_info["text"] = run.text
    # Character formatting flags
    font_props = [
        ("bold", run.bold),
        ("italic", run.italic),
        ("underline", run.underline),
        ("strike", run.font.strike),
        ("superscript", run.font.superscript),
        ("subscript", run.font.subscript),
        ("all_caps", run.font.all_caps),
        ("small_caps", run.font.small_caps)
    ]
    for prop_name, prop_value in font_props:
        if prop_value is not None:
            run_info[prop_name] = prop_value
    # Font name and size
    if run.font.name:
        run_info["font_name"] = run.font.name
    if run.font.size:
        run_info["font_size"] = run.font.size.pt
    # Color
    if run.font.color.rgb is not None:
        run_info["color"] = str(run.font.color.rgb)
    # Highlight color
    try:
        if run.font.highlight_color and str(run.font.highlight_color) != 'None':
            run_info["highlight_color"] = str(run.font.highlight_color)
    except Exception:
        pass
    # Underline color and character spacing: python-docx's Font may not provide
    # these attributes, in which case the lookups fail and are skipped
    try:
        if run.font.underline_color and run.font.underline_color.rgb:
            run_info["underline_color"] = str(run.font.underline_color.rgb)
    except Exception:
        pass
    try:
        if run.font.spacing:
            run_info["character_spacing"] = run.font.spacing
    except Exception:
        pass
    # Background color (character shading), read from the underlying XML
    try:
        rpr = run._element.rPr
        if rpr is not None:
            shd_elements = rpr.xpath('.//w:shd')
            if shd_elements:
                fill_color = shd_elements[0].get(qn('w:fill'))
                if fill_color:
                    run_info["background_color"] = fill_color
    except Exception:
        pass
    return run_info if run_info else None
def extract_checkboxes(paragraph):
    """Extract checkbox information (first checkbox found in the paragraph)."""
    try:
        p_element = paragraph._element
        xml_str = p_element.xml
        # Legacy form-field checkbox (w:checkBox)
        if 'w:checkBox' in xml_str:
            if 'w:checked="1"' in xml_str or 'w:checked w:val="true"' in xml_str:
                return {"text": "[✓]", "is_checkbox": True, "checked": True}
            else:
                return {"text": "[□]", "is_checkbox": True, "checked": False}
        # Content-control checkbox (w14:checkbox)
        checkboxes = p_element.xpath('.//*[local-name()="checkbox"]')
        for checkbox in checkboxes:
            checked_elements = checkbox.xpath('.//*[local-name()="checked"]')
            if checked_elements:
                checked_element = checked_elements[0]
                checked_value = "false"
                for attr_name in ['{http://schemas.microsoft.com/office/word/2010/wordml}val',
                                  'w14:val']:
                    val = checked_element.get(attr_name)
                    if val is not None:
                        checked_value = val
                        break
                is_checked = checked_value.lower() == "true" or checked_value == "1"
                return {
                    "text": "[✓]" if is_checked else "[□]",
                    "is_checkbox": True,
                    "checked": is_checked
                }
    except Exception:
        pass
    return None
def extract_tables(document, doc_data):
    """Extract table content and styles."""
    for table_idx, table in enumerate(document.tables):
        table_info = {
            "index": table_idx,
            "rows": []
        }
        # Table style
        if hasattr(table, 'style') and table.style:
            table_info["style"] = table.style.name
        # Table alignment
        if hasattr(table, 'alignment'):
            table_info["alignment"] = str(table.alignment)
        # Rows and cells
        for row_idx, row in enumerate(table.rows):
            row_info = {
                "index": row_idx,
                "cells": [],
                "height": getattr(row, 'height', None)
            }
            for cell_idx, cell in enumerate(row.cells):
                cell_info = extract_cell_content(cell, row_idx, cell_idx)
                if cell_info:
                    row_info["cells"].append(cell_info)
            if row_info["cells"]:
                table_info["rows"].append(row_info)
        if table_info["rows"]:
            doc_data["tables"].append(table_info)
def extract_cell_content(cell, row_idx, cell_idx):
    """Extract the content and formatting of one table cell."""
    cell_info = {
        "row": row_idx,
        "column": cell_idx,
        "text": cell.text
    }
    # Cell formatting (the hasattr guards skip attributes the cell object lacks)
    try:
        # Shading
        if hasattr(cell, 'shading'):
            shading = cell.shading
            if hasattr(shading, 'background_pattern_color'):
                cell_info["shading"] = str(shading.background_pattern_color)
        # Vertical alignment
        if hasattr(cell, 'vertical_alignment') and cell.vertical_alignment is not None:
            cell_info["vertical_alignment"] = str(cell.vertical_alignment)
        # Margins
        for prop in ("top_margin", "bottom_margin", "left_margin", "right_margin"):
            if hasattr(cell, prop) and getattr(cell, prop) is not None:
                cell_info[prop] = getattr(cell, prop).pt
        # Cell borders, read from the underlying XML
        tc_pr = cell._tc.tcPr
        if tc_pr is not None:
            tc_borders = tc_pr.xpath('./w:tcBorders')
            if tc_borders:
                borders_info = {}
                for border_elem in tc_borders[0].xpath('./*'):
                    border_tag = border_elem.tag.split('}')[1]  # local tag name
                    border_attrs = {}
                    for attr, value in border_elem.attrib.items():
                        attr_name = attr.split('}')[1] if '}' in attr else attr
                        border_attrs[attr_name] = value
                    borders_info[border_tag] = border_attrs
                if borders_info:
                    cell_info["borders"] = borders_info
    except Exception:
        pass
    # Paragraphs inside the cell
    paragraphs_list = []
    for para in cell.paragraphs:
        if para.text.strip():
            para_dict = {
                "text": para.text
            }
            if para.style and para.style.name:
                para_dict["style"] = para.style.name
            # Paragraph format
            if para.paragraph_format:
                pf_info = extract_paragraph_format(para.paragraph_format)
                if pf_info:
                    para_dict["paragraph_format"] = pf_info
            # Runs
            runs_list = []
            for run in para.runs:
                run_info = extract_run_properties(run)
                if run_info:
                    runs_list.append(run_info)
            if runs_list:
                para_dict["runs"] = runs_list
            paragraphs_list.append(para_dict)
    if paragraphs_list:
        cell_info["paragraphs"] = paragraphs_list
    return cell_info
def extract_images(document, doc_data):
    """Extract the images embedded in the document."""
    try:
        # Images are found through the document part's relationships
        for rel in document.part.rels.values():
            if "image" in rel.reltype:
                image_part = rel.target_part
                image_info = {
                    "content_type": image_part.content_type,
                    "data": base64.b64encode(image_part.blob).decode('utf-8'),
                    "filename": getattr(image_part, 'filename', 'image.png')
                }
                doc_data["images"].append(image_info)
    except Exception as e:
        print(f"Error while extracting images: {e}")
def extract_sections(document, doc_data):
    """Extract section information (page size and margins, in points)."""
    for section_idx, section in enumerate(document.sections):
        section_info = {
            "index": section_idx,
            "page_width": section.page_width.pt if section.page_width else None,
            "page_height": section.page_height.pt if section.page_height else None,
            "left_margin": section.left_margin.pt if section.left_margin else None,
            "right_margin": section.right_margin.pt if section.right_margin else None,
            "top_margin": section.top_margin.pt if section.top_margin else None,
            "bottom_margin": section.bottom_margin.pt if section.bottom_margin else None
        }
        doc_data["sections"].append(section_info)
def json_to_docx(json_data, output_path):
    """
    Convert JSON data to a DOCX file.
    Enhanced version: supports more styles and elements.
    Note: paragraphs, tables, and images are written in that order,
    not interleaved as in the source document.
    """
    document = Document()
    # 1. Set document properties
    setup_document_properties(document, json_data)
    # 2. Add paragraphs
    create_paragraphs(document, json_data)
    # 3. Add tables
    create_tables(document, json_data)
    # 4. Add images
    create_images(document, json_data)
    # Save the document
    document.save(output_path)
def setup_document_properties(document, json_data):
    """Apply the page layout of the first stored section."""
    if json_data.get("sections"):
        section = document.sections[0]
        first_section = json_data["sections"][0]
        for prop in ("page_width", "page_height", "left_margin",
                     "right_margin", "top_margin", "bottom_margin"):
            if first_section.get(prop):
                setattr(section, prop, Pt(first_section[prop]))
def create_paragraphs(document, json_data):
    """Create the paragraphs."""
    for para_data in json_data.get("paragraphs", []):
        # Create the paragraph, falling back to "Normal" if the style is unknown
        style_name = para_data.get("style", "Normal")
        try:
            paragraph = document.add_paragraph(style=style_name)
        except Exception:
            paragraph = document.add_paragraph(style="Normal")
        # Paragraph format
        apply_paragraph_formatting(paragraph, para_data)
        # List handling
        apply_list_formatting(paragraph, para_data)
        # Clear any default text
        paragraph.clear()
        # Add the runs
        create_runs(paragraph, para_data)
def apply_paragraph_formatting(paragraph, para_data):
    """Apply paragraph-level formatting."""
    from docx.enum.text import WD_TAB_ALIGNMENT, WD_TAB_LEADER
    paragraph_format_data = para_data.get("paragraph_format", {})
    if not paragraph_format_data:
        return
    pf = paragraph.paragraph_format
    # Alignment was stored as str(enum member), so match on the member name
    alignment_str = paragraph_format_data.get("alignment")
    if alignment_str:
        for name in ("LEFT", "CENTER", "RIGHT", "JUSTIFY", "DISTRIBUTE"):
            if name in alignment_str:
                paragraph.alignment = getattr(WD_ALIGN_PARAGRAPH, name)
                break
    # Indents and spacing (stored in points)
    for prop in ("left_indent", "right_indent", "first_line_indent",
                 "space_before", "space_after"):
        if prop in paragraph_format_data:
            setattr(pf, prop, Pt(paragraph_format_data[prop]))
    if "line_spacing" in paragraph_format_data and paragraph_format_data["line_spacing"] <= 100:
        pf.line_spacing = paragraph_format_data["line_spacing"]
    # Tab stops
    if "tab_stops" in paragraph_format_data:
        try:
            tab_stops = pf.tab_stops
            # Remove existing custom tab stops (TabStops has no pop(); use clear_all())
            tab_stops.clear_all()
            for tab_info in paragraph_format_data["tab_stops"]:
                if not tab_info.get("position"):
                    continue
                # Fall back to the library defaults when nothing was stored
                alignment = WD_TAB_ALIGNMENT.LEFT
                for name in ("LEFT", "RIGHT", "CENTER"):
                    if name in (tab_info.get("alignment") or ""):
                        alignment = getattr(WD_TAB_ALIGNMENT, name)
                        break
                # WD_TAB_LEADER has no HYPHENS or UNDERSCORE members; DASHES and LINES
                # are used here as the closest equivalents
                leader = WD_TAB_LEADER.SPACES
                for name in ("DOTS", "DASHES", "LINES"):
                    if name in (tab_info.get("leader") or ""):
                        leader = getattr(WD_TAB_LEADER, name)
                        break
                tab_stops.add_tab_stop(Pt(tab_info["position"]), alignment, leader)
        except Exception:
            pass
def apply_list_formatting(paragraph, para_data):
    """Apply list formatting (approximated by indenting 36 pt per level)."""
    list_info = para_data.get("list_info")
    if list_info:
        try:
            pf = paragraph.paragraph_format
            if list_info.get("type") in ("bullet", "number") and list_info.get("level") is not None:
                pf.left_indent = Pt(list_info.get("level", 0) * 36)
        except Exception as e:
            print(f"Error while applying list formatting: {e}")
def create_runs(paragraph, para_data):
    """Create the runs of a paragraph."""
    runs_data = para_data.get("runs", [])
    if runs_data:
        for run_data in runs_data:
            text = run_data.get("text", "")
            # Skip runs that carry neither text nor formatting
            has_content = any([
                text,
                run_data.get("bold") is not None,
                run_data.get("italic") is not None,
                run_data.get("underline") is not None,
                run_data.get("font_name"),
                run_data.get("font_size"),
                run_data.get("color"),
                run_data.get("highlight_color")
            ])
            if has_content:
                run = paragraph.add_run(text)
                apply_run_formatting(run, run_data)
    else:
        # No run data: add the paragraph text directly
        text = para_data.get("text", "")
        if text:
            run = paragraph.add_run(text)
            run.font.size = Pt(12)
def apply_run_formatting(run, run_data):
    """Apply character formatting to a run."""
    # Basic flags. They are set on run.font, because strike, superscript,
    # subscript, all_caps and small_caps live on Font rather than on Run.
    for prop in ("bold", "italic", "underline", "strike",
                 "superscript", "subscript", "all_caps", "small_caps"):
        if prop in run_data:
            setattr(run.font, prop, run_data[prop])
    # Font size (defaults to 12 pt when none was stored)
    if "font_size" in run_data:
        run.font.size = Pt(run_data["font_size"])
    else:
        run.font.size = Pt(12)
    # Font name, also set for East Asian text
    if "font_name" in run_data:
        run.font.name = run_data["font_name"]
        try:
            run._element.rPr.rFonts.set(qn('w:eastAsia'), run_data["font_name"])
        except Exception:
            pass
    # Font color: either "rgb(r,g,b)" or a hex string such as "FF0000"
    if "color" in run_data and run_data["color"] != "None":
        try:
            if run_data["color"].startswith("rgb"):
                color_str = run_data["color"][4:-1]  # strip "rgb(" and ")"
                r, g, b = map(int, color_str.split(","))
                run.font.color.rgb = RGBColor(r, g, b)
            else:
                run.font.color.rgb = RGBColor.from_string(run_data["color"])
        except Exception:
            pass
    # Character spacing
    if "character_spacing" in run_data:
        try:
            run.font.spacing = run_data["character_spacing"]
        except Exception:
            pass
    # Background color (character shading), written to the underlying XML
    if "background_color" in run_data:
        try:
            rpr = run._element.get_or_add_rPr()
            shd = OxmlElement('w:shd')
            shd.set(qn('w:val'), 'clear')
            shd.set(qn('w:color'), 'auto')
            shd.set(qn('w:fill'), run_data["background_color"])
            rpr.append(shd)
        except Exception:
            pass
def create_tables(document, json_data):
    """Create the tables."""
    for table_data in json_data.get("tables", []):
        if not table_data.get("rows"):
            continue
        # Grid size: number of rows, and the widest row
        num_rows = len(table_data["rows"])
        num_cols = max(len(row.get("cells", [])) for row in table_data["rows"])
        if num_rows > 0 and num_cols > 0:
            table = document.add_table(rows=num_rows, cols=num_cols)
            # Table style
            if "style" in table_data:
                try:
                    table.style = table_data["style"]
                except Exception:
                    pass
            # Fill in the cells
            for i, row_data in enumerate(table_data["rows"]):
                for j, cell_data in enumerate(row_data.get("cells", [])):
                    cell = table.cell(i, j)
                    populate_cell_content(cell, cell_data)
def populate_cell_content(cell, cell_data):
    """Fill one table cell with content and formatting."""
    # Remove the default empty paragraph
    for paragraph in cell.paragraphs:
        p = paragraph._element
        p.getparent().remove(p)
    # Add paragraph content
    if "paragraphs" in cell_data:
        for para_data in cell_data["paragraphs"]:
            para = cell.add_paragraph()
            # Paragraph style
            if "style" in para_data:
                try:
                    para.style = para_data["style"]
                except Exception:
                    pass
            # Runs
            if "runs" in para_data:
                for run_data in para_data["runs"]:
                    run = para.add_run(run_data.get("text", ""))
                    apply_run_formatting(run, run_data)
            else:
                # No runs: add the text directly
                text = para_data.get("text", "")
                if text:
                    run = para.add_run(text)
                    run.font.size = Pt(12)
    else:
        # No paragraph records: add the cell text directly
        text = cell_data.get("text", "")
        if text:
            para = cell.add_paragraph()
            run = para.add_run(text)
            run.font.size = Pt(12)
    # Cell formatting
    try:
        # Vertical alignment was stored as str(enum member); match on the member name
        if "vertical_alignment" in cell_data:
            from docx.enum.table import WD_ALIGN_VERTICAL
            alignment_str = cell_data["vertical_alignment"]
            for name in ("TOP", "CENTER", "BOTTOM"):
                if name in alignment_str:
                    cell.vertical_alignment = getattr(WD_ALIGN_VERTICAL, name)
                    break
        # Margins
        for prop in ("top_margin", "bottom_margin", "left_margin", "right_margin"):
            if prop in cell_data:
                setattr(cell, prop, Pt(cell_data[prop]))
        # Borders
        if "borders" in cell_data:
            set_cell_border(cell, cell_data["borders"])
    except Exception as e:
        print(f"Error while applying cell formatting: {e}")
def create_images(document, json_data):
    """Add the stored images, each in its own paragraph at a fixed 2-inch width."""
    for image_data in json_data.get("images", []):
        try:
            image_bytes = base64.b64decode(image_data["data"])
            image_io = io.BytesIO(image_bytes)
            # Add the image to the document
            paragraph = document.add_paragraph()
            run = paragraph.add_run()
            run.add_picture(image_io, width=Inches(2.0))
        except Exception as e:
            print(f"Error while adding image: {e}")
def find_all_checkboxes(docx_path):
    """Find all checkboxes in a document (enhanced version)."""
    doc = Document(docx_path)
    results = {
        'unchecked': [],
        'checked': [],
        'locations': [],
        'form_controls': []
    }
    print("=== Searching for checkboxes ===")
    # Body paragraphs
    for para_idx, paragraph in enumerate(doc.paragraphs):
        find_checkboxes_in_paragraph(paragraph, f"paragraph {para_idx + 1}", results)
    # Tables
    for table_idx, table in enumerate(doc.tables):
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                for para_idx, paragraph in enumerate(cell.paragraphs):
                    location = (f"table {table_idx + 1} row {row_idx + 1} "
                                f"column {cell_idx + 1} paragraph {para_idx + 1}")
                    find_checkboxes_in_paragraph(paragraph, location, results)
    # Headers and footers
    for section_idx, section in enumerate(doc.sections):
        for para_idx, paragraph in enumerate(section.header.paragraphs):
            find_checkboxes_in_paragraph(
                paragraph, f"section {section_idx + 1} header paragraph {para_idx + 1}", results)
        for para_idx, paragraph in enumerate(section.footer.paragraphs):
            find_checkboxes_in_paragraph(
                paragraph, f"section {section_idx + 1} footer paragraph {para_idx + 1}", results)
    # Report the totals
    print("\n=== Summary ===")
    print(f"Unchecked checkboxes: {len(results['unchecked'])}")
    print(f"Checked checkboxes: {len(results['checked'])}")
    print(f"Form controls: {len(results['form_controls'])}")
    return results
def find_checkboxes_in_paragraph(paragraph, location, results):
    """Find checkboxes in one paragraph and record them in results."""
    try:
        p_element = paragraph._element
        xml_str = p_element.xml
        # Form-control checkboxes (legacy w:checkBox or w14:checkbox)
        if 'w:checkBox' in xml_str or 'w14:checkbox' in xml_str:
            is_checked = any(marker in xml_str for marker in
                             ['w:checked="1"', 'w:checked w:val="true"', 'w:checked w:val="1"'])
            checkbox_info = {
                'location': location,
                'text': paragraph.text,
                'checked': is_checked,
                'type': 'form_control'
            }
            if is_checked:
                results['checked'].append(checkbox_info)
            else:
                results['unchecked'].append(checkbox_info)
            results['form_controls'].append(checkbox_info)
            print(f"[form control] {location}: {'checked' if is_checked else 'unchecked'}")
        # Simulated checkboxes (text symbols)
        checkbox_symbols = {
            'unchecked': ['□', '☐', '[ ]', '()', '○'],
            'checked': ['✓', '✔', '[x]', '[X]', '[√]', '(x)', '(X)']
        }
        for state in ('unchecked', 'checked'):
            for symbol in checkbox_symbols[state]:
                if symbol in paragraph.text:
                    results[state].append({
                        'location': location,
                        'text': paragraph.text,
                        'symbol': symbol,
                        'type': 'text_symbol'
                    })
                    print(f"[text symbol] {location}: {state} '{symbol}'")
    except Exception as e:
        print(f"Error while checking {location}: {e}")
def set_cell_border(cell, borders_data):
    """Set the borders of a table cell from stored attribute dictionaries."""
    try:
        tc_pr = cell._tc.get_or_add_tcPr()
        # Get or create the tcBorders element
        tc_borders = tc_pr.first_child_found_in("w:tcBorders")
        if tc_borders is None:
            tc_borders = OxmlElement('w:tcBorders')
            tc_pr.append(tc_borders)
        # Apply each stored border
        for border_name, border_attrs in borders_data.items():
            # Create the border element if it does not exist yet
            element = tc_borders.find(qn('w:%s' % border_name))
            if element is None:
                element = OxmlElement('w:%s' % border_name)
                tc_borders.append(element)
            # Set the border attributes
            for attr_name, attr_value in border_attrs.items():
                element.set(qn('w:%s' % attr_name), str(attr_value))
    except Exception as e:
        print(f"Error while setting cell borders: {e}")
def main():
    """Interactive entry point."""
    print("DOCX Converter Enhanced v2.0")
    print("1. Convert DOCX to JSON")
    print("2. Convert JSON to DOCX")
    print("3. Find checkboxes in DOCX")
    print("4. Batch convert folder")
    choice = input("Choose an operation (1/2/3/4): ")
    if choice == "1":
        docx_path = input("Enter the DOCX file path: ")
        if not os.path.exists(docx_path):
            print("File not found!")
            return
        json_data = docx_to_json(docx_path)
        json_path = docx_path.replace(".docx", "_enhanced.json")
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"Done! JSON file saved as: {json_path}")
    elif choice == "2":
        json_path = input("Enter the JSON file path: ")
        if not os.path.exists(json_path):
            print("File not found!")
            return
        with open(json_path, "r", encoding="utf-8") as f:
            json_data = json.load(f)
        output_path = json_path.replace(".json", "_restored.docx")
        json_to_docx(json_data, output_path)
        print(f"Done! DOCX file saved as: {output_path}")
    elif choice == "3":
        docx_path = input("Enter the DOCX file path: ")
        if not os.path.exists(docx_path):
            print("File not found!")
            return
        find_all_checkboxes(docx_path)
        print("\nCheckbox search finished!")
    elif choice == "4":
        folder_path = input("Enter the folder path: ")
        if not os.path.exists(folder_path):
            print("Folder not found!")
            return
        # Batch conversion
        for filename in os.listdir(folder_path):
            if filename.endswith(".docx"):
                docx_path = os.path.join(folder_path, filename)
                print(f"Processing: {filename}")
                try:
                    json_data = docx_to_json(docx_path)
                    json_path = os.path.join(folder_path, filename.replace(".docx", ".json"))
                    with open(json_path, "w", encoding="utf-8") as f:
                        json.dump(json_data, f, ensure_ascii=False, indent=2)
                    print(f"Converted: {filename}")
                except Exception as e:
                    print(f"Conversion failed for {filename}: {e}")
        print("Batch conversion finished!")
    else:
        print("Invalid choice!")

if __name__ == "__main__":
    main()