Python实现word文档内容智能提取以及合成_Python

如何从10个左右的docx文档中抽取内容，生成新的文档，抽取内容包括源文档的文字内容、图片、表格、公式等，以及目标文档的样式排版、字体、格式，还有目标文档的语言风格、用词规范、文法习惯等等。这是一个相当复杂的需求，因为它不仅涉及内容提取，还涉及深度格式化和风格模仿。完全自动化的完美解决方案难度极高，特别是对于复杂的公式和微妙的语言风格。

一个务实的方案是采用自动化 + 人工辅助的混合策略。以下是详细的思路、技术路径、方法和步骤：

核心思路

内容提取 (自动化为主): 使用编程方式从源 docx 文件中提取所需的核心内容（文字、图片、表格、公式的某种表示）。

样式应用 (自动化): 基于一个定义了目标样式、排版、字体等的模板文档，将提取的内容插入新文档，并应用模板中定义的样式。

语言风格调整 (自动化辅助 + 人工): 利用大型语言模型 (llm) 或自然语言处理 (nlp) 技术对提取的文本进行初步的风格、用词和文法调整，然后进行人工审阅和精修。

复杂元素处理 (人工为主): 对于难以自动处理的元素（如复杂公式、特定排版），进行人工调整。

技术路径

主要工具: python 编程语言

核心库:

python-docx: 用于读取和写入 docx 文件（文本、表格、图片、基本样式应用）。
(可选) 用于公式处理: 可能需要解析 docx 的底层 xml (ooxml)，或者寻找专门处理 mathml/omml 的库（这部分比较困难），或者将公式提取为图片。
(可选) 用于图片处理: pillow (pil fork) 可能需要用于处理图片。
(可选) 用于语言风格调整: 调用大型语言模型 api (如 openai gpt 系列、google gemini、或其他类似服务)。

辅助工具:

microsoft word: 用于创建模板文档、最终审阅和调整。
xml 编辑器 (可选): 用于深入分析 docx 内部结构（特别是公式）。

实现步骤

阶段一：准备工作

1.创建目标模板文档 (template.docx):

在 word 中创建一个新文档。
定义样式: 精心设置所有需要的样式（标题 1、标题 2、正文、引用、列表、表格样式等），包括字体、字号、颜色、段落间距、缩进等。确保样式名称清晰易懂（例如 targetheading1, targetbodytext, targettablestyle）。
设置页面布局: 页边距、纸张大小、页眉页脚等。
保存: 将此文档保存为 template.docx。这将是所有新生成文档的基础。

2.明确提取规则:

关键: 你需要非常清楚地定义哪些内容需要从每个源文档中提取出来。规则可以基于：

特定标题: “提取 ‘第三章方法’ 下的所有内容”。
特定样式: “提取所有应用了 ‘源文档重点’ 样式的内容”。
关键词/标记: “提取包含 ‘[extract]’ 标记的段落”。
结构位置: “提取每个文档的第二个表格”。
人工指定: (最灵活但最慢) 手动在源文档中标记要提取的内容（例如使用 word 的批注功能或特定高亮颜色），然后让脚本识别这些标记。
文档化规则: 将这些规则清晰地记录下来，以便编写脚本。

3.设置开发环境:

安装 python。

使用 pip 安装必要的库:

pip install python-docx pillow requests # 如果需要调用 llm api
# 可能需要其他库，取决于具体实现

(可选) 获取 llm api 密钥。

阶段二：内容提取 (python 脚本)

import os
from docx import document
from docx.shared import inches
# 可能需要导入其他模块，如处理 xml 或调用 api

# --- 配置 ---
source_docs_dir = 'files/transform/docx/source_documents'
target_template = 'files/transform/docx/template.docx'
output_doc_path = 'files/transform/docx/generated_document.docx'
extraction_rules = { # 示例规则，需要根据你的实际情况修改
    'source_doc_1.docx': {'heading_start': 'chapter 3', 'heading_end': 'chapter 4'},
    'source_doc_2.docx': {'style_name': 'sourcehighlight'},
    # ... 其他文档的规则
}

# --- 辅助函数 (示例) ---
def should_extract_paragraph(paragraph, rules):
    # 实现基于规则判断段落是否应该提取的逻辑
    # 例如：检查段落文本是否匹配、样式是否匹配等
    # 返回 true 或 false
    # (这部分逻辑需要根据你的具体规则编写)
    style_name = paragraph.style.name
    text = paragraph.text.strip()
    # 示例：基于样式的简单规则
    if 'style_name' in rules and style_name == rules['style_name']:
        return true
    # 示例：基于起始标题的简单规则（需要状态管理）
    # if 'heading_start' in rules ... (需要更复杂的逻辑来跟踪当前章节)
    return false # 默认不提取

def extract_content_from_doc(source_path, rules):
    """从单个源文档提取内容"""
    extracted_elements = []
    try:
        source_doc = document(source_path)
        # 标记是否处于提取区域（例如，在特定章节之间）
        in_extraction_zone = false # 需要根据规则调整初始状态

        for element in source_doc.element.body:
            # 处理不同类型的元素：段落、表格等
            if element.tag.endswith('p'): # 是段落
                paragraph = docx.text.paragraph.paragraph(element, source_doc)

                # --- 核心提取逻辑 ---
                # 这里需要根据你的 extraction_rules 实现复杂的判断逻辑
                # 例如，判断是否遇到起始标题，是否遇到结束标题，段落样式是否匹配等
                # 这是一个简化的示例，实际可能需要更精细的状态管理
                if 'heading_start' in rules and paragraph.style.name.startswith('heading') and rules['heading_start'] in paragraph.text:
                    in_extraction_zone = true
                    continue # 不提取起始标题本身？看需求
                if 'heading_end' in rules and paragraph.style.name.startswith('heading') and rules['heading_end'] in paragraph.text:
                    in_extraction_zone = false
                    continue # 到达结束标题，停止提取

                if in_extraction_zone or should_extract_paragraph(paragraph, rules):
                     # 提取文本内容
                    text_content = paragraph.text
                     # 尝试提取基本格式（粗体、斜体） - 比较复杂，可能需要遍历 runs
                    # todo: 提取图片 (需要检查段落中的 inline_shapes 或 runs 中的 drawing)
                    # todo: 提取公式 (极具挑战性，见下文讨论)
                    extracted_elements.append({'type': 'paragraph', 'text': text_content, 'style': paragraph.style.name}) # 可以携带源样式名供参考

            elif element.tag.endswith('tbl'): # 是表格
                table = docx.table.table(element, source_doc)
                # --- 提取表格 ---
                # todo: 实现表格提取逻辑，可能需要检查是否在提取区域内
                # if in_extraction_zone or table_should_be_extracted(table, rules):
                table_data = []
                for row in table.rows:
                    row_data = [cell.text for cell in row.cells]
                    table_data.append(row_data)
                extracted_elements.append({'type': 'table', 'data': table_data})

            # --- 处理图片 ---
            # 查找段落内的图片 (inline_shapes)
            # paragraph = docx.text.paragraph.paragraph(element, source_doc) # re-get paragraph object if needed
            # for run in paragraph.runs:
            #     if run.element.xpath('.//wp:inline | .//wp:anchor'): # check for drawings
            #         # this part is complex: need to get image data (rid) and relate it back
            #         # to the actual image part in the docx package.
            #         # python-docx can extract images, but associating them perfectly
            #         # with their original position during extraction requires care.
            #         # placeholder:
            #         # image_data = get_image_data(run, source_doc)
            #         # if image_data:
            #         #    extracted_elements.append({'type': 'image', 'data': image_data, 'filename': f'img_{len(extracted_elements)}.png'})
            pass # placeholder for image extraction logic

    except exception as e:
        print(f"error processing {source_path}: {e}")
    return extracted_elements

# --- 主流程 ---
all_extracted_content = []
source_files = [f for f in os.listdir(source_docs_dir) if f.endswith('.docx')]

for filename in source_files:
    source_path = os.path.join(source_docs_dir, filename)
    rules = extraction_rules.get(filename, {}) # 获取该文件的提取规则
    if rules: # 只处理定义了规则的文件
        print(f"extracting from: {filename}")
        content = extract_content_from_doc(source_path, rules)
        all_extracted_content.extend(content)
    else:
        print(f"skipping {filename}, no rules defined.")

print(f"total elements extracted: {len(all_extracted_content)}")

阶段三：语言风格调整 (可选, python + llm api)

# --- ---
import requests
import json

# --- 配置 llm ---
llm_api_url = "your_llm_api_endpoint" # e.g., openai api url
llm_api_key = "your_llm_api_key"
llm_prompt_template = """
请根据以下要求，改写这段文字：
目标语言风格：[在此处详细描述，例如：正式、客观、简洁]
用词规范：[在此处列出规范，例如：使用“用户”而非“客户”，避免使用缩写]
文法习惯：[在此处描述，例如：多使用主动语态，句子长度适中]
目标受众：[描述目标读者]

原文：
"{text}"

改写后的文字：
"""

def adapt_text_style(text):
    """使用 llm api 调整文本风格"""
    if not text.strip():
        return text # 跳过空文本

    prompt = llm_prompt_template.format(text=text)
    headers = {
        "authorization": f"bearer {llm_api_key}",
        "content-type": "application/json",
    }
    data = {
        "model": "gpt-4", # 或你使用的模型
        "prompt": prompt,
        "max_tokens": 1024, # 根据需要调整
        "temperature": 0.5, # 控制创造性，较低值更保守
    }
    try:
        response = requests.post(llm_api_url, headers=headers, json=data)
        response.raise_for_status() # 检查 http 错误
        result = response.json()
        # 解析 llm 返回的结果，注意不同 api 的格式可能不同
        rewritten_text = result['choices'][0]['text'].strip() # 示例路径
        print(f"original: {text[:50]}... | rewritten: {rewritten_text[:50]}...")
        return rewritten_text
    except requests.exceptions.requestexception as e:
        print(f"error calling llm api: {e}")
        return text # 出错时返回原文
    except (keyerror, indexerror) as e:
        print(f"error parsing llm response: {e} - response: {response.text}")
        return text # 出错时返回原文

# --- 应用风格调整 ---
adjusted_content = []
for element in all_extracted_content:
    if element['type'] == 'paragraph':
        # --- 调用 llm api ---
        # adjusted_text = adapt_text_style(element['text'])
        # element['text'] = adjusted_text # 更新文本
        # --- 或者先不调用，等生成后再处理 ---
        adjusted_content.append(element)
    elif element['type'] == 'table':
         # 表格内容也可以逐个单元格处理，但可能效果不佳或成本高
         # 更好的方法可能是将表格内容整理成文本描述给 llm，或者人工处理
         adjusted_content.append(element)
    elif element['type'] == 'image':
         # 图片无法直接处理
         adjusted_content.append(element)
    # 处理其他类型...

# --- (接续到下一阶段：文档生成) ---

阶段四：生成目标文档 (python 脚本)

# --- (续上) ---

# --- 创建目标文档 (基于模板) ---
try:
    target_doc = document(target_template)
except exception as e:
    print(f"error loading template {target_template}: {e}")
    # 可以考虑创建一个空文档作为后备
    # target_doc = document()
    exit()


# --- 填充内容并应用样式 ---
for element in adjusted_content: # 使用调整后的内容，或者原始提取内容
    if element['type'] == 'paragraph':
        text = element['text']
        # --- 核心：应用模板中定义的样式 ---
        # 简单方式：所有段落应用默认正文样式
        # target_doc.add_paragraph(text, style='targetbodytext') # 假设模板中有此样式

        # 复杂方式：根据源文档信息或内容判断应用哪个目标样式
        # 示例：如果源样式是 heading 1，应用 targetheading1
        source_style = element.get('style', '') # 获取源样式名（如果提取时保存了）
        if source_style.startswith('heading 1'):
             target_doc.add_paragraph(text, style='targetheading1') # 假设模板中有此样式
        elif source_style.startswith('heading 2'):
             target_doc.add_paragraph(text, style='targetheading2')
        # ... 其他样式映射规则
        else:
             target_doc.add_paragraph(text, style='targetbodytext') # 默认样式

    elif element['type'] == 'table':
        table_data = element['data']
        if table_data:
            # 创建表格
            num_rows = len(table_data)
            num_cols = len(table_data[0]) if num_rows > 0 else 0
            if num_rows > 0 and num_cols > 0:
                # --- 应用模板中定义的表格样式 ---
                table = target_doc.add_table(rows=num_rows, cols=num_cols, style='targettablestyle') # 假设模板中有此表格样式
                # 填充数据
                for i, row_data in enumerate(table_data):
                    for j, cell_text in enumerate(row_data):
                        # 防止列数不匹配错误
                        if j < len(table.rows[i].cells):
                            table.rows[i].cells[j].text = cell_text
                # 可以添加更多表格格式化代码，如设置列宽等

    elif element['type'] == 'image':
        # --- 添加图片 ---
        # image_data = element['data']
        # image_filename = element['filename']
        # # 需要将 image_data 保存为临时文件或使用 bytesio
        # from io import bytesio
        # image_stream = bytesio(image_data)
        # try:
        #    target_doc.add_picture(image_stream, width=inches(4.0)) # 调整宽度
        # except exception as e:
        #    print(f"error adding image {image_filename}: {e}")
        pass # placeholder for image insertion

    # --- 处理公式 (挑战) ---
    # 如果公式被提取为图片:
    #   elif element['type'] == 'formula_image':
    #       # 添加图片...
    # 如果公式被提取为 mathml/omml (xml 字符串):
    #   elif element['type'] == 'formula_mathml':
    #       # 使用 python-docx 直接插入 mathml 很困难
    #       # 可能需要直接操作 ooxml (非常复杂)
    #       # 或者，在段落中插入一个占位符 "[formula]"，然后手动替换
    #       target_doc.add_paragraph(f"[formula: {element['id']}]", style='targetbodytext')
    # 如果公式被提取为纯文本近似值:
    #   elif element['type'] == 'formula_text':
    #       target_doc.add_paragraph(element['text'], style='formulastyle') # 可能需要特殊样式

# --- 保存最终文档 ---
try:
    target_doc.save(output_doc_path)
    print(f"document successfully generated: {output_doc_path}")
except exception as e:
    print(f"error saving document: {e}")

阶段五：人工审阅与精修

1.打开生成的文档 (generated_document.docx)。

2.检查整体结构和内容完整性: 是否所有需要的内容都被提取并放置在正确的位置？

3.检查样式和格式:

所有文本是否应用了正确的模板样式？
字体、字号、间距是否符合要求？
表格样式是否正确？列宽、对齐是否需要调整？
图片位置和大小是否合适？

4.检查语言风格和规范:

通读文本，检查语气、用词是否符合目标要求。
修正 llm 可能产生的错误或不自然的表达。
确保术语统一。
进行拼写和语法检查。

5.处理复杂元素:

公式: 这是最可能需要手动操作的地方。如果脚本插入了占位符，你需要手动将源文档中的公式复制粘贴过来，或者使用 word 的公式编辑器重新创建它们。确保公式的编号和引用正确。

特殊排版: 检查是否有需要特殊布局（如图文混排、分栏等）的地方，并手动调整。

6.最终定稿: 保存修改后的文档。

关于公式处理的挑战与策略

难点: docx 中的公式通常使用 omml (office math markup language) 存储，嵌套在复杂的 xml 结构中。python-docx 对此支持有限。

策略:

提取为图片 (最可行): 尝试在提取阶段将公式渲染或截图为图片。这会丢失编辑能力，但能保证视觉效果。实现起来也有难度，可能需要借助其他工具或库（如 docx2python 库可能提供一些帮助，或者需要分析 ooxml 找到图片表示）。
提取为 mathml/omml (复杂): 解析 ooxml，提取公式的 xml 片段。但 python-docx 无法直接将这些 xml 重新插入并渲染为公式。需要非常底层的 ooxml 操作。
提取为近似文本 (简单但损失精度): python-docx 读取包含公式的段落 text 属性时，有时会得到一个纯文本的近似表示。这对于简单公式可能够用，但复杂公式会完全失真。
手动处理 (最可靠): 在脚本中识别出公式位置，插入占位符，然后在人工审阅阶段手动复制/创建公式。

总结

这是一个多阶段、结合自动化和人工的过程。

自动化强项: 重复性的内容提取、基于模板的样式应用、初步的文本风格转换（使用 llm）。

人工介入点: 定义精确的提取规则、处理复杂公式、精调语言风格和术语、最终的格式微调和质量检查。

投入时间最多的部分将是编写和调试提取逻辑以及最终的人工审阅和修正。务必从少量文档和简单规则开始，逐步迭代和完善你的脚本。

以上就是python实现word文档内容智能提取以及合成的详细内容，更多关于python word文档内容提取与合成的资料请关注代码网其它相关文章！

Python实现word文档内容智能提取以及合成

2025年04月20日 • Python •我要评论

核心思路

技术路径

实现步骤

阶段一：准备工作

阶段二：内容提取 (python 脚本)

阶段三：语言风格调整 (可选, python + llm api)

阶段四：生成目标文档 (python 脚本)

阶段五：人工审阅与精修

关于公式处理的挑战与策略

总结

相关文章:

Python日志模块Logging使用指北(最新推荐)

Python汉字转拼音pypinyin库、输出excel的xlwt库

发表评论


验证码：