
Automating Word Document Processing in Python with python-docx

May 25, 2025 · Python

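I. Introduction

As demand for office automation grows, the python-docx library gives Python fine-grained control over Word documents. This article shows how to copy paragraph styles, convert HTML tables to Word tables, and dynamically generate a customizable template in code.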

II. Core Modules

1. Copying paragraph styles and images

import io

def copy_inline_shapes(new_doc, img):
    """Insert previously extracted inline shapes (usually pictures) into a new paragraph."""
    new_para = new_doc.add_paragraph()
    for image_bytes, w, h in img:
        # Add each picture to the new paragraph at the width and height passed in
        new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)

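What it does: takes the images extracted from the old document and inserts them into the new one. Width and height are passed in explicitly, so they can be preserved or overridden.

When to use: documents that mix text and images and need to keep their original formatting.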

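2. Converting an HTML table to a Word table

def html_table_to_docx(doc, html_content):
    # Convert each HTML <table> into a Word table, including merged cells
    ...  # full implementation in Section VII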

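What it does: converts the parsed HTML table structure into a table in the Word document, with support for horizontal and vertical merges.

Key points:

  • Parse the HTML with BeautifulSoup
  • Handle cell styles, borders, and background colors
  • Support style inheritance for multi-level headings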
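
The merge handling comes down to a grid-occupancy walk: every cell claims a rectangle of grid slots, and later cells skip slots that are already taken. The following is a minimal standard-library sketch of that idea only, not the article's implementation. The function name place_cells and the (text, colspan, rowspan) tuple input are assumptions made for illustration.

```python
def place_cells(rows):
    """Assign each cell its top-left grid position.

    `rows` is a list of rows; each row is a list of (text, colspan, rowspan).
    Returns (n_cols, placements), where each placement is
    (row, col, rowspan, colspan, text) with spans clipped to the grid.
    """
    n_rows = len(rows)
    # Estimate the column count as the widest row, counting colspans
    n_cols = max((sum(colspan for _, colspan, _ in row) for row in rows), default=0)
    used = [[False] * n_cols for _ in range(n_rows)]
    placements = []

    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row
            while c < n_cols and used[r][c]:
                c += 1
            if c >= n_cols:
                break
            # Clip the span so it never runs past the table edge
            end_r = min(r + rowspan, n_rows)
            end_c = min(c + colspan, n_cols)
            for rr in range(r, end_r):
                for cc in range(c, end_c):
                    used[rr][cc] = True
            placements.append((r, c, end_r - r, end_c - c, text))
            c = end_c
    return n_cols, placements


if __name__ == "__main__":
    demo = [[("A", 1, 2), ("B", 2, 1)], [("C", 1, 1), ("D", 1, 1)]]
    print(place_cells(demo))
```

Each placement maps naturally onto a python-docx call of the form table.cell(row, col).merge(table.cell(end_row, end_col)), which is how the full listing in Section VII applies the spans.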

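3. Template generation and dynamic styles

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def generate_template():
    doc = Document()
    for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
        for bold_flag in [True, False]:
            # Create paragraphs in each style for this alignment/bold pair
            ...  # see the full script in Section VII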

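What it does: dynamically generates a template document that shows the paragraph styles under each alignment (left, right, center, and unset).

Advantage: new styles are quick to add, so the template adapts to different scenarios.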

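III. Complete Examples

Example 1: Copying paragraph styles and images

from docx import Document

def clone_document(old_s, old_p, old_ws, new_doc_path):
    # Condensed outline; the full version is in Section VII
    new_doc = Document()
    image_sets = iter(s["image"] for s in old_s if "image" in s)
    for para in old_p:  # each entry is a one-item dict
        for key, value in para.items():
            if key == "image_none":  # image placeholder
                copy_inline_shapes(new_doc, next(image_sets))
            elif key == "table":  # value holds the table as HTML
                html_table_to_docx(new_doc, value)
            else:  # key is the run text, value is its style key
                ...  # look up the style records, then call clone_paragraph(...)
    new_doc.save(new_doc_path)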

Example 2: Converting an HTML table to Word

from bs4 import BeautifulSoup

def html_table_to_docx(doc, html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        # Handle merged cells and style conversion...
        ...

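IV. Key Implementation Details

1. Style copying strategy

Inheritance: font, alignment, and related attributes are carried over through the run_style and style fields of each extracted paragraph record.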
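
To make the record layout concrete, here is a small sketch of how a style key can tie a piece of text back to its record. It mirrors the key format used in the Section VII listing (style name plus the first word of the alignment's string form), but the helper names style_key and find_style are hypothetical, and plain strings stand in for python-docx style and alignment objects.

```python
def style_key(style_name, alignment):
    """Build a lookup key such as 'Heading 1_CENTER'.

    `alignment` may be None (unset), which yields the suffix 'None'.
    """
    return f"{style_name}_{str(alignment).split()[0]}"


def find_style(records, key):
    """Return the first record whose 'style' mapping contains `key`, else None."""
    for record in records:
        if key in record.get("style", {}):
            return record
    return None


if __name__ == "__main__":
    key = style_key("Heading 1", "CENTER (1)")  # stand-in for an alignment enum's string form
    records = [{"style": {key: "<style object>"}, "run_style": [], "alignment": {key: "CENTER (1)"}}]
    print(key, find_style(records, key) is records[0])
```

Returning None for a missing key leaves the fallback decision to the caller, for example when the template document does not define a style that the source document uses.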

Page breaks: is_page_break determines whether a page break follows a paragraph or a table.
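
In WordprocessingML an explicit page break is a w:br element with w:type="page", and it sits inside a run (w:r), not directly under the paragraph. The sketch below shows that check on raw XML using only the standard library. The function name has_page_break and the sample strings are illustrative, and other mechanisms such as w:pageBreakBefore in the paragraph properties are deliberately not covered.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_page_break(paragraph_xml):
    """Return True if the <w:p> XML contains <w:br w:type="page"/> at any depth."""
    paragraph = ET.fromstring(paragraph_xml)
    # iter() walks all descendants, so breaks nested inside runs are found
    for br in paragraph.iter(f"{{{W}}}br"):
        if br.get(f"{{{W}}}type") == "page":
            return True
    return False


PAGE_BREAK_P = f'<w:p xmlns:w="{W}"><w:r><w:br w:type="page"/></w:r></w:p>'
LINE_BREAK_P = f'<w:p xmlns:w="{W}"><w:r><w:br/></w:r></w:p>'

if __name__ == "__main__":
    print(has_page_break(PAGE_BREAK_P), has_page_break(LINE_BREAK_P))
```

Searching descendants matters here: iterating only the direct children of the paragraph would see w:pPr and w:r elements but never the w:br itself.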

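2. Table conversion optimizations

Merged-cell detection: horizontal and vertical merges are identified from the cell's tcPr element (w:gridSpan and w:vMerge).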
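
One detail is easy to miss when reading these properties: only the first cell of a vertical merge carries w:val="restart", while the cells it covers usually contain a bare w:vMerge element, which means "continue". A minimal standard-library sketch on raw cell XML follows; merge_info is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def merge_info(tc_xml):
    """Return (colspan, v_merge) for a <w:tc> element given as XML text.

    v_merge is 'restart', 'continue', or None when the cell is not merged vertically.
    """
    tc = ET.fromstring(tc_xml)
    tc_pr = tc.find(f"{{{W}}}tcPr")
    colspan, v_merge = 1, None
    if tc_pr is not None:
        grid_span = tc_pr.find(f"{{{W}}}gridSpan")
        if grid_span is not None:
            colspan = int(grid_span.get(f"{{{W}}}val", "1"))
        v_merge_el = tc_pr.find(f"{{{W}}}vMerge")
        if v_merge_el is not None:
            # A missing w:val attribute defaults to "continue"
            v_merge = v_merge_el.get(f"{{{W}}}val", "continue")
    return colspan, v_merge


FIRST_CELL = (f'<w:tc xmlns:w="{W}"><w:tcPr><w:gridSpan w:val="2"/>'
              f'<w:vMerge w:val="restart"/></w:tcPr></w:tc>')
COVERED_CELL = f'<w:tc xmlns:w="{W}"><w:tcPr><w:vMerge/></w:tcPr></w:tc>'

if __name__ == "__main__":
    print(merge_info(FIRST_CELL), merge_info(COVERED_CELL))
```

The rowspan of a merged cell is then one plus the number of consecutive "continue" cells found at the same grid column in the rows below it.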

Style migration: visual attributes such as borders and background colors are preserved.
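
Carrying borders and fills between Word and HTML involves two small conversions: border widths in w:sz are measured in eighths of a point, and Word's border names (single, nil, and so on) are not CSS keywords. The sketch below illustrates one possible mapping. The helper names border_to_css and css_color_to_fill, and the reduced style table, are assumptions for illustration and do not cover every border type Word supports.

```python
import re

# Partial mapping from Word border values to CSS border styles (illustrative only)
_BORDER_STYLES = {
    "single": "solid", "double": "double", "dotted": "dotted",
    "dashed": "dashed", "nil": "none", "none": "none",
}


def border_to_css(side, val="single", sz="4", color="auto"):
    """Translate one Word cell border into a CSS declaration."""
    width_pt = int(sz) / 8  # w:sz counts eighths of a point
    css_style = _BORDER_STYLES.get(val, "solid")
    # "auto" and other non-hex values fall back to black
    css_color = f"#{color}" if re.fullmatch(r"[0-9A-Fa-f]{6}", color) else "#000000"
    return f"border-{side}:{width_pt:g}pt {css_style} {css_color};"


def css_color_to_fill(value):
    """Return a 6-digit hex string usable as w:fill, or None if `value` is not hex."""
    match = re.fullmatch(r"#?([0-9A-Fa-f]{6})", value.strip())
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    print(border_to_css("top"))
    print(css_color_to_fill("#ffcc00"), css_color_to_fill("red"))
```

Named CSS colors such as "red" are rejected here instead of being written into the document, since w:fill expects a hex value.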

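3. Dynamic template generation

Multiple styles: the template is built by iterating over all paragraph styles, which keeps it easy to extend.

Flexible configuration: users can customize page-break positions and style parameters.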

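V. Use Cases

Scenario                      Solution
Paragraph layout              Copy styles automatically and preserve formatting
Data table export             Convert HTML to Word tables, with merged-cell support
Report template generation    Dynamically create a template file containing multiple styles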

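VI. Summary

With python-docx we have built a complete workflow, from style copying to table conversion. The dynamically generated template adds further flexibility. Whether the task is complex text-and-image layout or quickly producing documents in several styles, this approach offers an efficient path.

Recommendation: in practice, iterate over the elements of the python-docx Document object for finer-grained control. Catching exceptional cases (for example, malformed image data) is also an important part of making the code robust.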
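
As one way to act on the robustness advice, the extracted image bytes can be screened before they are handed to add_picture. The sketch below checks a few well-known file signatures with the standard library only. The names sniff_image_type and split_usable_images are hypothetical, and the signature list is illustrative, not the full set of formats python-docx accepts. Wrapping the add_picture call in a try/except is a reasonable alternative.

```python
# (signature bytes, format name) pairs for a few common image formats
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_image_type(data):
    """Return a format name if `data` starts with a known signature, else None."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def split_usable_images(images):
    """Split (bytes, width, height) items into usable ones and a count of skipped ones."""
    kept, skipped = [], 0
    for image_bytes, width, height in images:
        if sniff_image_type(image_bytes) is None:
            skipped += 1  # unknown or corrupted data: leave it out instead of failing mid-document
        else:
            kept.append((image_bytes, width, height))
    return kept, skipped


if __name__ == "__main__":
    png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
    print(split_usable_images([(png, 100, 50), (b"not an image", 1, 1)])[1])
```

The kept list has the same (bytes, width, height) shape that copy_inline_shapes consumes, so it can be passed through unchanged.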

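VII. Further Reading

Generating a document from template styles

import io

from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
# Local module; presumably the second script in this section saved as wan_neng_copy_word.py
from wan_neng_copy_word import clone_document as get_para_style, html_table_to_docx


# The rest is unchanged...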

def copy_inline_shapes(new_doc, img):
    """Insert previously extracted inline shapes (usually pictures) into a new paragraph."""
    new_para = new_doc.add_paragraph()
    for image_bytes, w, h in img:
        # Add each picture to the new paragraph at the width and height passed in
        new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)


def copy_paragraph_style(run_from, run_to):
    """Copy the character formatting of one run to another."""
    run_to.bold = run_from.bold
    run_to.italic = run_from.italic
    run_to.underline = run_from.underline
    run_to.font.size = run_from.font.size
    run_to.font.color.rgb = run_from.font.color.rgb
    run_to.font.name = run_from.font.name
    run_to.font.all_caps = run_from.font.all_caps
    run_to.font.strike = run_from.font.strike
    run_to.font.shadow = run_from.font.shadow


def is_page_break(element):
    """Return True if a paragraph contains a page break, or if the paragraph right after a table does."""
    def has_page_break(p):
        # <w:br w:type="page"/> sits inside a run (<w:r>), so search all descendants
        return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

    if element.tag == qn('w:p'):
        return has_page_break(element)
    if element.tag == qn('w:tbl'):
        # A page break after a table lives in the next sibling paragraph
        next_element = element.getnext()
        return (next_element is not None and next_element.tag == qn('w:p')
                and has_page_break(next_element))
    return False


def clone_paragraph(para_style, text, new_doc, para_style_ws):
    """Create a new paragraph from an extracted style record.

    para_style comes from the source document, para_style_ws from the template.
    """
    new_para = new_doc.add_paragraph()
    para_style_ws = list(para_style_ws["style"].values())[0]
    para_style_data = list(para_style["style"].values())[0]
    para_style_ws.font.size = para_style_data.font.size

    new_para.style = para_style_ws

    new_run = new_para.add_run(text)
    if para_style["run_style"]:  # a paragraph without runs has no run formatting to copy
        copy_paragraph_style(para_style["run_style"][0], new_run)
    new_para.alignment = list(para_style["alignment"].values())[0]

    return new_para


def copy_cell_borders(old_cell, new_cell):
    """Copy the border settings of one table cell to another."""
    old_tc = old_cell._tc
    new_tc = new_cell._tc

    # Look only at this cell's own properties, not at cells of nested tables
    old_borders = old_tc.xpath('./w:tcPr/w:tcBorders')
    if old_borders:
        old_border = old_borders[0]
        new_border = OxmlElement('w:tcBorders')

        border_types = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
        for border_type in border_types:
            old_element = old_border.find(qn(f'w:{border_type}'))
            if old_element is not None:
                new_element = OxmlElement(f'w:{border_type}')
                for attr, value in old_element.attrib.items():
                    new_element.set(attr, value)
                new_border.append(new_element)

        tc_pr = new_tc.get_or_add_tcPr()
        tc_pr.append(new_border)


def clone_table(old_table, new_doc):
    """Create a new table from an existing one, copying text, run formatting, and borders."""
    new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))
    if old_table.style:
        new_table.style = old_table.style

    for i, old_row in enumerate(old_table.rows):
        for j, old_cell in enumerate(old_row.cells):
            new_cell = new_table.cell(i, j)
            # Drop the default empty paragraph before copying the old ones
            for paragraph in new_cell.paragraphs:
                new_cell._element.remove(paragraph._element)
            for old_paragraph in old_cell.paragraphs:
                new_paragraph = new_cell.add_paragraph()
                for old_run in old_paragraph.runs:
                    new_run = new_paragraph.add_run(old_run.text)
                    copy_paragraph_style(old_run, new_run)
                new_paragraph.alignment = old_paragraph.alignment
            copy_cell_borders(old_cell, new_cell)

    for i, col in enumerate(old_table.columns):
        if col.width is not None:
            new_table.columns[i].width = col.width

    return new_table


def clone_document(old_s, old_p, old_ws, new_doc_path):
    new_doc = Document()
    # Image sets are consumed in document order, one per "image_none" placeholder
    image_sets = iter(s["image"] for s in old_s if "image" in s)

    # Copy the body content
    for para in old_p:
        if para == "br":  # page-break marker emitted by the extraction script
            new_doc.add_page_break()
            continue
        for k, v in para.items():
            if "image_none" == k:
                copy_inline_shapes(new_doc, next(image_sets))
            elif "table" == k:
                html_table_to_docx(new_doc, v)
            else:
                # k is the run text, v is its "<style name>_<alignment>" key
                style = [i for i in old_s if "style" in i and v in i["style"]]
                style_ws = [i for i in old_ws if "style" in i and v in i["style"]]
                # Fall back to the source style when the template lacks this key
                clone_paragraph(style[0], k, new_doc, style_ws[0] if style_ws else style[0])

    new_doc.save(new_doc_path)


# Usage example
if __name__ == "__main__":
    body_ws, _ = get_para_style('demo_template.docx')
    body_s, body_p = get_para_style("南山三防工作专报1.docx")
    clone_document(body_s, body_p, body_ws, 'cloned_example.docx')

Separating template styles from text

from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_BREAK
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def docx_table_to_html(word_table):
    """Serialize a Word table to an HTML <table> string, keeping merges and basic cell styling."""
    soup = BeautifulSoup(features='html.parser')
    html_table = soup.new_tag('table', style="border-collapse: collapse;")
    all_rows = list(word_table.rows)

    def v_merge_state(tc):
        """Return 'restart', 'continue', or None for a raw <w:tc> element."""
        tc_pr = tc.tcPr
        v_merge = tc_pr.find(qn('w:vMerge')) if tc_pr is not None else None
        # A bare <w:vMerge/> means "continue"; only the first cell carries w:val="restart"
        return None if v_merge is None else v_merge.get(qn('w:val'), 'continue')

    def grid_span(tc):
        """Return the number of grid columns a raw <w:tc> element covers."""
        tc_pr = tc.tcPr
        span = tc_pr.find(qn('w:gridSpan')) if tc_pr is not None else None
        return 1 if span is None else int(span.get(qn('w:val'), '1'))

    def tc_at(row, grid_col):
        """Return the raw cell that starts at grid column grid_col, or None."""
        col = 0
        for tc in row._tr.tc_lst:
            if col == grid_col:
                return tc
            col += grid_span(tc)
        return None

    css_border_styles = {'single': 'solid', 'double': 'double', 'dotted': 'dotted',
                         'dashed': 'dashed', 'nil': 'none', 'none': 'none'}
    css_align = {WD_ALIGN_PARAGRAPH.LEFT: 'left', WD_ALIGN_PARAGRAPH.CENTER: 'center',
                 WD_ALIGN_PARAGRAPH.RIGHT: 'right'}

    for row_idx, row in enumerate(all_rows):
        html_tr = soup.new_tag('tr')
        cells = row.cells
        col_idx = 0

        # Walk the raw <w:tc> elements, because row.cells hides vertically merged cells
        for tc in row._tr.tc_lst:
            colspan = grid_span(tc)
            state = v_merge_state(tc)
            if state == 'continue' or col_idx >= len(cells):
                col_idx += colspan  # covered by a rowspan that started above
                continue

            cell = cells[col_idx]
            td = soup.new_tag('td')
            td.string = cell.text.strip()  # text content
            td_style = ''

            tc_pr = tc.tcPr
            if tc_pr is not None:
                # Background color
                shd = tc_pr.find(qn('w:shd'))
                bg_color = shd.get(qn('w:fill')) if shd is not None else None
                if bg_color and bg_color != 'auto':
                    td_style += f'background-color:#{bg_color};'

                # Borders
                borders = tc_pr.find(qn('w:tcBorders'))
                if borders is not None:
                    for border_type in ['top', 'left', 'bottom', 'right']:
                        border = borders.find(qn(f'w:{border_type}'))
                        if border is not None:
                            color = border.get(qn('w:color'), '000000')
                            if color == 'auto':
                                color = '000000'
                            size = int(border.get(qn('w:sz'), '4'))  # eighths of a point
                            style = css_border_styles.get(border.get(qn('w:val'), 'single'), 'solid')
                            td_style += f'border-{border_type}:{size / 8:g}pt {style} #{color};'

            # Horizontal alignment is stored on the paragraph, not in tcPr
            alignment = cell.paragraphs[0].alignment
            if alignment in css_align:
                td_style += f'text-align:{css_align[alignment]};'

            # Horizontal merge (colspan)
            if colspan > 1:
                td['colspan'] = str(colspan)

            # Vertical merge (rowspan): count the "continue" cells directly below
            if state == 'restart':
                rowspan = 1
                for next_row in all_rows[row_idx + 1:]:
                    below = tc_at(next_row, col_idx)
                    if below is None or v_merge_state(below) != 'continue':
                        break
                    rowspan += 1
                if rowspan > 1:
                    td['rowspan'] = str(rowspan)

            # Apply the collected style plus a default padding
            td['style'] = td_style + "padding: 5px;"
            html_tr.append(td)
            col_idx += colspan

        html_table.append(html_tr)

    soup.append(html_table)
    return str(soup)

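def set_cell_background(cell, color_hex):
    """Set the cell background; color_hex must be a 6-digit hex color such as '#FFCC00'."""
    color_hex = color_hex.strip().lstrip('#')
    if len(color_hex) != 6 or not set(color_hex.lower()) <= set('0123456789abcdef'):
        raise ValueError(f"unsupported color: {color_hex!r}")
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')  # required attribute: plain fill, no pattern
    shading_elm.set(qn('w:fill'), color_hex)
    cell._tc.get_or_add_tcPr().append(shading_elm)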


def html_table_to_docx(doc, html_content):
    """
    Convert every <table> in an HTML string into a table in a Word document.
    :param doc: python-docx Document instance
    :param html_content: HTML string
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    tables = soup.find_all('table')

    align_map = {
        'left': WD_ALIGN_PARAGRAPH.LEFT,
        'center': WD_ALIGN_PARAGRAPH.CENTER,
        'right': WD_ALIGN_PARAGRAPH.RIGHT
    }

    for html_table in tables:
        # Count the rows
        trs = html_table.find_all('tr')
        rows = len(trs)

        # Estimate the maximum column count (taking colspan into account)
        cols = 0
        for tr in trs:
            col_count = 0
            for cell in tr.find_all(['td', 'th']):
                col_count += int(cell.get('colspan', 1))
            cols = max(cols, col_count)
        if rows == 0 or cols == 0:
            continue  # nothing to build for an empty table

        # Create the Word table
        table = doc.add_table(rows=rows, cols=cols)
        table.style = 'Table Grid'

        # Track grid slots that are already occupied (needed for merges)
        used_cells = [[False for _ in range(cols)] for _ in range(rows)]

        for row_idx, tr in enumerate(trs):
            cells = tr.find_all(['td', 'th'])
            col_idx = 0

            for cell in cells:
                while col_idx < cols and used_cells[row_idx][col_idx]:
                    col_idx += 1

                if col_idx >= cols:
                    break  # avoid running past the last column

                # Read colspan and rowspan
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))

                # Read the text content
                text = cell.get_text(strip=True)

                # Read alignment and background color from the align attribute and inline style
                align = cell.get('align')
                bg_color = None
                for declaration in cell.get('style', '').split(';'):
                    name, _, value = declaration.partition(':')
                    name, value = name.strip().lower(), value.strip()
                    if name in ('background-color', 'background') and bg_color is None:
                        bg_color = value
                    elif name == 'text-align':
                        align = value

                # Get the Word cell, merging it when the HTML cell spans several slots
                word_cell = table.cell(row_idx, col_idx)
                if colspan > 1 or rowspan > 1:
                    end_row = min(row_idx + rowspan - 1, rows - 1)
                    end_col = min(col_idx + colspan - 1, cols - 1)
                    word_cell = word_cell.merge(table.cell(end_row, end_col))

                # Set the text content
                para = word_cell.paragraphs[0]
                para.text = text

                # Set the alignment
                if align in align_map:
                    para.alignment = align_map[align]

                # Set the background color
                if bg_color:
                    try:
                        set_cell_background(word_cell, bg_color)
                    except ValueError:
                        pass  # ignore invalid color formats

                # Mark the occupied grid slots
                for r in range(row_idx, min(row_idx + rowspan, rows)):
                    for c in range(col_idx, min(col_idx + colspan, cols)):
                        used_cells[r][c] = True

                # Move to the next available column
                col_idx += colspan

        # Add an empty paragraph as a separator
        doc.add_paragraph()

    return doc


def copy_inline_shapes(old_paragraph):
    """Collect the inline shapes (usually pictures) of a paragraph as [bytes, width, height] items."""
    images = []
    for shape in old_paragraph._element.xpath('.//w:drawing'):
        blip = shape.find('.//' + qn('a:blip'))
        if blip is None:
            continue
        rid = blip.get(qn('r:embed'))
        if rid is None:
            continue  # linked (non-embedded) pictures carry no image data
        image_part = old_paragraph.part.related_parts[rid]
        image_bytes = image_part.image.blob
        images.append([image_bytes, image_part.image.width, image_part.image.height])
    return images


def is_page_break(element):
    """Return True if a paragraph contains a page break, or if the paragraph right after a table does."""
    def has_page_break(p):
        # <w:br w:type="page"/> sits inside a run (<w:r>), so search all descendants
        return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

    if element.tag == qn('w:p'):
        return has_page_break(element)
    if element.tag == qn('w:tbl'):
        # A page break after a table lives in the next sibling paragraph
        next_element = element.getnext()
        return (next_element is not None and next_element.tag == qn('w:p')
                and has_page_break(next_element))
    return False


def clone_paragraph(old_para):
    """Split a paragraph into a style record and a list of {run text: style key} entries."""
    # The key combines style name and alignment, for example "Heading 1_CENTER"
    style_key = old_para.style.name + "_" + str(old_para.alignment).split()[0]
    style = {"run_style": []}
    if old_para.style:
        # The style is stored here mainly so the heading level can be identified later (by its font)
        style["style"] = {style_key: old_para.style}
    paras = []
    for old_run in old_para.runs:
        style["run_style"].append(old_run)
        paras.append({old_run.text: style_key})

    style["alignment"] = {style_key: old_para.alignment}

    images = copy_inline_shapes(old_para)
    if len(images):
        style["image"] = images
        paras.append({"image_none": "image_none"})
    return style, paras


def clone_document(old_doc_path):
    """Extract style records and ordered content entries from a document.

    Returns (body_style, body_paras).
    """
    try:
        old_doc = Document(old_doc_path)
        paragraphs = old_doc.paragraphs  # top-level paragraphs, in document order
        tables = old_doc.tables          # top-level tables, in document order
        para_index = 0
        table_index = 0

        body_style = []
        body_paras = []

        # Walk the body in order so paragraphs and tables keep their relative positions
        for element in old_doc.element.body:
            if element.tag == qn('w:p'):
                style, paras = clone_paragraph(paragraphs[para_index])
                body_style.append(style)
                body_paras += paras
                para_index += 1
                # Record a page break carried by this paragraph
                if is_page_break(element):
                    body_paras.append("br")
            elif element.tag == qn('w:tbl'):
                body_paras.append({"table": docx_table_to_html(tables[table_index])})
                table_index += 1

        return body_style, body_paras
    except Exception as e:
        print(f"Error while copying the document: {e}")
        raise  # re-raise so the caller does not try to unpack None


# Usage example
if __name__ == "__main__":
    body_s, body_p = clone_document('专报1.docx')

Generating an editable template

from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH

# Create a new Word document
doc = Document()

# Collect the available styles, keeping paragraph styles only
paragraph_styles = [
    style for style in doc.styles if style.type == WD_STYLE_TYPE.PARAGRAPH
]

# Print the number of styles found
print(f"Found {len(paragraph_styles)} paragraph styles:")
for style in paragraph_styles:
    print(f"- {style.name}")

for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
    for bold_flag in [True, False]:
        # Add a sample paragraph for each style
        for style in paragraph_styles:
            heading = doc.add_paragraph()
            run = heading.add_run(f"Style name: {style.name}")
            run.bold = bold_flag
            para = doc.add_paragraph(f"This is a sample paragraph using the '{style.name}' style.", style=style)
            para.alignment = align
            # Add a separator line (optional)
            doc.add_paragraph("-" * 40)

# Save as demo_template.docx
doc.save("demo_template.docx")
print("\n✅ Generated a template file containing all paragraph styles: demo_template.docx")
