How to Read Word Document Content Easily in Python with Spire.Doc

Preface

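Spire.Doc for .NET is a .NET class library dedicated to working with Word documents. It helps developers create, edit, convert, compare, and print Microsoft Word documents quickly and efficiently. As a standalone Word component, Spire.Doc for .NET does not require Microsoft Word to be installed on the system it runs on (server or client), yet it lets developers integrate Word document processing into any .NET application (ASP.NET, Windows Forms, .NET Core, .NET 5.0, .NET 6.0, .NET 7.0, .NET Standard, Xamarin, and Mono Android). The examples in this article use the Python edition, Spire.Doc for Python.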

Note: the file must be closed (for example, not open in Word) while it is being read or written; otherwise an error is raised.
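One way to catch this early is to check whether the file can be opened before handing it to the library. The following is a minimal sketch using only the standard library; `is_file_available` is a hypothetical helper name, not part of Spire.Doc. It assumes a Windows-style lock, where a document open in Word usually cannot be opened for writing. On Linux and macOS the check can return True even when another program has the file open.

```python
def is_file_available(path: str) -> bool:
    """Return True if the file exists and can be opened for reading and writing.

    Hypothetical helper, not part of Spire.Doc. Mode "r+b" neither creates
    nor truncates the file, so the check does not modify the document.
    """
    try:
        with open(path, "r+b"):
            return True
    except OSError:
        # Covers a missing file (FileNotFoundError) and a locked or
        # read-only file (PermissionError).
        return False


# Usage: check the input file before loading it.
if not is_file_available(r'自检测试报告.doc'):
    print("File is missing or in use; close it and try again.")
```

A False result does not say why the file is unavailable, so the message above covers both the missing and the locked case.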

Reading all text content

from spire.doc import *
from spire.doc.common import *

inputfile = r'自检测试报告.doc'
outputfile = r'自检测试报告.docx'

document = Document()  # Create a Document instance
document.LoadFromFile(inputfile)  # Load the Word document
document_text = document.GetText()  # Get all text in the document
print(document_text)

Reading data by section

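document.Sections[index] returns a specific section of the Word document. Once you have the section, you can iterate over its paragraphs, tables, and other elements.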

print(len(document.Sections))  # Number of sections, via len()
print(document.Sections.Count)  # Number of sections, via the Count property
sections = document.Sections

# Print the text paragraph by paragraph
for i in range(sections.Count):
    paragraphs = sections[i].Paragraphs
    for p in range(paragraphs.Count):
        print(paragraphs[p].Text)

Reading page by page

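A Word document is fundamentally a flow document with a flow layout, so it has no notion of "pages". To make page-level operations easier, Spire.Doc for Python provides the FixedLayoutDocument class, which converts a Word document to a fixed layout.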

layoutdoc = FixedLayoutDocument(document)  # Convert the Word document to a fixed layout

print(layoutdoc.Pages.Count)  # Number of pages

for p in range(layoutdoc.Pages.Count):
    page_data = layoutdoc.Pages[p]
    # print(page_data.Text)   # Read text page by page
    cols_data = page_data.Columns
    for col in range(len(cols_data)):
        # print(cols_data[col].Text)  # Read text column by column
        row_data = cols_data[col].Lines
        for row in range(len(row_data)):
            print(row_data[row].Text)  # Read text line by line

Reading headers and footers

sections = document.Sections

for i in range(sections.Count):
    header = sections[i].HeadersFooters.Header  # Header object of this section
    footer = sections[i].HeadersFooters.Footer  # Footer object of this section

    # Print every paragraph in the header
    for h in range(header.Paragraphs.Count):
        headerpara = header.Paragraphs[h]
        print(headerpara.Text)

    # Print every paragraph in the footer
    for f in range(footer.Paragraphs.Count):
        footerpara = footer.Paragraphs[f]
        print(footerpara.Text)

Iterating over table data

document = Document()  # Create a Document instance
document.LoadFromFile(inputfile)  # Load the Word document

for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    for j in range(section.Tables.Count):
        table = section.Tables.get_Item(j)

        # Iterate over the rows of the table
        for row in range(table.Rows.Count):
            row_data = []

            # Iterate over the cells of the row
            for cell in range(table.Rows.get_Item(row).Cells.Count):
                cell_obj = table.Rows.get_Item(row).Cells.get_Item(cell)
                cell_text = ""

                # Concatenate the text of every paragraph in the cell
                for paragraph_index in range(cell_obj.Paragraphs.Count):
                    paragraph = cell_obj.Paragraphs.get_Item(paragraph_index)
                    cell_text += paragraph.Text

                row_data.append(cell_text.strip())

            # Print the row data
            print(row_data)

document.Close()
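If the rows are needed outside the script, the `row_data` lists built above could be collected and serialized instead of printed. The sketch below shows one possible approach with the standard `csv` module. `rows_to_csv` is a hypothetical helper and the sample rows are made-up data, not taken from any real document.

```python
import csv
import io
from typing import Iterable, List


def rows_to_csv(rows: Iterable[List[str]]) -> str:
    """Serialize a sequence of rows (lists of cell text) to a CSV string.

    Hypothetical helper, not part of Spire.Doc.
    """
    buffer = io.StringIO()
    # csv.writer quotes cells that contain commas, quotes, or line breaks
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Usage with made-up rows shaped like the row_data lists above
sample_rows = [["Item", "Result"], ["Power-on check", "Pass, no faults"]]
print(rows_to_csv(sample_rows))
```

Building the string in memory keeps the helper easy to test; writing straight to a file opened with `newline=""` works the same way.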

Finding specific text

def FindAllString(self, matchString: str, caseSensitive: bool, wholeWord: bool) -> List['TextSelection']

Parameters:

matchString: str, the text to search for.
caseSensitive: bool, if True, the match is case-sensitive.
wholeWord: bool, if True, only whole words are matched.
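To illustrate what the two flags mean, here is a small standard-library sketch that mimics them with regular expressions on a plain string. It is not how Spire.Doc implements the search; `find_all` is a hypothetical function, and word boundaries are assumed to follow the `\b` rule of Python's `re` module.

```python
import re
from typing import List


def find_all(text: str, match: str, case_sensitive: bool, whole_word: bool) -> List[str]:
    """Return every occurrence of `match` in `text` under the two flags."""
    pattern = re.escape(match)  # Treat the search text literally
    if whole_word:
        pattern = rf"\b{pattern}\b"  # Require a word boundary on both sides
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.findall(pattern, text, flags)


text = "Report, report, reporting REPORT"
print(find_all(text, "report", True, False))   # ['report', 'report']
print(find_all(text, "report", False, True))   # ['Report', 'report', 'REPORT']
print(find_all(text, "report", True, True))    # ['report']
```

In the first call the second hit comes from inside "reporting", which is exactly what the whole-word flag filters out.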

You can then perform further operations on the matches, such as highlighting them:

document = Document()  # Create a Document instance
document.LoadFromFile(inputfile)  # Load the Word document

# Case-insensitive, whole-word search
textselections = document.FindAllString("测试报告", False, True)

# Set a highlight color on every match
for selection in textselections:
    selection.GetAsOneRange().CharacterFormat.HighlightColor = Color.get_Blue()

document.SaveToFile(outputfile, FileFormat.Docx)
document.Close()
