通过C#代码轻松提取PDF文本_C#

pdf格式因其跨平台兼容性强、安全性高等特点而被广泛使用。但pdf文档不易编辑，因此提取pdf文档的文本从而进行操作是一个常见的需求。提取pdf中的文本可以帮助我们对pdf文档进行内容再利用，从而制作新的pdf文档或将内容插入到其他格式的文档中。在.net平台，我们可以使用简单的c#代码轻松实现pdf文档文本的提取。

本文所使用的方法需要用到free spire.pdf for .net，nuget：pm> install-package freespire.pdf。

用c#提取pdf文本的操作步骤

库中提供了pdftextextractor类来处理pdf文档的文本提取。我们可以使用页面创建pdftextextractor对象，然后使用pdftextextractor.extracttext()方法来提取页面文本。同时，pdftextextractoptions类能对提取选项进行设置，如设置是否保留布局和设置提取的页面区域。以下是一般操作步骤。

创建pdfdocument对象。

使用pdfdocument.loadfromfile()方法载入pdf文档。

使用pdfdocument.pages[]属性获取指定页面，也可以遍历所有页面。

使用页面创建pdftextextractor对象。

创建pdftextextractoptions对象并设置提取选项。

使用pdftextextractor.extracttext(pdftextextractoptions)方法提取页面文本。

将提取的文本写入文件或用于其他用途。

释放资源。

提取pdf文本不保留文本布局

如果需要不保留文本布局直接提取文本内容，可以将pdftextextractoptions.issimpleextraction属性设置为true来实现。以下是代码示例：

using spire.pdf;
using spire.pdf.texts;
using system.text;

namespace extractpdftext
{
    class program
    {
        static void main(string[] args)
        {
            // 创建pdfdocument对象
            pdfdocument pdf = new pdfdocument();

            // 载入pdf文档
            pdf.loadfromfile("sample.pdf");

            // 创建pdftextextractoptions对象，并设置不保留布局
            pdftextextractoptions extractoptions = new pdftextextractoptions();
            extractoptions.issimpleextraction = true;

            // 创建stringbuilder对象以储存提取的文本
            stringbuilder extractedtext = new stringbuilder();

            // 遍历文档页面
            foreach (pdfpagebase page in pdf.pages)
            {
                // 使用页面创建pdftextextractor对象
                pdftextextractor extractor = new pdftextextractor(page);
                // 提取当前页面的文本
                string text = extractor.extracttext(extractoptions);
                // 将提取到的文本添加到stringbuilder对象
                extractedtext.append(text);
            }

            // 将提取结果写入文本文件
            using (streamwriter writer = new streamwriter("output/extractedpdftext.txt", false, encoding.utf8))
            {
                writer.write(extractedtext.tostring());
            }

            // 释放资源
            pdf.close();
        }
    }
}

结果

保留文本布局提取pdf文本

如果在提取pdf文本时需要保留pdf文本的布局（使用空格填补），则可以直接使用默认的提取选项提取pdf文本。以下是代码示例：

using spire.pdf;
using spire.pdf.texts;
using system.text;

namespace extractpdftextandlayout
{
    class program
    {
        static void main(string[] args)
        {
            // 创建pdfdocument对象
            pdfdocument pdf = new pdfdocument();

            // 载入pdf文档
            pdf.loadfromfile("sample.pdf");

            // 创建文本提取选项
            pdftextextractoptions extractoptions = new pdftextextractoptions();

            // 创建stringbuilder对象以储存提取的文本
            stringbuilder extractedtext = new stringbuilder();

            // 遍历文档页面
            foreach (pdfpagebase page in pdf.pages)
            {
                // 使用页面创建pdftextextractor对象
                pdftextextractor extractor = new pdftextextractor(page);
                // 提取当前页面的文本
                string text = extractor.extracttext(extractoptions);
                // 将提取到的文本添加到stringbuilder对象
                extractedtext.append(text);
            }

            // 将提取结果写入文本文件
            using (streamwriter writer = new streamwriter("output/extractedpdftext.txt", false, encoding.utf8))
            {
                writer.write(extractedtext.tostring());
            }

            // 释放资源
            pdf.close();
        }
    }
}

结果

提取pdf页面指定区域内的文本

我们还可以通过pdftextextractoptions.extractarea属性设置提取区域，从而实现提取页面上指定区域内的文本。以下是代码示例：

using spire.pdf.texts;
using spire.pdf;

using system.drawing;
using system.text;

namespace extractpdftextarea
{
    class program
    {
        static void main(string[] args)
        {
            // 创建pdfdocument对象
            pdfdocument pdf = new pdfdocument();

            // 载入pdf文档
            pdf.loadfromfile("sample.pdf");

            // 获取指定页面
            pdfpagebase page = pdf.pages[0];

            // 创建pdftextextractor对象
            pdftextextractor extractor = new pdftextextractor(page);

            // 创建pdftextextractoptions对象
            pdftextextractoptions extractoptions = new pdftextextractoptions();
            // 设置要提取文本的矩形区域
            extractoptions.extractarea = new rectanglef(80, 100, 250, 150);

            // 提取页面上指定位置的文本
            string extractedtext = extractor.extracttext(extractoptions);

            // 将提取的文本写入文本文件
            file.writealltext("output/extractpdfpageareatext.txt", extractedtext, encoding.utf8);

            // 释放资源
            pdf.close();
        }
    }
}

结果