Document Content Extraction, Highlighted Tokenized Search, and Full-Text Retrieval with Spring Boot and Elasticsearch

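Requirements
The product team wants users to be able to upload text documents such as PDF, Word, and TXT files, then run fuzzy searches over file information by attachment name or file content, and view the file content online.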

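I. Environment
Project development environment:

Back-end management system: Spring Boot + MyBatis-Plus + MySQL + Elasticsearch
Search engine: Elasticsearch 7.9.3 + Kibana (graphical interface)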

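II. Implementation
1. Setting up the environment
Setting up Elasticsearch and Kibana is not covered here; there are plenty of guides online.

Setting up the back-end application is not covered either. One point matters a lot, though: the version of the Java client library used to connect to Elasticsearch must match the version of the Elasticsearch server, otherwise you will run into all kinds of problems.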

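2. File content recognition
Step 1: To have Elasticsearch recognize the content of text attachments, first install a plugin: the ingest attachment processor plugin.

This is only one content-recognition plugin. Others exist, for example for OCR, and are worth looking up if you are interested.

The ingest attachment processor plugin is a text-extraction plugin. It builds on the ingest node feature of Elasticsearch and provides the key preprocessor, attachment.

To install it, run the following command from the bin directory of the Elasticsearch installation:

elasticsearch-plugin install ingest-attachment

Because Elasticsearch is installed with Docker here, we need to enter the Elasticsearch container and install the plugin from its bin directory: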

[root@izuf63d0pqnjrga4pi18udz plugins]# docker exec -it es bash
[root@elasticsearch elasticsearch]# ls
LICENSE.txt  NOTICE.txt  README.asciidoc  bin  config  data  jdk  lib  logs  modules  plugins
[root@elasticsearch elasticsearch]# cd bin/
[root@elasticsearch bin]# ls
elasticsearch          elasticsearch-certutil  elasticsearch-croneval  elasticsearch-env-from-file  elasticsearch-migrate  elasticsearch-plugin         elasticsearch-setup-passwords  elasticsearch-sql-cli            elasticsearch-syskeygen  x-pack-env           x-pack-watcher-env
elasticsearch-certgen  elasticsearch-cli       elasticsearch-env       elasticsearch-keystore       elasticsearch-node     elasticsearch-saml-metadata  elasticsearch-shard            elasticsearch-sql-cli-7.9.3.jar  elasticsearch-users      x-pack-security-env
[root@elasticsearch bin]# elasticsearch-plugin install ingest-attachment
-> Installing ingest-attachment
-> Downloading ingest-attachment from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission accessClassInPackage.sun.java2d.cmm.kcms
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission createAccessControlContext
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.BC
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed ingest-attachment

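Once the output reports that ingest-attachment is installed, the installation is complete. Restart Elasticsearch afterwards; otherwise Step 2 will fail with an error.

Step 2: Create a text-extraction pipeline

The pipeline converts uploaded attachments into text content. Word, PDF, and TXT are supported; Excel was not tested but should work too. The Java code later in this article refers to this pipeline by the name attachment.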

{
    "description": "extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}
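
As a rough illustration of how the pipeline definition above could be registered from Java instead of Kibana, the sketch below builds a PUT request for the _ingest/pipeline endpoint with the JDK's java.net.http API. The pipeline name attachment is taken from the Java code later in this article; the base URL http://127.0.0.1:9200 and the class name PipelineRequestSketch are assumptions, and authentication is left out. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PipelineRequestSketch {

    // Pipeline body, copied from the definition shown above.
    static final String PIPELINE_JSON = "{"
            + "\"description\":\"extract attachment information\","
            + "\"processors\":["
            + "{\"attachment\":{\"field\":\"content\",\"ignore_missing\":true}},"
            + "{\"remove\":{\"field\":\"content\"}}"
            + "]}";

    // Builds (but does not send) the request that would create or update the pipeline.
    static HttpRequest buildRequest(String baseUrl, String pipelineName) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_ingest/pipeline/" + pipelineName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_JSON))
                .build();
    }

    public static void main(String[] args) {
        // Assumed local address; adjust to your own cluster.
        HttpRequest request = buildRequest("http://127.0.0.1:9200", "attachment");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need an HttpClient and, for a secured cluster, credentials.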

Step 3: Define the index that stores the content (the Java code later in this article uses the index name fileinfo)

{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "filename":{
        "type": "text",
        "analyzer": "my_ana"
      },
      "contenttype":{
        "type": "text",
         "analyzer": "my_ana"
      },
       "fileurl":{
        "type": "text"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "my_ana"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type":        "stop",
          "stopwords_path": "stopword/stopwords.txt"
        },
        "jieba_synonym": {
          "type":        "synonym",
          "synonyms_path": "synonym/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}

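mappings: defines the format of the stored fields.
settings: index-level configuration. Here it defines a custom analyzer, my_ana, built on the jieba tokenizer (jieba_index), which comes from a separate analysis plugin.

Step 4: Test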

{
    "id":"1",
    "name":"进口红酒",
    "filetype":"pdf",
    "contenttype":"文章",
    "content":"文章内容"
}

For the test, the attachment must be converted to Base64 before it is sent as the content field.

Online file-to-Base64 converter: https://www.zhangxinxu.com/sp/base64.html
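
Instead of an online converter, the Base64 step can be done in a few lines of Java. The sketch below is a minimal example under assumptions: the class and method names are made up for illustration, and the base URL is a placeholder. It encodes bytes (or a file) to Base64 and builds the index URL that routes a document through the attachment pipeline by way of the pipeline query parameter.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class AttachmentIndexSketch {

    // Base64-encodes raw bytes, as the attachment processor expects for its source field.
    static String encodeBytes(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Reads the whole file into memory and encodes it; suitable for small test files only.
    static String encodeFile(Path path) {
        try {
            return encodeBytes(Files.readAllBytes(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // URL of the index API with the ingest pipeline applied to the incoming document.
    static URI indexUri(String baseUrl, String index, String pipeline) {
        return URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipeline);
    }

    public static void main(String[] args) {
        System.out.println(encodeBytes("hello".getBytes(StandardCharsets.UTF_8))); // prints aGVsbG8=
        // Assumed local address; index and pipeline names follow this article.
        System.out.println(indexUri("http://127.0.0.1:9200", "fileinfo", "attachment"));
    }
}
```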

Query the files that were just uploaded:

{
    "took": 861,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "lkpegyibz3nlbkqzxyx9",
                "_score": 1.0,
                "_source": {
                    "filename": "测试_20220809164145a002.docx",
                    "updatetime": 1660034506000,
                    "attachment": {
                        "date": "2022-08-09t01:38:00z",
                        "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                        "author": "dell",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 2572
                    },
                    "createtime": 1660034506000,
                    "fileurl": "http://localhost:8092/fileinfo/profile/upload/fileinfo/2022/08/09/测试_20220809164145a002.docx",
                    "id": 1306333192,
                    "contenttype": "文章",
                    "filetype": "docx"
                }
            },
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "muphgyibz3nlbkqzwivw",
                "_score": 1.0,
                "_source": {
                    "filename": "测试_20220809164527a001.docx",
                    "updatetime": 1660034728000,
                    "attachment": {
                        "date": "2022-08-09t01:38:00z",
                        "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                        "author": "dell",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 2572
                    },
                    "createtime": 1660034728000,
                    "fileurl": "http://localhost:8092/fileinfo/profile/upload/fileinfo/2022/08/09/测试_20220809164527a001.docx",
                    "id": 1306333193,
                    "contenttype": "文章",
                    "filetype": "docx"
                }
            },
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "jdqshoibbktnu1ugkzfk",
                "_score": 1.0,
                "_source": {
                    "filename": "txt测试_20220810153351a001.txt",
                    "updatetime": 1660116831000,
                    "attachment": {
                        "content_type": "text/plain; charset=utf-8",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 804
                    },
                    "createtime": 1660116831000,
                    "fileurl": "http://localhost:8092/fileinfo/profile/upload/fileinfo/2022/08/10/txt测试_20220810153351a001.txt",
                    "id": 1306333194,
                    "contenttype": "告示",
                    "filetype": "txt"
                }
            }
        ]
    }
}

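After calling the upload API, we can see that the text content has been extracted into Elasticsearch. From here on we can run tokenized searches on the content and highlight the matches.

III. Code
The code logic is as follows. On file upload, the attachment information and its upload URL are stored in the database. Elasticsearch is then called to extract the text content, which is stored in the corresponding index. A full-text search API for the mini program provides keyword suggestions based on file names, fuzzy full-text matching over file names and content, and highlighting of the matched terms. The code follows.

YAML configuration file: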

# Data source configuration
spring:
    # Service module
    devtools:
        restart:
            # Hot-reload switch
            enabled: true
    # Search engine
    elasticsearch:
        rest:
            url: 127.0.0.1
            uris: 127.0.0.1:9200
            connection-timeout: 1000
            read-timeout: 3000
            username: elastic
            password: 123456

ElasticsearchConfig (connection configuration)

package com.yj.rselasticsearch.domain.config;

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class ElasticsearchConfig {
    @Value("${spring.elasticsearch.rest.url}")
    private String edUrl;
    @Value("${spring.elasticsearch.rest.username}")
    private String username;
    @Value("${spring.elasticsearch.rest.password}")
    private String password;

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        // Set the user name and password for the connection
        final BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(username, password));
        RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(
                        new HttpHost(edUrl, 9200, "http"))
                .setHttpClientConfigCallback(httpClientBuilder -> {
                    httpClientBuilder.disableAuthCaching();
                    // Keep pooled connections alive. Without this, the first request after
                    // Elasticsearch had been idle for a while used to time out.
                    httpClientBuilder.setKeepAliveStrategy(((response, context) -> Duration.ofMinutes(5).toMillis()));
                    return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
                })
        );
        return client;
    }
}

Uploading a file, saving its information, and extracting the content into Elasticsearch

Entity class FileInfo

package com.yj.common.core.domain.entity;

import com.baomidou.mybatisplus.annotation.TableField;
import lombok.Getter;
import lombok.Setter;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

import java.util.Date;

@Setter
@Getter
@Document(indexName = "fileinfo", createIndex = false)
public class FileInfo {
    /**
     * Primary key
     */
    @Field(name = "id", type = FieldType.Integer)
    private Integer id;

    /**
     * File name
     */
    @Field(name = "filename", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    private String filename;

    /**
     * File type
     */
    @Field(name = "filetype", type = FieldType.Keyword)
    private String filetype;

    /**
     * Content type
     */
    @Field(name = "contenttype", type = FieldType.Text)
    private String contenttype;

    /**
     * Attachment content (not a database column)
     */
    @Field(name = "attachment.content", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    @TableField(exist = false)
    private String content;

    /**
     * File URL
     */
    @Field(name = "fileurl", type = FieldType.Text)
    private String fileurl;

    /**
     * Creation time
     */
    private Date createtime;

    /**
     * Update time
     */
    private Date updatetime;
}

Controller class

package com.yj.rselasticsearch.controller;

import com.yj.common.core.controller.BaseController;
import com.yj.common.core.domain.AjaxResult;
import com.yj.rselasticsearch.service.FileInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;

/**
 * Controller for the file_info table
 *
 * @author xxxxx
 */
@RestController
@RequestMapping("/fileinfo")
public class FileInfoController extends BaseController {
    /**
     * Service object
     */
    @Resource
    private FileInfoService fileInfoService;

    @PutMapping("uploadfile")
    public AjaxResult uploadFile(String contenttype, MultipartFile file) {
        return fileInfoService.uploadFileInfo(contenttype, file);
    }
}

ServiceImpl implementation class

package com.yj.rselasticsearch.service.impl;

import com.alibaba.fastjson.JSON;
import com.baomidou.mybatisplus.core.conditions.query.LambdaQueryWrapper;
import com.yj.common.config.RuoYiConfig;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.core.domain.entity.FileInfo;
import com.yj.common.exception.ServiceException;
import com.yj.common.utils.FastUtils;
import com.yj.common.utils.file.FileUploadUtils;
import com.yj.common.utils.file.FileUtils;
import com.yj.framework.config.ServerConfig;
import com.yj.rselasticsearch.mapper.FileInfoMapper;
import com.yj.rselasticsearch.service.FileInfoService;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Base64;

@Service
@Slf4j
public class FileInfoServiceImpl implements FileInfoService {
    @Resource
    private ServerConfig serverConfig;

    @Autowired
    @Qualifier("restHighLevelClient")
    private RestHighLevelClient client;

    @Resource
    private FileInfoMapper fileInfoMapper;

    /**
     * Uploads a file, then sends it to Elasticsearch for content extraction.
     *
     * @param contenttype content type label of the document
     * @param file        uploaded file
     * @return the saved file information, or an error result
     */
    @Override
    public AjaxResult uploadFileInfo(String contenttype, MultipartFile file) {
        if (FastUtils.checkNullOrEmpty(contenttype, file)) {
            return AjaxResult.error("Request parameters must not be empty");
        }
        try {
            // Upload directory
            String filePath = RuoYiConfig.getUploadPath() + "/fileinfo";
            FileInfo fileInfo = new FileInfo();
            // Upload the file and get back its new name
            String fileName = FileUploadUtils.upload(filePath, file);
            // File extension without the dot, for example "docx"
            String suffix = fileName.substring(fileName.lastIndexOf(".") + 1);
            File files = File.createTempFile(fileName, suffix);
            file.transferTo(files);
            String url = serverConfig.getUrl() + "/fileinfo" + fileName;
            fileInfo.setFilename(FileUtils.getName(fileName));
            fileInfo.setFiletype(suffix);
            fileInfo.setFileurl(url);
            fileInfo.setContenttype(contenttype);
            int result = fileInfoMapper.insertSelective(fileInfo);
            if (result > 0) {
                fileInfo = fileInfoMapper.selectOne(new LambdaQueryWrapper<FileInfo>().eq(FileInfo::getFileurl, fileInfo.getFileurl()));
                byte[] bytes = getContent(files);
                String base64 = Base64.getEncoder().encodeToString(bytes);
                fileInfo.setContent(base64);
                IndexRequest indexRequest = new IndexRequest("fileinfo");
                // Index the document through the attachment pipeline so the file content is extracted
                indexRequest.source(JSON.toJSONString(fileInfo), XContentType.JSON);
                indexRequest.setPipeline("attachment");
                IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
                log.info("indexResponse:" + indexResponse);
            }
            return AjaxResult.success(fileInfo);
        } catch (Exception e) {
            log.error("File upload failed", e);
            return AjaxResult.error(e.getMessage());
        }
    }

    /**
     * Reads a file fully into a byte array (the caller Base64-encodes it).
     *
     * @param file file to read
     * @return the file content
     * @throws IOException if the file cannot be read
     */
    private byte[] getContent(File file) throws IOException {
        long fileSize = file.length();
        if (fileSize > Integer.MAX_VALUE) {
            // Fail explicitly instead of returning null, which the caller cannot encode
            throw new ServiceException("File too big: " + file.getName());
        }
        byte[] buffer = new byte[(int) fileSize];
        int offset = 0;
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fi = new FileInputStream(file)) {
            int numRead;
            while (offset < buffer.length
                    && (numRead = fi.read(buffer, offset, buffer.length - offset)) >= 0) {
                offset += numRead;
            }
        }
        // Make sure all data was read
        if (offset != buffer.length) {
            throw new ServiceException("Could not completely read file "
                    + file.getName());
        }
        return buffer;
    }
}
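
The upload method above derives the file type from the text after the last dot of the file name. As a small, hypothetical helper (the class and method names are not part of the original project), the sketch below shows one way to make that step explicit and to handle names that have no extension.

```java
public class FileExtensionSketch {

    // Returns the text after the last dot, or "" when the name has no dot or ends with a dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report_20220809.docx")); // prints docx
        System.out.println(extensionOf("README").isEmpty());     // prints true
    }
}
```

A plain substring after lastIndexOf(".") would return the whole name for a file without a dot, so an explicit check like this avoids storing a misleading file type.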

Highlighted tokenized search

Request parameter class WarningInfoDto

package com.yj.rselasticsearch.domain.dto;

import io.swagger.annotations.ApiModel;
import io.swagger.annotations.ApiModelProperty;
import lombok.Data;

import java.util.List;

/**
 * Data transfer object for front-end requests
 *
 * @author luoy
 */
@Data
@ApiModel(value = "WarningInfoDto", description = "Warning information")
public class WarningInfoDto {
    /**
     * Page number
     */
    @ApiModelProperty("Page number")
    private Integer pageindex;

    /**
     * Items per page
     */
    @ApiModelProperty("Items per page")
    private Integer pagesize;

    /**
     * Search keyword
     */
    @ApiModelProperty("Search keyword")
    private String keyword;

    /**
     * Content types
     */
    private List<String> contenttype;

    /**
     * User phone number
     */
    private String phone;
}

Controller class

package com.yj.rselasticsearch.controller;

import com.baomidou.mybatisplus.core.metadata.IPage;
import com.yj.common.core.controller.BaseController;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.core.domain.entity.FileInfo;
import com.yj.rselasticsearch.domain.dto.WarningInfoDto;
import com.yj.rselasticsearch.service.ElasticsearchService;
import io.swagger.annotations.Api;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import org.springframework.web.bind.annotation.*;

import javax.annotation.Resource;
import javax.servlet.http.HttpServletRequest;
import java.util.List;

/**
 * Elasticsearch search engine endpoints
 *
 * @author luoy
 */
@Api("Search engine")
@RestController
@RequestMapping("es")
public class ElasticsearchController extends BaseController {
    @Resource
    private ElasticsearchService elasticsearchService;

    /**
     * Keyword suggestions
     *
     * @param warningInfoDto request parameters
     * @return suggested file names
     */
    @ApiOperation("Keyword suggestions")
    @ApiImplicitParams({
            @ApiImplicitParam(name = "contenttype", value = "Document type", required = true, dataType = "String", dataTypeClass = String.class),
            @ApiImplicitParam(name = "keyword", value = "Keyword", required = true, dataType = "String", dataTypeClass = String.class)
    })
    @PostMapping("getassociationalworddoc")
    public AjaxResult getAssociationalWordDoc(@RequestBody WarningInfoDto warningInfoDto, HttpServletRequest request) {
        List<String> words = elasticsearchService.getAssociationalWordOther(warningInfoDto, request);
        return AjaxResult.success(words);
    }

    /**
     * Paged, highlighted tokenized search
     *
     * @param warningInfoDto request parameters
     * @return one page of matching files
     */
    @ApiOperation("Paged highlighted tokenized search")
    @ApiImplicitParams({
            @ApiImplicitParam(name = "keyword", value = "Keyword", required = true, dataType = "String", dataTypeClass = String.class),
            @ApiImplicitParam(name = "pageindex", value = "Page number", required = true, dataType = "Integer", dataTypeClass = Integer.class),
            @ApiImplicitParam(name = "pagesize", value = "Page size", required = true, dataType = "Integer", dataTypeClass = Integer.class),
            @ApiImplicitParam(name = "contenttype", value = "Document type", required = true, dataType = "String", dataTypeClass = String.class)
    })
    @PostMapping("queryhighlightworddoc")
    public AjaxResult queryHighlightWordDoc(@RequestBody WarningInfoDto warningInfoDto, HttpServletRequest request) {
        IPage<FileInfo> warningInfoListPage = elasticsearchService.queryHighlightWordOther(warningInfoDto, request);
        return AjaxResult.success(warningInfoListPage);
    }
}

ServiceImpl implementation class

package com.yj.rselasticsearch.service.impl;

import com.baomidou.mybatisplus.core.metadata.IPage;
import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
import com.yj.common.core.domain.entity.FileInfo;
import com.yj.common.exception.ServiceException;
import com.yj.common.utils.FastUtils;
import com.yj.rselasticsearch.domain.dto.WarningInfoDto;
import com.yj.rselasticsearch.service.ElasticsearchService;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;
import org.springframework.data.elasticsearch.core.SearchHit;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.NativeSearchQuery;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import javax.servlet.http.HttpServletRequest;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Service
@Slf4j
public class ElasticsearchServiceImpl implements ElasticsearchService {

    @Resource
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    /**
     * Keyword suggestions for documents (suggests file names from the text typed in the search box)
     *
     * @param warningInfoDto request parameters
     * @return up to nine distinct highlighted file names, or null if nothing matched
     */
    @Override
    public List<String> getAssociationalWordOther(WarningInfoDto warningInfoDto, HttpServletRequest request) {
        // Field to search
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                .should(QueryBuilders.matchBoolPrefixQuery("filename", warningInfoDto.getKeyword()))
                // With a must clause present, should clauses are optional unless this is set
                .minimumShouldMatch(1);
        // Filter by contenttype label
        boolQueryBuilder.must(QueryBuilders.termsQuery("contenttype", warningInfoDto.getContenttype()));
        // Build the highlighted query
        NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(boolQueryBuilder)
                .withHighlightFields(
                        new HighlightBuilder.Field("filename")
                )
                .withHighlightBuilder(new HighlightBuilder().preTags("<span style='color:red'>").postTags("</span>"))
                .build();
        // Run the query
        SearchHits<FileInfo> search;
        try {
            search = elasticsearchRestTemplate.search(searchQuery, FileInfo.class);
        } catch (Exception ex) {
            log.error("Elasticsearch query failed", ex);
            throw new ServiceException(String.format("Operation failed, please contact the administrator! %s", ex.getMessage()));
        }
        // Collect the highlighted file names
        List<String> resultList = new LinkedList<>();
        for (SearchHit<FileInfo> searchHit : search.getSearchHits()) {
            // Highlighted fragments
            Map<String, List<String>> highlightFields = searchHit.getHighlightFields();
            List<String> fileNameHighlights = highlightFields.get("filename");
            if (fileNameHighlights != null) {
                resultList.add(fileNameHighlights.get(0));
            }
        }
        // Remove duplicates and keep at most nine entries
        List<String> newResult = null;
        if (!FastUtils.checkNullOrEmpty(resultList)) {
            newResult = resultList.stream().distinct().limit(9).collect(Collectors.toList());
        }
        return newResult;
    }

    /**
     * Highlighted tokenized search over other document types
     *
     * @param warningInfoDto request parameters
     * @param request        current HTTP request
     * @return one page of matching files with highlighted fields
     */
    @Override
    public IPage<FileInfo> queryHighlightWordOther(WarningInfoDto warningInfoDto, HttpServletRequest request) {
        // Paging (Spring Data pages are zero-based)
        Pageable pageable = PageRequest.of(warningInfoDto.getPageindex() - 1, warningInfoDto.getPagesize());
        // Fields to search: tokenized full-text search over filename and attachment.content
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                .should(QueryBuilders.matchBoolPrefixQuery("filename", warningInfoDto.getKeyword()))
                .should(QueryBuilders.matchBoolPrefixQuery("attachment.content", warningInfoDto.getKeyword()))
                // With a must clause present, should clauses are optional unless this is set
                .minimumShouldMatch(1);
        // Filter by contenttype label
        boolQueryBuilder.must(QueryBuilders.termsQuery("contenttype", warningInfoDto.getContenttype()));
        // Build the highlighted query
        NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(boolQueryBuilder)
                .withPageable(pageable)
                .withHighlightFields(
                        new HighlightBuilder.Field("filename"), new HighlightBuilder.Field("attachment.content")
                )
                .withHighlightBuilder(new HighlightBuilder().preTags("<span style='color:red'>").postTags("</span>"))
                .build();
        // Run the query
        SearchHits<FileInfo> search;
        try {
            search = elasticsearchRestTemplate.search(searchQuery, FileInfo.class);
        } catch (Exception ex) {
            log.error("Elasticsearch query failed", ex);
            throw new ServiceException(String.format("Operation failed, please contact the administrator! %s", ex.getMessage()));
        }
        // Collect the entities to return
        List<FileInfo> resultList = new LinkedList<>();
        for (SearchHit<FileInfo> searchHit : search.getSearchHits()) {
            // Highlighted fragments
            Map<String, List<String>> highlightFields = searchHit.getHighlightFields();
            FileInfo fileInfo = searchHit.getContent();
            // Replace the plain values with the highlighted fragments where available
            if (highlightFields.get("filename") != null) {
                fileInfo.setFilename(highlightFields.get("filename").get(0));
            }
            if (highlightFields.get("content") != null) {
                fileInfo.setContent(highlightFields.get("content").get(0));
            }
            resultList.add(fileInfo);
        }
        // Build the page object by hand
        IPage<FileInfo> warningInfoIPage = new Page<>();
        warningInfoIPage.setTotal(search.getTotalHits());
        warningInfoIPage.setRecords(resultList);
        warningInfoIPage.setCurrent(warningInfoDto.getPageindex());
        warningInfoIPage.setSize(warningInfoDto.getPagesize());
        // Number of pages = total / page size, rounded up
        long pageSize = warningInfoDto.getPagesize();
        warningInfoIPage.setPages((warningInfoIPage.getTotal() + pageSize - 1) / pageSize);
        return warningInfoIPage;
    }
}
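
Two small pieces of post-processing in the service above are easy to get wrong: trimming the suggestion list to a fixed number of distinct entries, and computing the page count from the total number of hits. The sketch below isolates both with plain JDK code. The class and method names are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ResultHelpersSketch {

    // Removes duplicates (keeping first occurrences) and returns at most max entries.
    static List<String> distinctTop(List<String> values, int max) {
        return values.stream().distinct().limit(max).collect(Collectors.toList());
    }

    // Number of pages needed for total items: integer division rounded up.
    static long pageCount(long total, long pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(distinctTop(Arrays.asList("a", "b", "a", "c"), 9)); // prints [a, b, c]
        System.out.println(pageCount(25, 10)); // prints 3
    }
}
```

Applying limit after distinct avoids the out-of-range error that a fixed subList(0, 9) can raise when removing duplicates leaves fewer than nine entries. The rounded-up division differs from a remainder (total % pageSize), which only happens to give the right page count in special cases such as one hit with a page size of ten.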

Code test:

-- Request JSON
{
    "keyword":"全库备份",
    "contenttype":["告示"],
    "pageindex":1,
    "pagesize":10
}
 
 
-- Response
{
    "msg": "操作成功",
    "code": 200,
    "data": {
        "records": [
            {
                "id": 1306333194,
                "filename": "txt测试_20220810153351a001.txt",
                "filetype": "txt",
                "contenttype": "告示",
                "content": "•\t秒级快速<span style='color:red'>备份</span>\r\n不论多大的数据量，<span style='color:red'>全库</span><span style='color:red'>备份</span>只需30秒，而且<span style='color:red'>备份过程</span>不会对数据库加锁，对应用程序几乎无影响，全天24小时均可进行<span style='color:red'>备份</span>。",
                "fileurl": "http://localhost:8092/fileinfo/profile/upload/fileinfo/2022/08/10/txt测试_20220810153351a001.txt",
                "createtime": "2022-08-10t15:33:51.000+08:00",
                "updatetime": "2022-08-10t15:33:51.000+08:00"
            }
        ],
        "total": 1,
        "size": 10,
        "current": 1,
        "orders": [],
        "optimizecountsql": true,
        "searchcount": true,
        "countid": null,
        "maxlimit": null,
        "pages": 1
    }
}

The response contains the content matched by the tokenized search, with the matched terms highlighted.
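
To make the search easier to reproduce in Kibana or with curl, the sketch below assembles a request body that roughly corresponds to the query built in queryHighlightWordOther: two match_bool_prefix clauses, a terms filter on contenttype, and the same highlight tags. It is an approximation under assumptions, not the exact JSON Spring Data sends. The class name and the hand-rolled quoting are illustrative only, and minimum_should_match is included here on the assumption that at least one keyword clause should be required to match.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SearchBodySketch {

    // Minimal JSON string quoting for this sketch: escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds a _search body for the fileinfo index; from is the zero-based offset of the first hit.
    static String buildBody(String keyword, List<String> contentTypes, int from, int size) {
        String types = contentTypes.stream()
                .map(SearchBodySketch::quote)
                .collect(Collectors.joining(","));
        String kw = quote(keyword);
        return "{\"from\":" + from + ",\"size\":" + size + ","
                + "\"query\":{\"bool\":{"
                + "\"should\":["
                + "{\"match_bool_prefix\":{\"filename\":" + kw + "}},"
                + "{\"match_bool_prefix\":{\"attachment.content\":" + kw + "}}],"
                + "\"minimum_should_match\":1,"
                + "\"must\":[{\"terms\":{\"contenttype\":[" + types + "]}}]}},"
                + "\"highlight\":{"
                + "\"pre_tags\":[\"<span style='color:red'>\"],"
                + "\"post_tags\":[\"</span>\"],"
                + "\"fields\":{\"filename\":{},\"attachment.content\":{}}}}";
    }

    public static void main(String[] args) {
        int pageIndex = 1;
        int pageSize = 10;
        // Page 1 starts at offset 0, page 2 at offset pageSize, and so on.
        System.out.println(buildBody("全库备份", List.of("告示"), (pageIndex - 1) * pageSize, pageSize));
    }
}
```

The printed body could be posted to the _search endpoint of the fileinfo index. For real use a JSON library is a safer choice than string concatenation.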

August 6, 2024 • Java
