【项目日记（二）】搜索引擎-索引制作_搜素引擎

在这里插入图片描述

1.前言

2.索引结构

创建index类，通过这个类来构建索引结构
基本步骤：

用arraylist创建正排索引
用hashmap创建倒排索引
1.给定docid在正排索引中，查询详细信息
2.给定一个词，在倒排索引中查与这个词的关联文档
3.往索引中新增文档
4.把内存的索引保存到磁盘
5.把磁盘的索引结构保存到内存

2.1创捷索引

正排索引

private arraylist<docinfo> forwardindex=new arraylist<>();

倒排索引

private hashmap<string,arraylist<weight>> invertedindex=new hashmap<>();

docinfo类：

public class docinfo {
    private int docid;
    private string title;
    private string url;
    private string content;
    public int getdocid() {
        return docid;
    }
    public void setdocid(int docid) {
        this.docid = docid;
    }
    public string gettitle() {
        return title;
    }
    public void settitle(string title) {
        this.title = title;
    }
    public string geturl() {
        return url;
    }
    public void seturl(string url) {
        this.url = url;
    }
    public string getcontent() {
        return content;
    }
    public void setcontent(string content) {
        this.content = content;
    }
}

weight类：

public class weight {
    private int docid;
    private int weight;
    public int getdocid() {
        return docid;
    }
    public void setdocid(int docid) {
        this.docid = docid;
    }
    public int getweight() {
        return weight;
    }
    public void setweight(int weight) {
        this.weight = weight;
    }
}

2.2根据索引查询

//1.根据docid查询文档详情，数组小标就是文档id
    public docinfo getdocinfo(int docid){
        return forwardindex.get(docid);
    }
//2.给定一个词，查在哪些文档中
    public list<weight> getinverted(string term){
        return invertedindex.get(term);
    }

2.3新增文档

 public void adddoc(string title,string url,string content){
        //给正排索引新增和倒排索引都新增信息
        //构建正排索引
        docinfo docinfo=buildforward(title,url,content);
        //创建倒排索引
        buildinverted(docinfo);
    }

在正排索引中添加文档：

private docinfo buildforward(string title, string url, string content) {
        docinfo docinfo=new docinfo();
        docinfo.settitle(title);
        docinfo.seturl(url);
        docinfo.setcontent(content);
        //巧妙设计：docinfoid的下标和数组下标一一对应
        docinfo.setdocid(forwardindex.size());
        forwardindex.add(docinfo);
        return docinfo;
    }

在倒排索引中新增文档
1.需要统计每一个词在文档中的出现次数，在根据次数算出权重
2.首先进行分词操作，统计每一个不同的词在标题中出现的次数
3.再进行分词操作，统计每一个词在正文出现的次数
4.设置权重为标题次数*10+正文次数

 private void buildinverted(docinfo docinfo) {
        class wordcnt{
        public int titlecount;
        public int contentcount;
        }
        hashmap<string,wordcnt> wordcnthashmap=new hashmap<>();
        //1.针对标题进行分词操作
        list<term> terms= toanalysis.parse(docinfo.gettitle()).getterms();
        //2.针对分词结果，统计每个词出现的次数
        for (term term:terms){
            string word=term.getname();
            wordcnt wordcnt=wordcnthashmap.get(word);
            if (wordcnt==null){
                wordcnt newwordcnt=new wordcnt();
                newwordcnt.titlecount=1;
                newwordcnt.contentcount=0;
                wordcnthashmap.put(word,newwordcnt);
            }else {
                wordcnt.titlecount+=1;
            }
        }
        //3.针对正文进行分词操作
        list<term> terms2=toanalysis.parse(docinfo.getcontent()).getterms();
        //4.遍历分词结果，统计每个词出现的次数
        for (term term:terms2){
            string word=term.getname();
            wordcnt wordcnt=wordcnthashmap.get(word);
            if (wordcnt==null){
                wordcnt newwordcnt=new wordcnt();
                newwordcnt.titlecount=0;
                newwordcnt.contentcount=1;
                wordcnthashmap.put(word,newwordcnt);
            }else {
                wordcnt.contentcount+=1;
            }
        }
        //5.设置权重为：标题*10+正文
        //一个对象必须实现了iterable接口才能使用for each进行遍历，而map并没有实现该接口，但set实现了，所以就把map转换为set
        for(map.entry<string,wordcnt> entry:wordcnthashmap.entryset()) {            
     list<weight> invertedlist=invertedindex.get(entry.getkey());
                if (invertedlist==null){
                    arraylist<weight> newinvertedlist=new arraylist<>();
                    weight weight=new weight();
                    weight.setweight(entry.getvalue().titlecount*10+entry.getvalue().contentcount);
                    weight.setdocid(docinfo.getdocid());
                    newinvertedlist.add(weight);
                    invertedindex.put(entry.getkey(),newinvertedlist);
                }else {
                    weight weight=new weight();
                    weight.setdocid(docinfo.getdocid());
                    weight.setweight(entry.getvalue().titlecount*10+entry.getvalue().contentcount);
                    invertedlist.add(weight);
                }        
        }
    }

2.4内存索引保存到磁盘

索引当前是存储在内存中的，构造索引的过程是非常耗时的，因此我们就不应该再服务器启动时才去构造索引，通常就把这些耗时的操作，单独执行完成之后，然后再让线上的服务器加载构造好的索引。
我们就把内存中构造的索引结构，给变成一个字符串，然后写入文件即可，这个操作就叫序列化。适应jackson中的objectmapper来完成此操作。

 private static  string index_path="d:/doc_searcher_index/";
 public void save(){
        long beg=system.currenttimemillis();
        system.out.println("保存索引开始！");
        file indexpathfile=new file(index_path);
        if(!indexpathfile.exists()){
            indexpathfile.mkdir();
        }
        file forwardindexfile=new file(index_path+"forward.txt");
        file invertedindexfile=new file(index_path+"inverted.txt");
        try {
        //利用objectmapperjava对象转换为json格式
        //从内存中读取forwardindex保存到forwardindexfile
            objectmapper.writevalue(forwardindexfile,forwardindex);
        //从内存中读取nvertedindex保存到invertedindexfile
            forwardindexfileobjectmapper.writevalue(invertedindexfile,invertedindex);
        }catch (ioexception e) {
            e.printstacktrace();
        }
        long end=system.currenttimemillis();
        system.out.println("保存索引完成！消耗时间："+(end-beg)+"ms");
    }

2.5把磁盘索引加载到内存

public void load(){
        long beg=system.currenttimemillis();
        system.out.println("加载索引开始！");
        file forwardindexfile=new file(index_path+"forward.txt");
        file invertedindexfile=new file(index_path+"inverted.txt");
        try {
            forwardindex=objectmapper.readvalue(forwardindexfile, new typereference<arraylist<docinfo>>() {});
            invertedindex=objectmapper.readvalue(invertedindexfile, new typereference<hashmap<string, arraylist<weight>>>() {});
        }catch (ioexception e){
            e.printstacktrace();
        }
        long end=system.currenttimemillis();
        system.out.println("加载引擎结束消耗时间："+(end-beg)+"ms");
    }

parser相当于制作索引的入口，对应到一个可执行的程序
index相当于实现了索引的数据结构，提供一些api
接下来我们就在parser里面调用对应的api
在parser类中解析完html文件时，应添加到索引中

private  void parsehtml(file f) {
        //1.解析html标题
        string title=parsetitle(f);
        //2.解析html的url
        string url=parseurl(f);
        //3.解析html的正文
        long beg=system.nanotime();
        //string content=parsecontent(f);
        string content=parsecontentbyregex(f);
        long mid=system.nanotime();
        //把解析出来的信息加载到索引
        index.adddoc(title,url,content);         
    }

在添加完索引之后，应该把索引保存到磁盘

public void run()  {
        long beg=system.currenttimemillis();
        system.out.println("索引制作开始");
        //1.枚举出input_path下的所有html文件
        arraylist<file> filelist=new arraylist<>();
        enumfile(input_path,filelist);
        //2.解析文档内容
        for (file f:filelist){
            system.out.println("开始解析"+f.getabsolutepath()+"....");
            parsehtml(f);
        }
        //3.把内存构造的索引保存到磁盘
        index.save();
        long end=system.currenttimemillis();
        system.out.println("索引制作结束！消耗时间："+(end-beg)+"ms");
    }

3.性能优化

此时我们已经完成了文档解析和索引制作模块，那么我们进行验证
在这里插入图片描述
文档内容正确生成：

但我们观察索引制作的时间：一个消耗了19973ms就是19s，花费的时间是比较长的，那么有什么办法提高效率呢？方法当然是有的，首先我们得清楚具体是哪一个步骤拖慢了执行效率，我们来分析代码：
在这里插入图片描述
可以看到解析文档的时候从磁盘读文件，循环遍历文件操作，那么显然效率是非常慢的，既然一个线程串行执行效率非常慢，那么我们就采用多线程并发执行来提高效率。

3.1多线程

我们可以使用创建一个线程池来实现并发操作。通过submit往线程池中提价任务，操作极快（只是把runnable对象放入阻塞队列中）。
把代码改进成多线程的版本，线程池中的线程数目具体设置成多少才合适呢？最好通过实验来确定。

public void run()  {
        long beg=system.currenttimemillis();
        system.out.println("索引制作开始");
        //1.枚举出input_path下的所有html文件
        arraylist<file> filelist=new arraylist<>();
        enumfile(input_path,filelist);
         executorservice executorservice= executors.newfixedthreadpool(6);
        //2.解析文档内容
        for (file f:files){
            executorservice.submit(new runnable() {
                @override
                public void run() {
                    system.out.println("解析："+f.getabsolutepath());
                    parsehtml(f);                    
                }
            });
        }
        //3.把内存构造的索引保存到磁盘
        index.save();
        long end=system.currenttimemillis();
        system.out.println("索引制作结束！消耗时间："+(end-beg)+"ms");
    }

3.2线程安全

我们既然引入了多线程就要考虑线程安全问题，要注意修改操作和读写操作。当多个线程同时尝试修改同一个共享数据时，需要确保数据的一致性，避免出现竞态条件。读写操作：如果一个线程在读取共享数据的同时另一个线程在修改该数据，可能导致读取到不一致或无效的数据。
那么我们就需要对程序进行加锁操作：
在这里插入图片描述

3.3countdownlatch类

添加锁虽然解决了线程安全问题，依然有新的问题，那就是在所有文件提交完成后就会立即执行save（）操作，但是可能文件解析还没有完成。为了解决这样的问题，我们就引入 countdownlatch类。
countdownlatch类类似于跑步比赛的裁判，只有所有的选手都撞线了，就认为这场比赛结束了。再构造 countdownlatch的时候指定一下比赛选手的个数，每个选手撞线都要通知一下countdown（），通过await来等待所有的选手都撞线完毕才执行save（）操作。

public void runbythread(){
        long beg=system.currenttimemillis();
        system.out.println("索引开始制作");
        //1.枚举出input_path下的所有html文件
        arraylist<file> files=new arraylist<>();
        enumfile(input_path,files);
        //2.解析文档内容
        countdownlatch latch=new countdownlatch(files.size());
        executorservice executorservice= executors.newfixedthreadpool(6);
        for (file f:files){
            executorservice.submit(new runnable() {
                @override
                public void run() {
                    system.out.println("解析："+f.getabsolutepath());
                    parsehtml(f);
                    latch.countdown();
                }
            });
        }
        try {
            //await会阻塞，把所有选手都调用countdown以后才会继续执行
            latch.await();
        } catch (interruptedexception e) {
            throw new runtimeexception(e);
        }
        //手动的把线程池里面的线程杀掉
        executorservice.shutdown();
        //3.把内存构造的索引保存到磁盘
        index.save();
        long end=system.currenttimemillis();
        system.out.println("索引制作结束！消耗时间："+(end-beg)+"ms");
        system.out.println("t1:"+t1+"t2"+t2);
    }

4.总结

下期预告：项目日记(三)

【项目日记（二）】搜索引擎-索引制作

2024年08月01日 • 搜素引擎 •我要评论

目录

1.前言

2.索引结构

2.1创捷索引

2.2根据索引查询

2.3新增文档

2.4内存索引保存到磁盘

2.5把磁盘索引加载到内存

3.性能优化

3.1多线程

3.2线程安全

3.3countdownlatch类

4.总结

相关文章:

双人成行UE4-nuts game已崩溃怎么解决？双人成行UE4-nuts game崩溃原因及解决方法分享

【虚幻引擎】UE4初学者系列教程(蓝图篇)-全中文新手入门教程（含蓝图节点详细介绍，即时更新中~）

融云：应用出海新增长引擎，GPT-4o 后的 AI 创新与用户运营

大数据ETL工具对比——SeaTunnel、DataX、Sqoop、Flume、Flink CDC、Dlinky、TIS、Chunjun等对比

触摸OpenNJet，感悟云原生

Faster-Whisper 实时识别电脑语音转文本

发表评论


验证码：