Lucene Learning Summary, Part 7: Analysis of the Lucene Search Process (6)

2.4.4 Collecting the Result Document Set and Computing Scores

The function IndexSearcher.search(Weight, Filter, int) contains the following code:
TopScoreDocCollector collector = TopScoreDocCollector.create(nDocs, !weight.scoresDocsOutOfOrder());
search(weight, filter, collector);
return collector.topDocs();
2.4.4.1 Creating the Result Document Collector
TopScoreDocCollector collector = TopScoreDocCollector.create(nDocs, !weight.scoresDocsOutOfOrder());
public static TopScoreDocCollector create(int numHits, boolean docsScoredInOrder) {
  if (docsScoredInOrder) {
    return new InOrderTopScoreDocCollector(numHits);
  } else {
    return new OutOfOrderTopScoreDocCollector(numHits);
  }
}
Depending on whether documents will be returned in increasing docID order, this creates either an InOrderTopScoreDocCollector or an OutOfOrderTopScoreDocCollector; the two differ only in how they collect documents.
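For orientation, here is a minimal sketch of how this collector is typically driven from application code, using the Lucene 3.x-era API this article describes. The index path, field name, and term are invented for the example:

IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index"))); // path is a placeholder
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("contents", "lucene"));   // invented field and term
// Mirrors the internal call above: collect the top 10 hits, scored in docID order.
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
TopDocs topDocs = collector.topDocs();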
2.4.4.2 Collecting Document IDs
Once the Scorer tree and the SumScorer tree have been built, IndexSearcher.search(Weight, Filter, Collector) makes the following call:
scorer.score(collector), shown below, which repeatedly obtains docIDs from the merged posting lists and collects them.
public void score(Collector collector) throws IOException {
  collector.setScorer(this);
  while ((doc = countingSumScorer.nextDoc()) != NO_MORE_DOCS) {
    collector.collect(doc);
  }
}
The collect method of InOrderTopScoreDocCollector is as follows:
public void collect(int doc) throws IOException {
  float score = scorer.score();
  totalHits++;
  if (score <= pqTop.score) {
    return;
  }
  pqTop.doc = doc + docBase;
  pqTop.score = score;
  pqTop = pq.updateTop();
}
The collect method of OutOfOrderTopScoreDocCollector is as follows:
public void collect(int doc) throws IOException {
  float score = scorer.score();
  totalHits++;
  doc += docBase;
  if (score < pqTop.score || (score == pqTop.score && doc > pqTop.doc)) {
    return;
  }
  pqTop.doc = doc;
  pqTop.score = score;
  pqTop = pq.updateTop();
}
As the code above shows, the collector first computes the document's score and then, based on that score, pushes the document into a priority queue (a min-heap); at the end, the top N documents are taken from the queue.
One question remains: if we want 10 documents and documents 8, 9, 10, 11, and 12 all have the same score, which ones are dropped? Lucene's policy is that when scores tie, the smaller docID wins: 8, 9, and 10 are kept, while 11 and 12 are dropped.
As noted above, when the collector is created, InOrderTopScoreDocCollector or OutOfOrderTopScoreDocCollector is chosen according to whether documents will be returned in increasing docID order.
For InOrderTopScoreDocCollector, documents arrive in order, so a later docID is always greater than the earlier ones; whenever score <= pqTop.score, the document can simply be discarded.
For OutOfOrderTopScoreDocCollector, documents do not arrive in order: when score < pqTop.score the document is discarded outright, and when score == pqTop.score the incoming docID is compared with the one at the head of the queue; if it is greater the document is discarded, if smaller it enters the queue.
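This tie-breaking rule can be reproduced with a plain java.util.PriorityQueue. The following is only an illustrative sketch of the ordering, not Lucene's real HitQueue (which pre-fills the heap with sentinel entries and uses updateTop, as seen above):

import java.util.Comparator;
import java.util.PriorityQueue;

public class TieBreakDemo {
  public static void main(String[] args) {
    final int numHits = 3;
    // The heap head is the "worst" hit: lowest score; on equal scores, the LARGEST
    // docID, so that among equal scores the smaller docIDs survive.
    PriorityQueue<int[]> pq = new PriorityQueue<int[]>(numHits, new Comparator<int[]>() {
      public int compare(int[] a, int[] b) {     // a[0] = docID, a[1] = score
        if (a[1] != b[1]) return a[1] - b[1];    // lower score is worse
        return b[0] - a[0];                      // larger docID is worse
      }
    });
    for (int doc = 8; doc <= 12; doc++) {        // five docs with identical score 1
      pq.offer(new int[] { doc, 1 });
      if (pq.size() > numHits) pq.poll();        // evict the current worst
    }
    while (!pq.isEmpty())
      System.out.println("doc " + pq.poll()[0]); // prints 10, 9, 8 -- 11 and 12 were dropped
  }
}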


2.4.4.3 Returning the N Highest-Scoring Documents
In IndexSearcher.search(Weight, Filter, int), once collection finishes, collector.topDocs() is called to return the N highest-scoring documents:
public final TopDocs topDocs() {
  return topDocs(0, totalHits < pq.size() ? totalHits : pq.size());
}
public final TopDocs topDocs(int start, int howMany) {
  int size = totalHits < pq.size() ? totalHits : pq.size();
  howMany = Math.min(size - start, howMany);
  ScoreDoc[] results = new ScoreDoc[howMany];
  // Since pq is a min-heap, the lowest-scoring documents must be popped first. For example,
  // if pq holds 50 documents in total and we want documents 5 through 10, the 40 lowest-scoring
  // documents are popped first.
  for (int i = pq.size() - start - howMany; i > 0; i--) { pq.pop(); }
  populateResults(results, howMany);
  return newTopDocs(results, start);
}
protected void populateResults(ScoreDoc[] results, int howMany) {
  // Then pop documents 5 through 10 from pq, placing them into results in descending score order.
  for (int i = howMany - 1; i >= 0; i--) {
    results[i] = pq.pop();
  }
}
protected TopDocs newTopDocs(ScoreDoc[] results, int start) {
  return results == null ? EMPTY_TOPDOCS : new TopDocs(totalHits, results);
}

2.4.5 How Lucene Reads Index Information During the Search Phase

The preceding sections described how posting lists are merged and scores computed during search. That information, however, has to be read from the index files; the following analyzes how it is read.
Only two kinds of information are involved: term dictionary information and posting list information.
The term dictionary is read while the Scorer tree is being built; the actual reading happens when the leaf node, TermScorer, is constructed.
The posting lists are read while they are being merged; the actual reading happens in the leaf node's TermScorer.nextDoc().
2.4.5.1 Reading the Term Dictionary
This step takes place in TermWeight.scorer(IndexReader, boolean, boolean), whose code is as follows:
public Scorer scorer(IndexReader reader, boolean scoreDocsInOrder, boolean topScorer) throws IOException {
  TermDocs termDocs = reader.termDocs(term);
  if (termDocs == null)
    return null;
  return new TermScorer(this, termDocs, similarity, reader.norms(term.field()));
}
ReadOnlySegmentReader.termDocs(Term) locates the Term and creates the TermDocs object used to read its posting list:
public TermDocs termDocs(Term term) throws IOException {
  ensureOpen();
  TermDocs termDocs = termDocs();
  termDocs.seek(term);
  return termDocs;
}
The termDocs() call first constructs a SegmentTermDocs object for reading posting lists:
protected SegmentTermDocs(SegmentReader parent) {
  this.parent = parent;
  this.freqStream = (IndexInput) parent.core.freqStream.clone(); // used to read freqs
  synchronized (parent) {
    this.deletedDocs = parent.deletedDocs;
  }
  this.skipInterval = parent.core.getTermsReader().getSkipInterval();
  this.maxSkipLevels = parent.core.getTermsReader().getMaxSkipLevels();
}
SegmentTermDocs.seek(Term) reads the Term from the dictionary and points freqStream at the Term's posting list:
public void seek(Term term) throws IOException {
  TermInfo ti = parent.core.getTermsReader().get(term);
  seek(ti, term);
}
TermInfosReader.get(Term, boolean) reads the Term from the dictionary to obtain its TermInfo; the code is as follows:
private TermInfo get(Term term, boolean useCache) throws IOException {
  if (size == 0) return null;
  ensureIndexIsRead();
  TermInfo ti;
  ThreadResources resources = getThreadResources();
  SegmentTermEnum enumerator = resources.termEnum;
  seekEnum(enumerator, getIndexOffset(term));
  enumerator.scanTo(term);
  if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
    ti = enumerator.termInfo();
  } else {
    ti = null;
  }
  return ti;
}
When an IndexReader opens an index directory, the term index is read from the tii file into the indexPointers array. TermInfosReader.seekEnum(SegmentTermEnum enumerator, int indexOffset) then uses indexPointers to position the enumerator at the region of the tis file in which the Term must reside.
private final void seekEnum(SegmentTermEnum enumerator, int indexOffset) throws IOException {
  enumerator.seek(indexPointers[indexOffset],
                 (indexOffset * totalIndexInterval) - 1,
                 indexTerms[indexOffset], indexInfos[indexOffset]);
}
final void SegmentTermEnum.seek(long pointer, int p, Term t, TermInfo ti) {
  input.seek(pointer);
  position = p;
  termBuffer.set(t);
  prevBuffer.reset();
  termInfo.set(ti);
}
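The indexOffset passed to seekEnum comes from getIndexOffset(term) in the get() method above; in the 3.x source this is essentially a binary search over the in-memory indexTerms array, roughly as in this simplified sketch:

private final int getIndexOffset(Term term) {
  int lo = 0;
  int hi = indexTerms.length - 1;
  while (hi >= lo) {
    int mid = (lo + hi) >>> 1;                  // midpoint without overflow
    int delta = term.compareTo(indexTerms[mid]);
    if (delta < 0)      hi = mid - 1;
    else if (delta > 0) lo = mid + 1;
    else                return mid;             // exact hit on an index term
  }
  return hi;  // greatest index term that is <= the target term
}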
SegmentTermEnum.scanTo(Term) then scans forward through that region, term by term, until the target Term is found:
final int scanTo(Term term) throws IOException {
  scanBuffer.set(term);
  int count = 0;
  // Keep reading the next term into termBuffer; the target term sits in scanBuffer.
  // When the two compare equal, the target Term has been found.
  while (scanBuffer.compareTo(termBuffer) > 0 && next()) {
    count++;
  }
  return count;
}
public final boolean next() throws IOException {
  if (position++ >= size - 1) {
    prevBuffer.set(termBuffer);
    termBuffer.reset();
    return false;
  }
  prevBuffer.set(termBuffer);
  // read the Term's text
  termBuffer.read(input, fieldInfos);
  // read docFreq, i.e., the number of documents containing this Term
  termInfo.docFreq = input.readVInt();
  // read the file-pointer deltas
  termInfo.freqPointer += input.readVLong();
  termInfo.proxPointer += input.readVLong();
  if (termInfo.docFreq >= skipInterval)
      termInfo.skipOffset = input.readVInt();
  indexPointer += input.readVLong();
  return true;
}
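Note the += on freqPointer and proxPointer above: the tis file stores these file pointers as deltas from the previous term's entry, so they must be accumulated. A tiny worked example with invented numbers:

// Invented values: suppose three consecutive terms store freqPointer deltas 100, 40, 25.
long freqPointer = 0;
long[] deltasFromFile = { 100L, 40L, 25L };     // what input.readVLong() would return
for (long delta : deltasFromFile) {
  freqPointer += delta;                         // same pattern as termInfo.freqPointer += ...
  System.out.println(freqPointer);              // prints 100, 140, 165 -- the absolute pointers
}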
The code of TermBuffer.read(IndexInput, FieldInfos) is as follows:
public final void read(IndexInput input, FieldInfos fieldInfos) throws IOException {
  this.term = null;                             // invalidate the cached Term
  int start = input.readVInt();                 // length of the prefix shared with the previous term
  int length = input.readVInt();                // length of the new suffix
  int totalLength = start + length;
  text.setLength(totalLength);
  input.readChars(text.result, start, length);  // read only the suffix characters
  this.field = fieldInfos.fieldName(input.readVInt());
}
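Here start is the length of the prefix shared with the previous term and length is the length of the new suffix; terms in the tis file are stored prefix-compressed. A small hypothetical illustration of what read() reconstructs:

// Previous term "apple"; the next term "apply" would be stored as start=4 ("appl" shared),
// length=1, plus the single suffix char 'y'. The values below stand in for the readVInt()s.
char[] prev = "apple".toCharArray();            // buffer still holds the previous term
int start = 4, length = 1;
char[] next = new char[start + length];
System.arraycopy(prev, 0, next, 0, start);      // keep the shared prefix
next[start] = 'y';                              // what readChars() would fill in
System.out.println(new String(next));           // prints "apply"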
SegmentTermDocs.seek(TermInfo ti, Term term) uses the TermInfo to point freqStream at the Term's posting list position:
void seek(TermInfo ti, Term term) throws IOException {
  count = 0;
  FieldInfo fi = parent.core.fieldInfos.fieldInfo(term.field);
  df = ti.docFreq;
  doc = 0;
  freqBasePointer = ti.freqPointer;
  proxBasePointer = ti.proxPointer;
  skipPointer = freqBasePointer + ti.skipOffset;
  freqStream.seek(freqBasePointer);
  haveSkipped = false;
}
2.4.5.2 Reading the Posting Lists
Once the Term's TermInfo has been read and freqStream points at the Term's posting list, the posting data is read in TermScorer.nextDoc():
public int nextDoc() throws IOException {
  pointer++;
  if (pointer >= pointerMax) {
    pointerMax = termDocs.read(docs, freqs);   
    if (pointerMax != 0) {
      pointer = 0;
    } else {
      termDocs.close();
      return doc = NO_MORE_DOCS;
    }
  }
  doc = docs[pointer];
  return doc;
}
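Note that nextDoc() does not touch the file for every document: when pointer runs past pointerMax, termDocs.read(docs, freqs), shown next, refills two parallel buffer arrays in one bulk call (32-entry arrays in the 3.x TermScorer), and pointer then walks through the buffer. The pattern, sketched on its own:

int[] docs = new int[32];                       // buffer size as in the 3.x TermScorer
int[] freqs = new int[32];
int n;
while ((n = termDocs.read(docs, freqs)) != 0) { // one bulk read per batch of postings
  for (int i = 0; i < n; i++) {
    System.out.println("doc=" + docs[i] + " freq=" + freqs[i]);
  }
}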
The code of SegmentTermDocs.read(int[], int[]) is as follows (note that return i belongs after the loop):

public int read(final int[] docs, final int[] freqs) throws IOException {
  final int length = docs.length;
  int i = 0;
  while (i < length && count < df) {
    // read the docCode: a delta-encoded docID with freq==1 folded into the low bit
    final int docCode = freqStream.readVInt();
    doc += docCode >>> 1;              // shift off the low bit to get the docID delta
    if ((docCode & 1) != 0)            // low bit set means freq is one
      freq = 1;
    else
      freq = freqStream.readVInt();    // otherwise read freq explicitly
    count++;
    if (deletedDocs == null || !deletedDocs.get(doc)) {
      docs[i] = doc;
      freqs[i] = freq;
      ++i;
    }
  }
  return i;
}
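To make the docCode scheme concrete, here is a small worked example with invented postings (doc=5, freq=1) and (doc=8, freq=3):

// doc=5, freq=1: delta=5, freq==1 -> docCode = (5 << 1) | 1 = 11; no freq value follows.
// doc=8, freq=3: delta=3, freq>1  -> docCode = (3 << 1)     = 6; followed by the VInt 3.
int doc = 0, p = 0;
int[] frqStream = { 11, 6, 3 };                 // the VInt values as laid out in the .frq file
while (p < frqStream.length) {
  int docCode = frqStream[p++];
  doc += docCode >>> 1;                         // accumulate the docID delta
  int freq = ((docCode & 1) != 0) ? 1 : frqStream[p++];
  System.out.println("doc=" + doc + " freq=" + freq); // prints doc=5 freq=1, then doc=8 freq=3
}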