Lucene’s docFreq Got You Down? Replace It With a Custom Collector
A Counting Collector

When you run a search in Lucene, you can pass the searcher a Collector. It feels a bit like the visitor pattern: the search calls your collector once for each document that matches your query. To count matches, the collector only has to increment a counter on each call.
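using Lucene.Net.Index;
using Lucene.Net.Search;

// Counts the documents that match a query; scores and document IDs are ignored.
public class CounterCollector : Collector
{
    public int Count { get; private set; }

    // Call this before reusing the collector for another search.
    public void Reset()
    {
        Count = 0;
    }

    // Called once for each matching document.
    public override void Collect(int docID)
    {
        Count = Count + 1;
    }

    // Neither the scorer nor the per-segment reader is needed just to count.
    public override void SetScorer(Scorer scorer) { }
    public override void SetNextReader(IndexReader reader, int docBase) { }

    // The visiting order does not matter when counting.
    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}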

How Do I Use One Of Those?

public int GetNumberOfDocumentsForTerm(Term term)
{
    // searchIndex is a helper defined elsewhere (not shown) that supplies the searcher.
    return searchIndex(searcher =>
    {
        // Previously:
        // return searcher.DocFreq(term);

        // Now: count the matching documents with the collector.
        var counterCollector = new CounterCollector();
        searcher.Search(new TermQuery(term), counterCollector);
        return counterCollector.Count;
    });
}

Take Away

Searching with a custom collector is quite easy: pass an instance to the Search method, then interrogate it afterwards for the information you require.
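
The same pattern works for collecting things other than a count. As a minimal sketch, assuming the same Collector API used by CounterCollector above, the hypothetical DocIdCollector below (the class name and its DocIds property are not from the original post) records the ID of every matching document. The one subtlety is that Collect receives an ID relative to the current segment reader, so the docBase passed to SetNextReader has to be added to get an index-wide ID.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical example: gathers the index-wide ID of every matching document.
public class DocIdCollector : Collector
{
    private readonly List<int> docIds = new List<int>();
    private int docBase;

    public IList<int> DocIds
    {
        get { return docIds; }
    }

    public override void Collect(int docID)
    {
        // docID is relative to the current segment reader, so add its offset.
        docIds.Add(docBase + docID);
    }

    public override void SetScorer(Scorer scorer) { }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        // Remember where the new segment's documents start in the whole index.
        this.docBase = docBase;
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}
```

It would be used just like the counter: pass an instance to searcher.Search(new TermQuery(term), collector) and read collector.DocIds once the search returns.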
