Code Reduction: Boosting Documents in Lucene



In Information Retrieval, a document's relevance to a search is measured by how similar it is to the query. There are several similarity models implemented in Lucene, and you can implement your own by extending the Similarity class and using the index statistics that Lucene saves. Documents can also be assigned a static score to denote their importance in the overall corpus, irrespective of the query being executed, e.g. their popularity, ratings or PageRank.

Prior to Lucene 4.0, you could assign a static score to a document by calling document.setBoost. Internally, the boost was applied to every field of the document, by multiplying the field's boost factor with the document's. However, as explained here, this never worked correctly and, depending on the type of query executed, might not affect the document's rank at all.

With the addition of DocValues to Lucene, boosting documents is as easy as adding a NumericDocValuesField and using it in a CustomScoreQuery, which multiplies the computed score by the value of the 'boost' field.

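import org.apache.lucene.document.Field.Store;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.LongFieldSource;

// assumes an open IndexWriter 'writer' and, after indexing, an IndexSearcher 'searcher'
Document doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
// per-document static score, stored as a numeric doc value
doc.add(new NumericDocValuesField("boost", 2L));
writer.addDocument(doc);

// search for 'test' while boosting by field 'boost'
Query baseQuery = new TermQuery(new Term("f", "test"));
Query boostQuery = new FunctionQuery(new LongFieldSource("boost"));
// final score = score of baseQuery * value of 'boost'
Query q = new CustomScoreQuery(baseQuery, boostQuery);
searcher.search(q, 10);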
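
To make the effect of that multiplication concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The Hit class, the document names and the scores are made up for illustration; only the "query score times boost" rule is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoostSketch {
    /** Hypothetical stand-in for a search hit: name, query score, stored 'boost' value. */
    public static final class Hit {
        public final String name;
        public final double queryScore;
        public final long boost;

        public Hit(String name, double queryScore, long boost) {
            this.name = name;
            this.queryScore = queryScore;
            this.boost = boost;
        }

        /** Multiplies the query score by the static boost. */
        public double boostedScore() {
            return queryScore * boost;
        }
    }

    /** Returns document names ordered by boosted score, highest first. */
    public static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((x, y) -> Double.compare(y.boostedScore(), x.boostedScore()));
        List<String> names = new ArrayList<>();
        for (Hit h : sorted) {
            names.add(h.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("a", 0.8, 1L),   // better text match, neutral boost
            new Hit("b", 0.5, 2L));  // weaker text match, boost of 2
        System.out.println(rank(hits)); // prints [b, a]
    }
}
```

Document "b" matches the query less well, but its boost of 2 lifts its final score to 1.0, ahead of the 0.8 of document "a". A boost of 1 leaves the query score unchanged.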

The new Expressions module can also be used for boosting documents by writing a simple formula, as depicted below. While it's more verbose than using CustomScoreQuery, it makes boosting with more complex formulas trivial, e.g. sqrt(_score) + ln(boost).

// the formula refers to the variables bound below
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE)); // the query score
bindings.add(new SortField("boost", SortField.Type.LONG));   // the 'boost' doc value
// reverse = true sorts by the computed value in descending order
Sort sort = new Sort(expr.getSortField(bindings, true));
searcher.search(baseQuery, null, 10, sort);
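
As a rough illustration of how a formula such as sqrt(_score) + ln(boost) behaves, the following plain-Java sketch evaluates the same arithmetic outside Lucene. The class name, method name and input values are assumptions made for this example; inside Lucene the expression string would be evaluated by the Expressions module instead.

```java
public class FormulaSketch {
    /** Plain-Java equivalent of the formula sqrt(_score) + ln(boost). */
    public static double combined(double score, long boost) {
        // Math.log is the natural logarithm
        return Math.sqrt(score) + Math.log(boost);
    }

    public static void main(String[] args) {
        System.out.println(combined(4.0, 1L));    // sqrt(4) + ln(1) = 2.0
        System.out.println(combined(4.0, 1000L)); // roughly 8.9, far below 4 * 1000
    }
}
```

Compared with plain multiplication, the logarithm dampens very large boost values. In this sketch a boost of 1 contributes nothing, and a boost of 0 would produce negative infinity, so with a formula like this the stored values would need to stay at 1 or above.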

Now that Lucene can update a NumericDocValuesField in place (through IndexWriter.updateNumericDocValue), you can incorporate frequently changing values (popularity, ratings, price, last-modified time...) into the boosting factor without re-indexing the document every time one of them changes.
Read full article from Code Reduction: Boosting Documents in Lucene
