Custom Lucene Scoring | Architexa - Working with Large Codebases >> Blog Archive



Scoring Variables

Lucene’s default scoring system works well in most cases. It combines seven factors to determine the final ranking of each document (definitions adapted from lucenetutorial.com):

  • tf = term frequency in document = measure of how often a term appears in the document
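  • idf = inverse document frequency = measure of how rare the term is across the index (the fewer documents contain it, the higher the value)
  • coord = number of terms in the query that were found in the document (documents matching more of the query score higher)
  • lengthNorm = measure of the importance of a term according to the total number of terms in the field (a match in a shorter field counts for more)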
  • queryNorm = normalization factor so that queries can be compared
  • boost (index) = boost of the field at index-time
  • boost (query) = boost of the field at query-time

These factors are fed into the Similarity algorithm, which is described in detail in Lucene’s Javadoc and tutorial pages. For now, I will focus on the simplest way to adjust scoring: “Boost”.
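To make the role of each factor (and of boost in particular) more concrete, here is a minimal plain-Java sketch of how the classic TF-IDF formula combines them. It assumes the formulas documented for Lucene's older default similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(number of terms)); newer Lucene versions default to a different similarity. The class and method names below are hypothetical and are not part of the Lucene API, and the sketch ignores the lossy one-byte encoding Lucene applies to stored norms.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mirrors the classic TF-IDF factors without depending on Lucene. */
final class ClassicScoreSketch {

    /** Hypothetical holder for one query term's statistics against one document. */
    static final class TermMatch {
        final int freq;          // occurrences of the term in the document field (0 = no match)
        final int docFreq;       // number of documents in the index containing the term
        final double queryBoost; // query-time boost for this term
        TermMatch(int freq, int docFreq, double queryBoost) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.queryBoost = queryBoost;
        }
    }

    static double tf(int freq) { return Math.sqrt(freq); }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double lengthNorm(int numTermsInField) { return 1.0 / Math.sqrt(numTermsInField); }

    static double coord(int overlap, int maxOverlap) { return (double) overlap / maxOverlap; }

    /** Normalizes over all query terms so scores from different queries are comparable. */
    static double queryNorm(List<TermMatch> queryTerms, int numDocs) {
        double sumOfSquares = 0.0;
        for (TermMatch t : queryTerms) {
            double weight = idf(t.docFreq, numDocs) * t.queryBoost;
            sumOfSquares += weight * weight;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    /** score = coord * queryNorm * sum(tf * idf^2 * queryBoost * indexBoost * lengthNorm). */
    static double score(List<TermMatch> queryTerms, int numDocs, int fieldLength, double indexBoost) {
        double norm = indexBoost * lengthNorm(fieldLength);
        double sum = 0.0;
        int overlap = 0;
        for (TermMatch t : queryTerms) {
            if (t.freq > 0) {
                overlap++;
            }
            double termIdf = idf(t.docFreq, numDocs);
            sum += tf(t.freq) * termIdf * termIdf * t.queryBoost * norm;
        }
        return coord(overlap, queryTerms.size()) * queryNorm(queryTerms, numDocs) * sum;
    }

    public static void main(String[] args) {
        // Two-term query; only the first term occurs in the document.
        List<TermMatch> query = Arrays.asList(new TermMatch(4, 9, 1.0), new TermMatch(0, 9, 1.0));
        System.out.println("index boost 1.0: " + score(query, 10, 4, 1.0));
        System.out.println("index boost 2.0: " + score(query, 10, 4, 2.0));
    }
}
```

In this sketch an index-time boost scales the score directly, while a query-time boost only matters relative to the other terms in the query, because queryNorm divides the overall magnitude back out.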


Read full article from Custom Lucene Scoring | Architexa – Working with Large Codebases » Blog Archive

