Boosting Documents in Solr by Recency, Popularity, and User Preferences



Boosting By Recency
Round the publish date to the hour at indexing time:
Date published = DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);
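DateUtils here looks like the Apache Commons Lang helper. As a rough sketch of the same idea using only the JDK, the following rounds a timestamp to the nearest hour. The class and method names are made up for illustration, and rounding in UTC is an assumption.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class HourRounding {
    // Round to the nearest hour in UTC: shift forward by half an hour, then truncate.
    static Instant roundToHour(Instant t) {
        return t.plus(30, ChronoUnit.MINUTES).truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        // Sample timestamps are arbitrary.
        System.out.println(roundToHour(Instant.parse("2012-05-09T10:29:59Z")));
        System.out.println(roundToHour(Instant.parse("2012-05-09T10:30:00Z")));
    }
}
```

Hour granularity at indexing time matches the NOW/HOUR rounding used in the boost function at query time.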


FunctionQuery: computes a value for each document, which can then be used for ranking or sorting

Use the recip function with the ms function:
q={!boost b=$recency v=$qq}&
 recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&
 qq=wine

Use edismax instead of dismax if possible, since edismax has a multiplicative boost parameter:
 q=wine&
 boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)

recip is a highly tunable function
recip(x,m,a,b) implements a / (m*x + b)
m = 3.16E-11, a = 0.08, b = 0.05, x = document age in milliseconds
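To get a feel for these constants, here is a small sketch that evaluates the same formula outside Solr for a few document ages. Note that 3.16E-11 is roughly 1 divided by the number of milliseconds in a year, so m*x is approximately the age in years. The class name and the chosen ages are illustrative only.

```java
class RecipBoost {
    static final double MS_PER_DAY = 24 * 60 * 60 * 1000.0;

    // Same shape as Solr's recip(x,m,a,b): a / (m*x + b)
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        double m = 3.16e-11, a = 0.08, b = 0.05;
        // Print the boost for documents of increasing age.
        for (int days : new int[] {0, 1, 7, 30, 365}) {
            double ageMs = days * MS_PER_DAY;
            System.out.printf("%4d days -> boost %.4f%n", days, recip(ageMs, m, a, b));
        }
    }
}
```

With these values a brand-new document gets a boost of about 1.6 (a/b), and a one-year-old document gets about 0.076, so the curve drops steeply in the first months and then flattens.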

Boost should be a multiplier on the relevancy score 

The {!boost b=} syntax confuses the spell checker, so pass the user's terms explicitly with spellcheck.q:
q={!boost b=$recency v=$qq}&spellcheck.q=wine 
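When the request is built in client code, the braces, dollar signs, and spaces in these values need URL encoding. The following is a minimal sketch using only the JDK (the two-argument URLEncoder.encode overload with a Charset needs Java 10 or later). The class and method names are hypothetical; the parameter names and values are the ones shown above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class BoostQueryParams {
    // Collect the request parameters; the user's terms go into both qq and spellcheck.q.
    static Map<String, String> params(String userQuery) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "{!boost b=$recency v=$qq}");
        p.put("recency", "recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)");
        p.put("qq", userQuery);
        p.put("spellcheck.q", userQuery);
        return p;
    }

    // URL-encode each value and join the pairs with '&'.
    static String toQueryString(Map<String, String> p) {
        return p.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(params("wine")));
    }
}
```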

Bottom out the old age penalty using max, so the boost never drops below the floor:
max(recip(…), 0.20)
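A floor on the boost means taking the larger of the recip value and the floor, which in function terms is max rather than min. The sketch below assumes the recip constants from earlier (m = 3.16e-11, a = 0.08, b = 0.05) and solves a / (m*x + b) = 0.20 for x to find the age at which the floor takes over. All names are illustrative.

```java
class RecencyFloor {
    static final double M = 3.16e-11, A = 0.08, B = 0.05, FLOOR = 0.20;

    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    // Boost with a lower bound: old documents never fall below FLOOR.
    static double flooredBoost(double ageMs) {
        return Math.max(recip(ageMs, M, A, B), FLOOR);
    }

    // Age at which the unfloored curve reaches the floor:
    // a / (m*x + b) = floor  =>  x = (a/floor - b) / m
    static double floorAgeMs() {
        return (A / FLOOR - B) / M;
    }

    public static void main(String[] args) {
        double msPerDay = 24 * 60 * 60 * 1000.0;
        System.out.printf("floor reached after about %.0f days%n", floorAgeMs() / msPerDay);
        System.out.printf("boost at 2 years: %.2f%n", flooredBoost(2 * 365 * msPerDay));
    }
}
```

With these constants the floor is reached after roughly 128 days, so documents older than about four months all receive the same boost.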

Not a one-size-fits-all solution; academic research has focused on when to apply it

Boosting By Popularity
Score based on number of unique views
Not known at indexing time
View count should be broken into time slots
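As one way to picture unique views per time slot, the sketch below counts distinct users per document within a time window. The View record layout (document id, user id, timestamp) is a hypothetical log format, not something defined in this article.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UniqueViews {
    // Hypothetical log record: who viewed which document, and when.
    static final class View {
        final String docId;
        final String userId;
        final long timestampMs;
        View(String docId, String userId, long timestampMs) {
            this.docId = docId;
            this.userId = userId;
            this.timestampMs = timestampMs;
        }
    }

    // Count distinct users per document among views with fromMs <= timestamp < toMs.
    static Map<String, Integer> uniqueViews(List<View> views, long fromMs, long toMs) {
        Map<String, Set<String>> usersByDoc = new HashMap<>();
        for (View v : views) {
            if (v.timestampMs >= fromMs && v.timestampMs < toMs) {
                usersByDoc.computeIfAbsent(v.docId, k -> new HashSet<>()).add(v.userId);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        usersByDoc.forEach((doc, users) -> counts.put(doc, users.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<View> log = Arrays.asList(
                new View("doc1", "alice", 10L),
                new View("doc1", "alice", 20L),   // repeat view by the same user
                new View("doc1", "bob", 30L),
                new View("doc2", "alice", 500L)); // outside the window below
        System.out.println(uniqueViews(log, 0L, 100L));
    }
}
```

Running the same count over several windows (for example the last day, week, and month) gives one number per slot for each document.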

<fieldType name="externalPopularityScore"
           keyField="id"
           defVal="1"
           stored="false" indexed="false"
           class="solr.ExternalFileField"
           valType="pfloat"/>

<field name="popularity" 
       type="externalPopularityScore" />
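ExternalFileField reads its values from a plain text file named external_<fieldname> in the index data directory, with one key=value line per document. A minimal sketch of producing that file for the popularity field follows. The class name, document ids, and scores are made up, and the temporary directory stands in for the real data directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ExternalFileWriter {
    // One "key=value" line per document, sorted by key (sorting is optional).
    static List<String> toLines(Map<String, Double> scores) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> e : new TreeMap<>(scores).entrySet()) {
            lines.add(e.getKey() + "=" + e.getValue());
        }
        return lines;
    }

    // Writes external_<fieldName> into the given directory and returns its path.
    static Path write(Path dataDir, String fieldName, Map<String, Double> scores) throws IOException {
        Path file = dataDir.resolve("external_" + fieldName);
        return Files.write(file, toLines(scores), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("doc2", 1.0);
        scores.put("doc1", 1.35);
        Path dir = Files.createTempDirectory("solr-data"); // stand-in for the index data dir
        Path file = write(dir, "popularity", scores);
        System.out.println(file);
        Files.readAllLines(file).forEach(System.out::println);
    }
}
```

Documents missing from the file fall back to the defVal declared on the field type.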

For big, high traffic sites, use log analysis
Perfect problem for MapReduce
Take a look at Hive for analyzing large volumes of log data

Minimum popularity score is 1 (not zero) … up to 2 or more
1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)
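A small sketch of this weighting follows. It uses only the three slots and weights written out above and leaves out whatever further terms the formula elides. It also assumes each slot value has been normalized to the range 0 to 1 (here by dividing by the highest view count in that slot), which is an assumption and not something stated in the formula.

```java
class PopularityScore {
    // 1 + weighted sum of the three slots shown in the text; further slots are omitted.
    static double score(double recent, double lastWeek, double lastMonth) {
        return 1.0 + (0.4 * recent + 0.3 * lastWeek + 0.2 * lastMonth);
    }

    // Hypothetical normalization: a document's unique views relative to the
    // most-viewed document in the same slot.
    static double normalize(long views, long maxViews) {
        return maxViews == 0 ? 0.0 : (double) views / maxViews;
    }

    public static void main(String[] args) {
        double recent = normalize(50, 200);
        double lastWeek = normalize(300, 1000);
        double lastMonth = normalize(900, 3000);
        System.out.printf("popularity = %.3f%n", score(recent, lastWeek, lastMonth));
    }
}
```

Starting from 1 keeps the score safe to use as a multiplier: a document with no views is left unchanged instead of being zeroed out.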

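Watch out for the spell checker's “buildOnCommit” setting: if you commit just to pick up new popularity scores, the spellcheck index is rebuilt on each of those commits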

Filtering By User Preferences
The easy approach is to build basic preference fields into the index:
Content types of interest – content_type
High-level categories of interest – category
Sources of interest – source
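With fields like these, a user's preferences can be applied as ordinary filter queries. The sketch below builds one filter string per field, each of which would be sent as its own fq parameter. The class name and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class PreferenceFilters {
    // Builds a filter such as category:("wine" OR "food") from the user's selected values.
    static String filter(String field, List<String> values) {
        StringJoiner joiner = new StringJoiner(" OR ", field + ":(", ")");
        for (String v : values) {
            // Quote each value and escape backslashes and embedded quotes.
            joiner.add("\"" + v.replace("\\", "\\\\").replace("\"", "\\\"") + "\"");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println("fq=" + filter("content_type", Arrays.asList("article", "video")));
        System.out.println("fq=" + filter("category", Arrays.asList("wine", "food")));
        System.out.println("fq=" + filter("source", Arrays.asList("Example Times")));
    }
}
```

This works while the number of values per user stays small; long value lists make the filter strings large and hard to cache.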

Users could enable or disable too many categories and sources for basic filtering to work
Instead, we used a custom SearchComponent with a connection to a JDBC DataSource

Connects to a database
Caches DocIdSet in a Solr FastLRUCache
Cached values marked as dirty using a simple timestamp passed in the request
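The timestamp-based invalidation can be sketched independently of Solr. The class below is a stand-in built on LinkedHashMap, not Solr's FastLRUCache, and the cached value is a generic type where the real component would hold a DocIdSet. An entry is reloaded whenever the timestamp in the request is newer than the one stored with the cached value. All names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class PreferenceCache<V> {
    // A cached value plus the modification timestamp it was loaded for.
    private static final class Cached<V> {
        final long mod;
        final V value;
        Cached(long mod, V value) { this.mod = mod; this.value = value; }
    }

    private final Map<String, Cached<V>> map;
    int loads = 0; // how often the loader ran, kept for illustration

    PreferenceCache(final int maxSize) {
        // accessOrder=true makes LinkedHashMap evict the least recently used entry.
        this.map = new LinkedHashMap<String, Cached<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Cached<V>> eldest) {
                return size() > maxSize;
            }
        };
    }

    // Returns the cached value unless it is missing or older than prefMod,
    // in which case the loader (for example a database query) runs again.
    synchronized V get(String prefId, long prefMod, Function<String, V> loader) {
        Cached<V> c = map.get(prefId);
        if (c == null || c.mod < prefMod) {
            c = new Cached<>(prefMod, loader.apply(prefId));
            map.put(prefId, c);
            loads++;
        }
        return c.value;
    }

    public static void main(String[] args) {
        PreferenceCache<String> cache = new PreferenceCache<>(100);
        System.out.println(cache.get("42", 1000L, id -> "loaded at mod 1000"));
        System.out.println(cache.get("42", 1000L, id -> "never loaded"));
        System.out.println(cache.get("42", 2000L, id -> "reloaded at mod 2000"));
    }
}
```

Passing the timestamp with every request means the search side never has to poll the database to find out whether a user's preferences changed.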

Declared in solrconfig.xml:
  <searchComponent
      class="demo.solr.PreferencesComponent"
      name="pref">
    <str name="jdbcJndi">jdbc/solr</str>
  </searchComponent>

Parameters passed in the query string:
pref.id = primary key in db
pref.mod = preferences modified on timestamp
So the Solr side knows the database has been updated
Use simple SQL queries to compute a list of disabled categories, feeds, and types
Lucene FieldCaches for category, source, type
Custom SearchComponent included in the list of components for the edismax search handler:
  <arr name="last-components">
    <str>pref</str>
  </arr>

Summary
Use recip & ms functions to boost recent documents
Use ExternalFileField to load popularity scores calculated outside the index
Use a custom SearchComponent with a Solr FastLRUCache to filter documents using complex user preferences
