Boosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting by Recency
Round the publish date to the hour at indexing time (Commons Lang DateUtils), so document dates line up with NOW/HOUR in the boost function and boost values stay stable within the hour:
Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);
FunctionQuery: computes a value for each document, which can be used for ranking or sorting
Use the recip function with the ms function:
q={!boost b=$recency v=$qq}&
recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&
qq=wine
Prefer edismax over dismax if possible; edismax supports the boost parameter directly:
q=wine&
boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
recip is a highly tunable function: recip(x,m,a,b) implements a / (m*x + b)
Here m = 3.16e-11 (roughly 1 / the number of milliseconds in a year), a = 0.08, b = 0.05, and x = document age in milliseconds
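To see how those parameters behave, the boost can be evaluated by hand. This is a standalone sketch (plain Java, no Solr required) of the same a / (m*x + b) computation:

```java
public class RecipBoost {
    // recip(x,m,a,b) = a / (m*x + b), the same formula Solr's recip function computes
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        double m = 3.16e-11;           // ~ 1 / (milliseconds in a year)
        double a = 0.08, b = 0.05;
        double msPerYear = 365.25 * 24 * 60 * 60 * 1000;  // ~ 3.156e10

        // A brand-new document (age 0) gets the maximum boost: 0.08 / 0.05 = 1.6
        System.out.println(recip(0, m, a, b));

        // A one-year-old document has m*x close to 1, so the boost falls to roughly 0.076
        System.out.println(recip(msPerYear, m, a, b));
    }
}
```

So with these parameters the multiplier decays from 1.6 at publication toward zero as the document ages, which is why a floor on the boost (covered below) is worth adding.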
Boost should be a multiplier on the relevancy score
The {!boost b=} syntax confuses the spell checker, so use spellcheck.q to be explicit:
q={!boost b=$recency v=$qq}&spellcheck.q=wine
Put a floor under the old-age penalty using max, so very old documents keep a minimum boost:
max(recip(…), 0.20)
Not a one-size-fits-all solution; academic research has focused on when recency boosting should be applied
Boosting by Popularity
Score is based on the number of unique views
Views are not known at indexing time
View counts should be broken into time slots
<fieldType name="externalPopularityScore"
           keyField="id"
           defVal="1"
           stored="false" indexed="false"
           class="solr.ExternalFileField"
           valType="pfloat"/>
<field name="popularity"
       type="externalPopularityScore"/>
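The popularity values themselves live in a plain text file in the index data directory, named external_<fieldname> (here, external_popularity), with one id=value pair per line; Solr reloads it when a new searcher opens. A sketch of what the file might contain (the ids and scores are made up for illustration):

```
doc-101=1.9
doc-102=1.0
doc-103=1.35
```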
For big, high-traffic sites, computing view counts is a log-analysis job
A perfect problem for MapReduce
Take a look at Hive for analyzing large volumes of log data
The minimum popularity score is 1 (not zero), so the multiplicative boost never zeroes out an unpopular document; scores range up to 2 or more:
1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)
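A minimal sketch of that weighted sum, assuming recent, lastWeek, and lastMonth are view counts normalized to the range [0,1] (the normalization scheme is an assumption, and the terms elided by "…" in the formula above are omitted here too):

```java
public class PopularityScore {
    // Weighted, time-sliced popularity. Each argument is assumed to be a
    // view count normalized to [0,1] for its time slot.
    static double popularity(double recent, double lastWeek, double lastMonth) {
        // Floor of 1 so the multiplicative boost never zeroes out a document
        return 1 + (0.4 * recent + 0.3 * lastWeek + 0.2 * lastMonth);
    }

    public static void main(String[] args) {
        // A document maxed out in every slot scores 1.9 (= 1 + 0.9) with these three terms
        System.out.println(popularity(1.0, 1.0, 1.0));
        // A document with no views keeps the minimum score of 1.0
        System.out.println(popularity(0.0, 0.0, 0.0));
    }
}
```

Weighting the recent slot highest means popularity fades over time rather than accumulating forever, which pairs naturally with the recency boost above.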
Watch out for the spell checker's “buildOnCommit” setting: reloading the external file requires a commit, and rebuilding the spellcheck index on every commit is expensive
Filtering By User Preferences
Easy approach is to build basic preference fields in to the index:
Content types of interest – content_type
High-level categories of interest – category
Sources of interest – source
We had too many categories and sources that a user could enable or disable for basic filtering to work
Our solution: a custom SearchComponent with a connection to a JDBC DataSource
Connects to the database to read each user's preferences
Caches the resulting DocIdSet in a Solr FastLRUCache
Cached values are marked dirty via a simple timestamp passed in the request
Declared in solrconfig.xml:
<searchComponent class="demo.solr.PreferencesComponent"
                 name="pref">
  <str name="jdbcJndi">jdbc/solr</str>
</searchComponent>
Parameters passed in the query string:
pref.id – the user's primary key in the database
pref.mod – timestamp of when the preferences were last modified
This is how the Solr side knows the database has been updated
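The dirty-check itself is simple. Here is a standalone sketch of the idea, with a plain HashMap standing in for Solr's FastLRUCache and a made-up Entry type (in the real component the cached value is a Lucene DocIdSet per user):

```java
import java.util.HashMap;
import java.util.Map;

public class PreferenceCache {
    // Cached filter plus the pref.mod timestamp it was built from
    static class Entry {
        final long builtAt;
        final Object docIdSet;  // stands in for a Lucene DocIdSet
        Entry(long builtAt, Object docIdSet) {
            this.builtAt = builtAt;
            this.docIdSet = docIdSet;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();  // stands in for FastLRUCache
    int dbQueries = 0;  // counts trips to the database, for illustration

    // prefId and prefMod mirror the pref.id and pref.mod request parameters
    Object getFilter(String prefId, long prefMod) {
        Entry e = cache.get(prefId);
        if (e == null || e.builtAt < prefMod) {
            // Cache miss or stale entry: rebuild the filter from the database
            // (the SQL query and DocIdSet construction are stubbed out here)
            dbQueries++;
            e = new Entry(prefMod, new Object());
            cache.put(prefId, e);
        }
        return e.docIdSet;
    }
}
```

Passing the modification timestamp in the request means Solr never has to poll the database: a cached entry is reused until the client reports a newer pref.mod.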
Use simple SQL queries to compute a list of disabled categories, feeds, and types
Lucene FieldCaches for category, source, type
The custom SearchComponent is included in the list of components for the edismax search handler:
<arr name="last-components">
<str>pref</str>
</arr>
Summary
Use the recip and ms functions to boost recent documents
Use ExternalFileField to load popularity scores calculated outside the index
Use a custom SearchComponent with a Solr FastLRUCache to filter documents using complex user preferences
Read the full article: Boosting Documents in Solr by Recency, Popularity, and User Preferences.