DigitalPebble's Blog: Using Payloads with DisMaxQParser in SOLR



This post describes how to use payloads while keeping the functionality of the DisMaxQParser in SOLR.
SOLR already has a field type for analysing payloads:
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
    The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
    a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f
    Attributes of the DelimitedPayloadTokenFilterFactory :
    "delimiter" - a one character delimiter. Default is | (pipe)
    "encoder" - how to encode the following value into a payload
      float -> org.apache.lucene.analysis.payloads.FloatEncoder,
      integer -> o.a.l.a.p.IntegerEncoder
      identity -> o.a.l.a.p.IdentityEncoder
      Fully Qualified class name implementing PayloadEncoder, Encoder must have a no arg constructor.
    -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldtype>
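To make the "foo|1.4" example in the comment above concrete, here is a minimal, Lucene-free sketch of what happens to such a token. The class and method names are hypothetical, and the 4-byte big-endian float layout is an assumption about what the float encoder produces rather than something taken from the Lucene source.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; this is not the Lucene implementation. */
final class DelimitedPayloadSketch {

    /** Returns the text before the first delimiter, or the whole token if there is none. */
    static String term(String token, char delimiter) {
        int i = token.indexOf(delimiter);
        return i < 0 ? token : token.substring(0, i);
    }

    /**
     * Encodes the value after the first delimiter as a 4-byte big-endian float
     * (assumed layout), or returns null when the token carries no payload.
     */
    static byte[] payload(String token, char delimiter) {
        int i = token.indexOf(delimiter);
        if (i < 0) {
            return null;
        }
        float value = Float.parseFloat(token.substring(i + 1));
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Reads a float back from the payload bytes, starting at the given offset. */
    static float decode(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 4).getFloat();
    }
}
```

With this sketch, term("foo|1.4", '|') gives the indexed term and decode(payload("foo|1.4", '|'), 0) gives back the float that a payload-aware Similarity would use at query time.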
We can also define a custom Similarity to use with the payloads:
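// Imports assume the Lucene 2.9/3.x package layout.
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity
{
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)
    {
        // use the float stored in the payload as the score, if there is one
        if (length > 0) {
            return PayloadHelper.decodeFloat(payload, offset);
        }
        // no payload on this token: neutral score
        return 1.0f;
    }
}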
Then specify this in the SOLR schema:
<!-- schema.xml -->
<similarity class="uk.org.company.solr.PayloadSimilarity" />
We now need a QueryParser plugin in order to use the payloads in the search and, as mentioned above, I want to keep the functionality of the DisMaxQParser.

The problem is that we need PayloadTermQuery objects instead of TermQuery objects. These are created deep down in the object hierarchy and, AFAIK, cannot simply be swapped from the DisMaxQParser.
I have implemented a modified version of DisMaxQParser which rewrites the main part of the query (a.k.a. userQuery in the implementation) and substitutes the TermQueries with PayloadTermQueries.
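The substitution idea can be sketched independently of Lucene with a toy query tree. Everything below is hypothetical: Node, Leaf, PayloadLeaf and Group merely stand in for Query, TermQuery, PayloadTermQuery and the composite queries (BooleanQuery, DisjunctionMaxQuery), and the sketch builds a new tree rather than modifying clauses in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model of the recursive rewrite; all names are hypothetical stand-ins. */
final class RewriteSketch {

    /** Stand-in for Lucene's Query. */
    static abstract class Node {}

    /** Stand-in for a TermQuery: a term on a given field. */
    static class Leaf extends Node {
        final String field;
        final String text;
        Leaf(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /** Stand-in for a PayloadTermQuery. */
    static class PayloadLeaf extends Leaf {
        PayloadLeaf(String field, String text) {
            super(field, text);
        }
    }

    /** Stand-in for a composite query holding sub-queries. */
    static class Group extends Node {
        final List<Node> children;
        Group(List<Node> children) {
            this.children = children;
        }
    }

    /** Returns a tree in which leaves on payload fields are replaced by PayloadLeaf. */
    static Node rewrite(Node input, Set<String> payloadFields) {
        if (input instanceof Group) {
            // composite: rewrite every child recursively
            List<Node> rewritten = new ArrayList<Node>();
            for (Node child : ((Group) input).children) {
                rewritten.add(rewrite(child, payloadFields));
            }
            return new Group(rewritten);
        }
        Leaf leaf = (Leaf) input;
        if (payloadFields.contains(leaf.field)) {
            return new PayloadLeaf(leaf.field, leaf.text);
        }
        // fields without payloads are left untouched
        return input;
    }
}
```

The real parser has to do the same walk over Lucene's own query classes and also carry over details such as boosts and slop.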
First we'll create the QParser itself, which a QParserPlugin (registered further down as PLDisMaxQParserPlugin) exposes to SOLR:
// Imports assume the Lucene/Solr 3.x package layout.
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadFunction;
import org.apache.lucene.search.payloads.PayloadNearQuery;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;
import org.apache.solr.util.SolrPluginUtils;

public class PLDisMaxQParser extends DisMaxQParser {

    public static final String PAYLOAD_FIELDS_PARAM_NAME = "plf";

    public PLDisMaxQParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    protected HashSet<String> payloadFields = new HashSet<String>();

    private static final PayloadFunction func = new MaxPayloadFunction();

    float tiebreaker = 0f;

    @Override
    protected void addMainQuery(BooleanQuery query, SolrParams solrParams)
            throws ParseException {
        Map<String, Float> phraseFields = SolrPluginUtils
                .parseFieldBoosts(solrParams.getParams(DisMaxParams.PF));

        tiebreaker = solrParams.getFloat(DisMaxParams.TIE, 0.0f);

        // get the comma separated list of fields used for payload
        String[] plfarray = solrParams.get(PAYLOAD_FIELDS_PARAM_NAME, "")
                .split(",");
        for (String plf : plfarray) {
            // skip blank entries, e.g. when plf is not specified at all
            if (plf.trim().length() > 0)
                payloadFields.add(plf.trim());
        }

        /*
         * a parser for dealing with user input, which will convert things to
         * DisjunctionMaxQueries
         */
        SolrPluginUtils.DisjunctionMaxQueryParser up = getParser(queryFields,
                DisMaxParams.QS, solrParams, tiebreaker);

        /* for parsing sloppy phrases using DisjunctionMaxQueries */
        SolrPluginUtils.DisjunctionMaxQueryParser pp = getParser(phraseFields,
                DisMaxParams.PS, solrParams, tiebreaker);

        /* * * Main User Query * * */
        parsedUserQuery = null;
        String userQuery = getString();
        altUserQuery = null;
        if (userQuery == null || userQuery.trim().length() < 1) {
            // If no query is specified, we may have an alternate
            altUserQuery = getAlternateUserQuery(solrParams);
            query.add(altUserQuery, BooleanClause.Occur.MUST);
        } else {
            // There is a valid query string
            userQuery = SolrPluginUtils.partialEscape(
                    SolrPluginUtils.stripUnbalancedQuotes(userQuery))
                    .toString();
            userQuery = SolrPluginUtils.stripIllegalOperators(userQuery)
                    .toString();

            parsedUserQuery = getUserQuery(userQuery, up, solrParams);

            // recursively rewrite the elements of the query
            Query payloadedUserQuery = rewriteQueriesAsPLQueries(parsedUserQuery);
            query.add(payloadedUserQuery, BooleanClause.Occur.MUST);

            Query phrase = getPhraseQuery(userQuery, pp);
            if (null != phrase) {
                query.add(phrase, BooleanClause.Occur.SHOULD);
            }
        }
    }

    /** Substitutes original query objects with payload ones **/
    private Query rewriteQueriesAsPLQueries(Query input) {
        Query output = input;
        // rewrite TermQueries
        if (input instanceof TermQuery) {
            Term term = ((TermQuery) input).getTerm();

            // check that this is done on a field that has payloads
            if (payloadFields.contains(term.field()) == false)
                return input;

            output = new PayloadTermQuery(term, func);
        }
        // rewrite PhraseQueries
        else if (input instanceof PhraseQuery) {
            PhraseQuery pin = (PhraseQuery) input;
            Term[] terms = pin.getTerms();
            int slop = pin.getSlop();
            boolean inorder = false;

            // check that this is done on a field that has payloads
            if (terms.length > 0
                    && payloadFields.contains(terms[0].field()) == false)
                return input;

            SpanQuery[] clauses = new SpanQuery[terms.length];
            for (int i = 0; i < terms.length; i++)
                clauses[i] = new PayloadTermQuery(terms[i], func);

            // phrase queries : the PayloadNearQuery keeps its default
            // function i.e. average
            output = new PayloadNearQuery(clauses, slop, inorder);
        }
        // recursively rewrite DJMQs
        else if (input instanceof DisjunctionMaxQuery) {
            DisjunctionMaxQuery s = ((DisjunctionMaxQuery) input);
            DisjunctionMaxQuery t = new DisjunctionMaxQuery(tiebreaker);
            Iterator<Query> disjunctsiterator = s.iterator();
            while (disjunctsiterator.hasNext()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(disjunctsiterator
                        .next());
                t.add(rewrittenQuery);
            }
            output = t;
        }
        // recursively rewrite BooleanQueries (clauses are modified in place)
        else if (input instanceof BooleanQuery) {
            for (BooleanClause clause : (List<BooleanClause>) ((BooleanQuery) input)
                    .clauses()) {
                Query rewrittenQuery = rewriteQueriesAsPLQueries(clause
                        .getQuery());
                clause.setQuery(rewrittenQuery);
            }
        }

        // keep the boost of the original query
        output.setBoost(input.getBoost());
        return output;
    }

    @Override
    public void addDebugInfo(NamedList<Object> debugInfo) {
        super.addDebugInfo(debugInfo);
        // list the payload fields in the debug output
        if (this.payloadFields.size() > 0) {
            Iterator<String> iter = this.payloadFields.iterator();
            while (iter.hasNext())
                debugInfo.add("payloadField", iter.next());
        }
    }

}
Next, register the parser plugin in solrconfig.xml:
<queryParser name="payload" class="com.digitalpebble.solr.PLDisMaxQParserPlugin" />
Then specify it for the requestHandler:
<str name="defType">payload</str>
 
<!-- plf : comma separated list of field names --> 
 <str name="plf">
  payloads
 </str>
 
The fields listed in the parameter plf will be queried with Payload query objects.  Remember that you can use &debugQuery=true to get the details of the scores and check that the payloads are being used.
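As a small illustration of how the plf value is interpreted, here is a minimal, stand-alone sketch of the parsing step. The class and method names are hypothetical, and unlike a plain split-and-trim this version also ignores blank entries, so an unset or empty plf yields no payload fields.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper showing how a comma separated plf value maps to field names. */
final class PlfParamSketch {

    /** Splits on commas, trims surrounding whitespace and drops blank entries. */
    static Set<String> parse(String plf) {
        Set<String> fields = new LinkedHashSet<String>();
        if (plf == null) {
            return fields;
        }
        for (String name : plf.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }
}
```

Because each name is trimmed, the line breaks and spaces around "payloads" in the str element above do not matter.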