Getting Started with Payloads – Lucidworks
There are three parts to taking advantage of payloads in Lucene. Solr requires a fourth step, which I will explain in a moment.
- Add a Payload to one or more Tokens during indexing.
- Override the Similarity class to handle scoring payloads.
- Use a Payload aware Query during your search.
For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery. Thus, the fourth step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too! Please donate if you do this!)
Adding Payloads during indexing

The following Analyzer wires Lucene's DelimitedPayloadTokenFilter (DPTF) into the analysis chain:
class PayloadAnalyzer extends Analyzer {
  private PayloadEncoder encoder;

  PayloadAnalyzer(PayloadEncoder encoder) {
    this.encoder = encoder;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new DelimitedPayloadTokenFilter(result, '|', encoder);
    return result;
  }
}
The DPTF allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value. Characters before the delimiter are the "token"; those after are the payload. For example, if the delimiter is '|', then for the string "foo|bar", foo is the token and "bar" is the payload. The DPTF then encodes the payload using the PayloadEncoder, an interface that tells the DPTF how to convert the payload to a byte array. Also note that Lucene's contrib/analysis package contains several other TokenFilters for adding payloads to a Token, and, of course, you can write your own as well. Furthermore, the PayloadHelper class can help encode and decode payloads for common types.
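Since the Lucene classes involved need a full analysis chain to run, here is a minimal plain-Java sketch of the two ideas at work: splitting the raw token at the delimiter, and packing a float payload into a 4-byte array (PayloadHelper encodes floats the same way, via Float.floatToIntBits). This is an illustration of the concept, not the Lucene implementation; the class and method names are made up for the sketch.

```java
// Illustration only: mimics what DelimitedPayloadTokenFilter plus a
// float PayloadEncoder do conceptually. Not the Lucene classes.
class PayloadSketch {

  // Split "fox|2.5" into the token "fox" and the payload text "2.5".
  static String[] split(String raw, char delimiter) {
    int pos = raw.indexOf(delimiter);
    if (pos < 0) {
      return new String[] { raw, null }; // no payload attached
    }
    return new String[] { raw.substring(0, pos), raw.substring(pos + 1) };
  }

  // Encode a float payload as 4 bytes, most significant byte first.
  static byte[] encodeFloat(float payload) {
    int bits = Float.floatToIntBits(payload);
    return new byte[] {
      (byte) (bits >>> 24), (byte) (bits >>> 16),
      (byte) (bits >>> 8),  (byte) bits
    };
  }
}
```

Splitting "fox|2.5" this way yields the token "fox" with "2.5" left over to be encoded into the 4-byte payload that gets stored in the index alongside the token's position.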
Overriding the Similarity Class
The next step, which should happen before indexing, is to override the Similarity class to handle payloads. While it isn't strictly required that this happen before indexing in THIS case, it is a good habit in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded).
class PayloadSimilarity extends DefaultSimilarity {
  @Override
  public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
    // length can be ignored here, because we know the payload is encoded as 4 bytes
    return PayloadHelper.decodeFloat(bytes, offset);
  }
}
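For reference, the decoding that PayloadHelper.decodeFloat performs can be sketched in plain Java: reassemble the four stored bytes (most significant first) into the int bit pattern of the original float. The class name here is invented for the sketch; only the byte layout mirrors PayloadHelper.

```java
// Illustration of decoding a 4-byte payload back into a float, matching
// the big-endian layout used in the encoding sketch earlier.
class DecodeSketch {
  static float decodeFloat(byte[] bytes, int offset) {
    int bits = ((bytes[offset] & 0xFF) << 24)
             | ((bytes[offset + 1] & 0xFF) << 16)
             | ((bytes[offset + 2] & 0xFF) << 8)
             |  (bytes[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }
}
```

The masking with 0xFF matters: Java bytes are signed, so without it a byte like 0x80 would sign-extend and corrupt the reassembled bit pattern.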
Executing the Query
Currently, Lucene has one payload aware Query called the BoostingTermQuery (BTQ for short, see [2] for another Payload aware query that may be in Lucene 2.9), which can be used just like any other query. For instance:
IndexSearcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(payloadSimilarity);
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
  ScoreDoc doc = topDocs.scoreDocs[i];
  System.out.println("Doc: " + doc.toString());
  System.out.println("Explain: " + searcher.explain(btq, doc.doc));
}
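As I understand the BTQ's scorer, the effect of the payloads is to multiply the ordinary term score by the average of the scorePayload values over the term's occurrences in a document (with no change when no payloads were seen). A plain-Java sketch of that combination, as an illustration of the idea rather than the actual scorer code, with names invented for the sketch:

```java
// Illustration: how a payload-aware query might fold per-occurrence
// payload scores into the final document score. Not Lucene's scorer.
class PayloadScoring {
  static float boostedScore(float termScore, float[] payloadScores) {
    if (payloadScores.length == 0) {
      return termScore; // no payloads seen: score is unchanged
    }
    float sum = 0f;
    for (float p : payloadScores) {
      sum += p;
    }
    // Multiply the base score by the average payload score.
    return termScore * (sum / payloadScores.length);
  }
}
```

This is why the explain output from the loop above is worth printing: it shows exactly how much the payload contributed relative to the base term score for each hit.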
Next Steps
As you can see from the above, getting started with payloads is pretty easy. In reality, the only hard part is determining what exactly to put in your payload and how it should factor into your score; Lucene takes care of the rest. Tools like UIMA and OpenNLP, as well as proprietary offerings from various vendors, can often be used to provide higher-level lexical, syntactic and semantic information about tokens, giving you the power to create very expressive payloads and richer search applications.