Getting Started with Payloads - Lucidworks



There are three parts to taking advantage of payloads in Lucene.  Solr requires a fourth step, which I will explain in a moment.
  1. Add a Payload to one or more Tokens during indexing.
  2. Override the Similarity class to handle scoring payloads.
  3. Use a payload-aware Query during your search.
For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery.  Thus, the additional (fourth) step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too!  Please donate if you do this!)

Adding Payloads during indexing

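// Imports assume the Lucene 2.9-era package layout.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

class PayloadAnalyzer extends Analyzer {
    private final PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace so each "token|payload" pair stays in one token.
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new LowerCaseFilter(result);
        // Keep the text before '|' as the token and store the rest as its payload.
        result = new DelimitedPayloadTokenFilter(result, '|', encoder);
        return result;
    }
}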
The DelimitedPayloadTokenFilter (DPTF) used above allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value.
Characters before the delimiter are the token; those after it are the payload.
For example, if the delimiter is '|', then for the string "foo|bar", "foo" is the token and "bar" is the payload.
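
To make the delimiter convention concrete, here is a minimal standalone sketch that uses only the JDK. It is not Lucene's DelimitedPayloadTokenFilter; the class name DelimitedTokenSketch and its methods are hypothetical, and splitting at the first occurrence of the delimiter is an assumption of this sketch. It follows the same order as the analyzer above: whitespace tokenization, lowercasing, then separating the payload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Standalone illustration only; this is NOT Lucene's DelimitedPayloadTokenFilter.
class DelimitedTokenSketch {

    /** A token's text plus its raw payload string (null when no delimiter was present). */
    static final class TokenWithPayload {
        final String term;
        final String payload;

        TokenWithPayload(String term, String payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Splits one raw token at the first occurrence of the delimiter (a sketch assumption). */
    static TokenWithPayload split(String raw, char delimiter) {
        int idx = raw.indexOf(delimiter);
        if (idx < 0) {
            return new TokenWithPayload(raw, null);
        }
        return new TokenWithPayload(raw.substring(0, idx), raw.substring(idx + 1));
    }

    /** Whitespace-tokenizes, lowercases, then separates each token from its payload. */
    static List<TokenWithPayload> analyze(String text, char delimiter) {
        List<TokenWithPayload> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                out.add(split(raw.toLowerCase(Locale.ROOT), delimiter));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (TokenWithPayload t : analyze("The quick|2.0 Fox|5.0", '|')) {
            System.out.println(t.term + " -> " + t.payload);
        }
    }
}
```

In this sketch a token without a delimiter simply carries no payload, so you only need to mark up the tokens you care about.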

The DPTF then passes the text after the delimiter to the PayloadEncoder, an interface that tells the DPTF how to convert the payload to a byte array. Also note that Lucene’s contrib/analysis package contains several other TokenFilters for adding payloads to a Token and, of course, you can write your own as well.  Furthermore, the PayloadHelper class can help encode and decode payloads for common types.
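
As a rough picture of what "converting the payload to a byte array" means for a float, the following JDK-only sketch encodes a float into 4 bytes and decodes it again from a given offset. The class FloatPayloadCodecSketch is hypothetical, and the big-endian byte order is an assumption of this sketch; in a real application you would call Lucene's PayloadHelper rather than roll your own codec.

```java
import java.nio.ByteBuffer;

// Standalone illustration only; real code would use Lucene's PayloadHelper.
class FloatPayloadCodecSketch {

    /** Encodes a float as 4 bytes in big-endian order (ByteBuffer's default). */
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Decodes the 4 bytes starting at offset back into a float. */
    static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(5.0f);
        System.out.println(payload.length);           // prints 4
        System.out.println(decodeFloat(payload, 0));  // prints 5.0
    }
}
```

The offset parameter matters because a payload may sit inside a larger shared buffer rather than start at index 0.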

Overriding the Similarity Class

The next step, which should happen before indexing, is to override the Similarity class to handle payloads.  While it isn’t strictly required that this happens before indexing in THIS case, it is a good habit in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded).
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
        // We can ignore length here because we know the payload is encoded as a 4-byte float.
        return PayloadHelper.decodeFloat(bytes, offset);
    }
}

Executing the Query

Currently, Lucene has one payload-aware Query, the BoostingTermQuery (BTQ for short; see [2] for another payload-aware query that may be in Lucene 2.9), which can be used just like any other query.  For instance:
// dir is the Directory holding the index; true opens it read-only.
IndexSearcher searcher = new IndexSearcher(dir, true);
// Use the payload-aware Similarity defined above so that scorePayload is applied.
searcher.setSimilarity(new PayloadSimilarity());
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc doc = topDocs.scoreDocs[i];
    System.out.println("Doc: " + doc.toString());
    System.out.println("Explain: " + searcher.explain(btq, doc.doc));
}
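
To give a feel for how payload values might factor into a score, here is a small JDK-only sketch. It assumes, without the article confirming it, that the payload-aware query multiplies the ordinary term score by the average of the scorePayload values seen at the matching positions; the class PayloadScoreSketch and its method are hypothetical, and the output of searcher.explain is the place to check what your Lucene version actually does.

```java
// Standalone illustration only. The averaging rule below is an ASSUMPTION about how a
// payload-aware query might combine per-position payload scores, not a confirmed API contract.
class PayloadScoreSketch {

    /**
     * Multiplies the base term score by the mean payload score.
     * When no payloads were seen, the base score is returned unchanged.
     */
    static float score(float baseScore, float[] payloadScores) {
        if (payloadScores.length == 0) {
            return baseScore;
        }
        float sum = 0f;
        for (float p : payloadScores) {
            sum += p;
        }
        return baseScore * (sum / payloadScores.length);
    }

    public static void main(String[] args) {
        // "fox" matched at two positions carrying payloads 1.0 and 3.0 (average 2.0).
        System.out.println(score(2.0f, new float[] {1.0f, 3.0f})); // prints 4.0
    }
}
```

Under this assumption, a payload of 1.0 leaves the score untouched, values above 1.0 boost the match, and values below 1.0 demote it.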

Next Steps

As you can see from above, getting started with Payloads is pretty easy.  In reality, the only hard part is determining what exactly to put in your payload and then how it should factor into your score.  Lucene takes care of the rest.  Tools such as UIMA and OpenNLP, along with proprietary offerings from other vendors, can often provide higher-level lexical, syntactic, and semantic information about tokens, giving you the power to create very expressive payloads and richer search applications.
Read full article from Getting Started with Payloads – Lucidworks
