Lucene Highlighter HowTo
The last part of the indexing side is term vectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”
For the Highlighter, term vectors need to be available, and you have a choice: either compute and store them in the index at index time, or compute them on the fly when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time.
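For example, a field can be indexed that way with the pre-4.0 Field API (a minimal sketch; it assumes an open IndexWriter named writer and the filename and contents values already in hand):

Document doc = new Document();
doc.add(new Field("filename", filename, Field.Store.YES, Field.Index.NOT_ANALYZED));
// Compute and store term vectors with positions and offsets at index time.
doc.add(new Field("contents", contents, Field.Store.YES, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);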

With the index ready for presenting contextual fragments, let's move on to generating them while processing a search request. Below is a typical “Hello World” type search block.
QueryParser qp = new QueryParser("contents", analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);
 
for (int i = 0; i < hits.scoreDocs.length; i++) {
    int docId = hits.scoreDocs[i].doc;
    Document doc = searcher.doc(docId);
    String filename = doc.get("filename");
    String contents = doc.get("contents");
 
    // Fetch the best 5 fragments of roughly 100 characters each for this hit.
    String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
                   query, "contents", contents, 5, 100);
}
First we obtain the TokenStream. The call used here assumes term vectors were not stored in the index at index time, so the stored contents are simply re-analyzed.
Next are the SpanScorer and SimpleSpanFragmenter. These work together to break the contents into 100-character fragments and rank them by relevance. You can use SpanScorer with SimpleSpanFragmenter, or QueryScorer with SimpleFragmenter. The full details can be found in the Javadocs.
Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, so the Highlighter did not look at the full text when generating fragments. This was a problem because the match I was seeking was near the end of the contents.
Finally, we tell the Highlighter to return the best 5 fragments.
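Putting those pieces together, here is a minimal sketch of what a helper like hlu.getFragmentsWithHighlightedTerms could look like. It assumes the Lucene 3.x contrib highlighter API and pairs QueryScorer (which performs span-based scoring since Lucene 2.9) with SimpleSpanFragmenter; the class and method body below are a reconstruction for illustration, not the author's HighligherUtil.java.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;

public class HighlighterUtil {

    // Hypothetical equivalent of the helper called as hlu in the search block above.
    public String[] getFragmentsWithHighlightedTerms(Analyzer analyzer, Query query,
            String fieldName, String fieldContents, int fragmentNumber, int fragmentSize)
            throws IOException, InvalidTokenOffsetsException {

        // Obtain a TokenStream by re-analyzing the stored text; this path does not
        // require term vectors to be read back from the index.
        TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(fieldContents));

        // QueryScorer ranks fragments by relevance to the query; SimpleSpanFragmenter
        // breaks the contents into fragmentSize-character fragments.
        QueryScorer scorer = new QueryScorer(query, fieldName);
        Highlighter highlighter = new Highlighter(scorer);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, fragmentSize));

        // For large documents, raise the analysis limit so matches near the end
        // of the contents are not missed (see the note above).
        highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);

        // Return the best-scoring fragments (5 in the example above).
        return highlighter.getBestFragments(stream, fieldContents, fragmentNumber);
    }
}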
The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java.
You can also find examples of using the Highlighter in the Lucene SVN repository, specifically HighlighterTest.java.
As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.
Please read the full article at Lucene Highlighter HowTo.
