hasCode.com » Blog Archive » Lucene by Example: Specifying Analyzers on a per-field-basis and writing a custom Analyzer/Tokenizer



Lucene Dependencies
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
<version>${lucene.version}</version>
</dependency>
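The dependencies above reference a ${lucene.version} property, which has to be defined in the POM (or a parent POM). A minimal sketch, assuming version 4.8.0; this value is inferred from the Version.LUCENE_48 constant and the Luke 4.8.0 jar used later in this article and is not stated explicitly in the source:

```xml
<properties>
  <!-- Assumed value; adjust to the Lucene release you are targeting. -->
  <lucene.version>4.8.0</lucene.version>
</properties>
```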
Writing a custom Analyzer and Tokenizer
As the final result, we want to create multiple tokens from an input string by splitting it at the character “e”, case-insensitively. The character “e” itself should not be part of the resulting tokens.

Two simple examples:
123e456e789 -> the tokens “123”, “456” and “789” should be extracted
123Eabcexyz -> the tokens “123”, “abc” and “xyz” should be extracted
Character based Tokenizer

Creating a tokenizer for this scenario is easy, as Lucene already provides the CharTokenizer class that our custom tokenizer can extend.

We just need to implement one method that receives the code point of the current character as its parameter and returns whether that character belongs to a token, i.e. whether it is anything other than the character “e”.

Older Lucene Versions: The CharTokenizer API changed in Lucene 3.1; in older versions, override isTokenChar(char c) instead.
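import java.io.Reader;

import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

public class ECharacterTokenizer extends CharTokenizer {
 public ECharacterTokenizer(final Version matchVersion, final Reader input) {
  super(matchVersion, input);
 }

 // Every code point except 'e' and 'E' belongs to a token; both cases act as separators.
 @Override
 protected boolean isTokenChar(final int character) {
  return 'e' != Character.toLowerCase(character);
 }
}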
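To illustrate how a character predicate like isTokenChar determines token boundaries, here is a plain-Java model of the idea: consecutive code points accepted by the predicate are collected into a token, and any rejected code point ends the current token. This is only a sketch and not Lucene's implementation; the class CharSplitSketch and its split method are hypothetical names, and details such as offsets and maximum token length are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Plain-Java model of predicate-driven tokenization; not Lucene code. */
final class CharSplitSketch {

    /** Collects maximal runs of code points accepted by isTokenChar. */
    static List<String> split(String input, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isTokenChar.test(codePoint)) {
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                // A rejected character closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Treats both 'e' and 'E' as separators.
        IntPredicate notE = c -> Character.toLowerCase(c) != 'e';
        System.out.println(split("123e456e789", notE)); // [123, 456, 789]
        System.out.println(split("123Eabcexyz", notE)); // [123, abc, xyz]
    }
}
```

A predicate that only compares against the lowercase 'e' would keep "123Eabc" together as one token, so the case-insensitive check has to happen at tokenization time and cannot be left to a lowercase filter that runs afterwards.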
Analyzer using the custom Tokenizer
Now that we’ve got a simple tokenizer, we’d like to add an analyzer that uses our tokenizer and makes our analysis case-insensitive.

This is really easy, as Lucene already provides a LowerCaseFilter, and we can assemble our solution with the following few lines of code:
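import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public class ECharacterAnalyser extends Analyzer {
 private final Version version;

 public ECharacterAnalyser(final Version version) {
  this.version = version;
 }

 // No-argument constructor, just for Luke ;)
 public ECharacterAnalyser() {
  version = Version.LUCENE_48;
 }

 // Tokenize with the custom tokenizer first, then lowercase every token.
 @Override
 protected TokenStreamComponents createComponents(final String field,
   final Reader reader) {
  Tokenizer tokenizer = new ECharacterTokenizer(version, reader);
  TokenStream filter = new LowerCaseFilter(version, tokenizer);
  return new TokenStreamComponents(tokenizer, filter);
 }
}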
Specifying Analyzers for each Document Field
An analyzer is used when input is stored in the index and when input is processed in a search query.
Lucene’s PerFieldAnalyzerWrapper allows us to specify an analyzer for each field name and a default analyzer as a fallback.
In the following example, we’re assigning two analyzers to the fields named “email” and “specials”, and the StandardAnalyzer is used as the default for every other field not specified in the mapping.
// "version" (a Version) and "index" (a Directory) are assumed to be defined elsewhere.
Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
analyzerPerField.put("email", new KeywordAnalyzer());
analyzerPerField.put("specials", new ECharacterAnalyser(version));
// Fields without an explicit mapping fall back to the StandardAnalyzer.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
  new StandardAnalyzer(version), analyzerPerField);
IndexWriterConfig config = new IndexWriterConfig(version, analyzer)
  .setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(index, config);
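
To make the fallback behaviour concrete, here is a plain-Java model of the lookup: a field name is resolved against a map, and a default is used when no entry exists. It does not use Lucene at all; the class PerFieldLookupSketch is a hypothetical name, and the three lambdas are crude stand-ins for KeywordAnalyzer, ECharacterAnalyser and StandardAnalyzer, not faithful reimplementations.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java model of per-field analyzer lookup with a default; not Lucene code. */
final class PerFieldLookupSketch {

    private final Function<String, List<String>> defaultAnalysis;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldLookupSketch(Function<String, List<String>> defaultAnalysis,
                         Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalysis = defaultAnalysis;
        this.perField = perField;
    }

    /** Uses the field's own analysis if one is mapped, else the default. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalysis).apply(text);
    }

    /** Builds a lookup that roughly mirrors the mapping in the snippet above. */
    static PerFieldLookupSketch example() {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        // Stand-in for KeywordAnalyzer: the whole input is a single token.
        perField.put("email", text -> Collections.singletonList(text));
        // Stand-in for ECharacterAnalyser: lowercase, then split on runs of 'e'.
        perField.put("specials",
                text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("e+")));
        // Crude stand-in for StandardAnalyzer: lowercase and split on whitespace.
        return new PerFieldLookupSketch(
                text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")),
                perField);
    }

    public static void main(String[] args) {
        PerFieldLookupSketch analyzer = example();
        System.out.println(analyzer.analyze("email", "Jane@Example.com"));  // [Jane@Example.com]
        System.out.println(analyzer.analyze("specials", "123Eabcexyz"));    // [123, abc, xyz]
        System.out.println(analyzer.analyze("title", "Lucene by Example")); // [lucene, by, example]
    }
}
```

The field “title” has no entry in the map, so it is handled by the default, just as unmapped fields are handled by the default analyzer passed to PerFieldAnalyzerWrapper.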

Appendix A: Installing/Running Luke – The Lucene Index Toolbox
Luke is available as a project maintained by Dmitry Kan on GitHub, with releases ready for download.
Appendix C: Running Luke with custom Analyzers
To use a custom analyzer in Luke, we need to add it to the class-path when running Luke. As the command line option -jar makes Java ignore class-paths set with -cp, we need to skip this option and specify the main class to run, as in the following example:

java -cp "luke-with-deps-4.8.0.jar:/path/to/lucene-per-field-analyzer-tutorial/target/lucene-perfield-analyzer-tutorial-1.0.0.jar" org.getopt.luke.Luke

This allows us to enter the fully qualified name of our analyzer class in the Luke analyzer tool and run an analysis.
Read full article from hasCode.com » Blog Archive » Lucene by Example: Specifying Analyzers on a per-field-basis and writing a custom Analyzer/Tokenizer
