Tokenizing with Lucene | LucidBox

A really simple way of tokenizing without using Lucene would be to just split the String using a RegEx. Here is a naive one-liner:

String tokens[] = testString.split("\\W");
The regex here is “\W”, the negated word-character class. This might work well enough for what you need, but you lose positional information and may not get the expected results in other languages.
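For example, a minimal sketch of this approach (the sample sentence is made up purely for illustration):

String testString = "Lucene is a search library. It tokenizes text!";
String[] tokens = testString.split("\\W");
for (String token : tokens) {
  // Adjacent delimiters (the ". " above) produce empty strings, which need filtering out.
  if (!token.isEmpty()) {
    System.out.println(token);
  }
}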

Lucene’s Filters and Tokenizers (which extend TokenStream) store attributes for each token, depending on their functionality. From the StandardTokenizer we get attributes containing the token itself, the token’s type, and positional information. We call getAttribute with the class of the Attribute we’re interested in; CharTermAttribute, for example, is the one StandardTokenizer uses to hold the actual token text. See the javadoc on getAttribute for additional usage details.
The calls to the tokenizer’s reset(), end() and close() methods are part of the workflow prescribed by TokenStream’s javadoc, which contains additional helpful documentation.
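The snippets below assume the tokenizer was constructed roughly like this (Lucene 3.6 API, matching the WhitespaceTokenizer line later in this article; the Reader setup is an assumption for illustration):

Reader reader = new StringReader(testString);
Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);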
// Grab the attributes we care about once, up front.
CharTermAttribute charTermAttrib = tokenizer.getAttribute(CharTermAttribute.class);
TypeAttribute typeAtt = tokenizer.getAttribute(TypeAttribute.class);
OffsetAttribute offset = tokenizer.getAttribute(OffsetAttribute.class);

List<String> tokens = new ArrayList<String>();
tokenizer.reset();
// Each call to incrementToken() advances the stream and updates the attributes in place.
while (tokenizer.incrementToken()) {
  tokens.add(charTermAttrib.toString());
  System.out.println(typeAtt.toString() + " " + offset.toString() + ": " + charTermAttrib.toString());
}
tokenizer.end();
tokenizer.close();
In the above code, we get the attributes once and then loop through the tokenizer’s tokens, printing out the attributes for each token. Since these attributes come from the StandardTokenizer, which again is a TokenStream, think of them as attributes of the stream of tokens, tied to the state of the tokenizer. When we change the tokenizer’s state by calling incrementToken(), the values of the attributes change as well, without requiring us to call getAttribute again.
Note that the sentence delimiters are gone, the offsets properly skip over them, and the casing of the terms is preserved.
What if you really want those sentence delimiters to appear as tokens? There are other tokenizers we can try. Change tokenizer to a WhitespaceTokenizer to give it a shot:
Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, reader);
If you run the code now, you’ll get an IllegalArgumentException. Oops! This happens when we try to get the TypeAttribute, because WhitespaceTokenizer doesn’t supply any types in its TokenStream. If we change our getAttribute calls to addAttribute, the attributes will be added to the TokenStream if they don’t exist and returned just like getAttribute if they do.
CharTermAttribute charTermAttrib = tokenizer.addAttribute(CharTermAttribute.class);
TypeAttribute typeAtt = tokenizer.addAttribute(TypeAttribute.class);
OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
Of course, the values of any attributes that aren’t actually set by WhitespaceTokenizer will stay at their defaults. If your code later on expects real values, you may wish to use tokenizer.hasAttribute(TypeAttribute.class) to check whether that Attribute exists instead of just adding it.
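A minimal sketch of that check, assuming the same tokenizer as above:

// Only ask for the type if the underlying stream actually supplies one.
if (tokenizer.hasAttribute(TypeAttribute.class)) {
  TypeAttribute typeAtt = tokenizer.getAttribute(TypeAttribute.class);
  System.out.println("Token type: " + typeAtt.type());
} else {
  System.out.println("This tokenizer does not provide token types.");
}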
You may be able to ignore the edge cases for your application, but if you need more rigorous NLP, take a look at OpenNLP, which does sentence detection as well as tokenizing, entity extraction, and so on, based on trained maximum-entropy models (some of the provided models are also available for other languages).
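For reference, a rough sketch of tokenizing with one of OpenNLP’s pre-trained models (en-token.bin is the English tokenizer model distributed by the OpenNLP project; treat the exact setup as an assumption, not part of the original article):

// Load a pre-trained maximum-entropy tokenizer model and tokenize a sentence.
InputStream modelIn = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(modelIn);
TokenizerME openNlpTokenizer = new TokenizerME(model);
String[] nlpTokens = openNlpTokenizer.tokenize("Mr. Smith went to Washington.");
modelIn.close();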
Read full article from Tokenizing with Lucene | LucidBox
