All About Programming: Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene

Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene | My Blog by Philippe Adjiman

If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.

All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs.

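import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class NGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Minimum and maximum shingle size are both 2 (bi-grams); tokens are joined with a space.
        return new StopFilter(new LowerCaseFilter(new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
                StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}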

“Shingle” is just another name for a token n-gram. Shingles are popular as basic units for solving problems such as spell checking, near-duplicate detection and others.
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other “disturbers”.

Note that text such as “bi-gram” is treated as two different tokens, a desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization.
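To make the idea of a shingle concrete without depending on any particular Lucene version, here is a minimal plain-Java sketch of token bi-grams. It is not Lucene's implementation: the class name ShingleSketch and its methods are hypothetical, and the regex split on non-alphanumeric characters is only an assumption that roughly imitates how the tokenizer breaks “bi-gram” into two tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Lucene).
final class ShingleSketch {

    // Rough stand-in for a tokenizer: lowercase the text, then split on any
    // run of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Joins every window of n consecutive tokens with a single space.
    static List<String> shingles(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> tokens = tokenize(text);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [a bi, bi gram, gram analyzer]
        System.out.println(shingles("A bi-gram analyzer", 2));
    }
}
```

Unlike the analyzer above, this sketch does no stop-word filtering; it only shows how the hyphen splits a word into two tokens and how consecutive tokens are paired.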
Read full article from Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene | My Blog by Philippe Adjiman
