Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene | My Blog by Philippe Adjiman
If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.
All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs.
import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class NGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer -> 2-word shingles joined by a space -> lowercase -> drop English stop words
        return new StopFilter(
                new LowerCaseFilter(
                        new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
                StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}

"Shingle" is just another name for a token n-gram; shingles are commonly used as the basic units for solving problems such as spell checking, near-duplicate detection and others.
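To see the n-grams that come out of this analyzer, you can feed it a sample string and iterate over the resulting token stream. Here is a minimal sketch (not from the original article), assuming a Lucene 2.9-era attribute API; the field name "text", the class name NGramAnalyzerDemo and the sample sentence are arbitrary choices of mine.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        NGramAnalyzer analyzer = new NGramAnalyzer();
        // "bi-gram" is included on purpose to show how the hyphen is handled.
        TokenStream stream = analyzer.tokenStream("text",
                new StringReader("A bi-gram analyzer built with Lucene"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term()); // one lowercased 2-gram per line, e.g. "bi gram"
        }
        stream.close();
    }
}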
Note the use of a StandardTokenizer to deal with basic special characters like hyphens and other "disturbers".
Note that the text "bi-gram" was treated as two different tokens, a desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization.
Read full article from Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene | My Blog by Philippe Adjiman