To get only the shingles, without the single-word tokens, you need to set outputUnigrams to false on the ShingleFilter, with something like:
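// Tokenize, then normalize and lowercase the tokens.
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);
tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
// Build 3-word shingles only (min = max = 3) and suppress the single-token unigrams.
ShingleFilter shingles = new ShingleFilter(tokenStream, 3, 3);
shingles.setOutputUnigrams(false);
// Note: each token here is already a 3-word shingle, so single stop words would only match if StopFilter ran before ShingleFilter.
TokenStream sf = new StopFilter(Version.LUCENE_43, shingles, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
return new Analyzer.TokenStreamComponents(source, sf);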
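To see what output to expect from that configuration, here is a small plain-Java sketch of fixed-size shingling without unigrams. It does not use Lucene and is not how ShingleFilter is implemented; the class ShingleSketch and the method shingles are hypothetical names used only for illustration. It assumes the tokens are joined with a single space, which is ShingleFilter's default token separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Hypothetical helper, not part of Lucene: joins every run of `size`
    // consecutive tokens with a single space and returns only those runs,
    // mimicking min = max = size with unigram output switched off.
    public static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        // prints [please divide this, divide this sentence]
        System.out.println(shingles(tokens, 3));
    }
}
```

With unigrams left on, the individual words would appear in the stream alongside the two trigrams; with them off, an input shorter than three tokens yields nothing at all in this sketch.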
Read full article from Lucene - Java Users - ShingleFilter