[Lucene] Creating a Custom Stemming Analyzer with Lucene
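- import java.io.IOException;
- import java.io.Reader;
- import java.io.StringReader;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.TokenFilter;
- import org.apache.lucene.analysis.TokenStream;
- import org.apache.lucene.analysis.Tokenizer;
- import org.apache.lucene.analysis.core.LowerCaseFilter;
- import org.apache.lucene.analysis.core.StopFilter;
- import org.apache.lucene.analysis.en.PorterStemFilter;
- import org.apache.lucene.analysis.standard.StandardTokenizer;
- import org.apache.lucene.analysis.synonym.SynonymFilter;
- import org.apache.lucene.analysis.synonym.SynonymMap;
- import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
- import org.apache.lucene.analysis.util.CharArraySet;
- import org.apache.lucene.util.CharsRef;
- import org.apache.lucene.util.Version;
- public class PorterStemStopWordAnalyzer extends Analyzer {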
- // Custom stop words
- private static final String[] stopWords = {"and", "of", "the", "to", "is", "their", "can", "all"};
- public PorterStemStopWordAnalyzer() {
- }
- @Override
- protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
- // Create the tokenizer
- Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_46, reader);
- // Build the chain of token filters: lowercase -> synonyms -> stop words -> Porter stemming
- TokenFilter lowerCaseFilter = new LowerCaseFilter(Version.LUCENE_46, tokenizer);
- TokenFilter synonymFilter = new SynonymFilter(lowerCaseFilter, getSynonymMap(), true);
- TokenFilter stopFilter = new StopFilter(Version.LUCENE_46, synonymFilter, buildCharArraySetFromArry(stopWords));
- TokenFilter stemFilter = new PorterStemFilter(stopFilter);
- // TokenStreamComponents wraps the TokenStream; in Lucene 2.2 this method returned a TokenStream directly
- return new TokenStreamComponents(tokenizer, stemFilter);
- }
- // Convert the array into a CharArraySet, the set type Lucene expects (similar to java.util.Set)
- private CharArraySet buildCharArraySetFromArry(String[] array) {
- CharArraySet set = new CharArraySet(Version.LUCENE_46, array.length, true);
- for(String value : array) {
- set.add(value);
- }
- return set;
- }
- // Build the synonym map
- private SynonymMap getSynonymMap() {
- String base1 = "fast";
- String syn1 = "rapid";
- String base2 = "slow";
- String syn2 = "sluggish";
- SynonymMap.Builder sb = new SynonymMap.Builder(true);
- sb.add(new CharsRef(base1), new CharsRef(syn1), true);
- sb.add(new CharsRef(base2), new CharsRef(syn2), true);
- try {
- return sb.build();
- } catch (IOException e) {
- // Fail fast instead of returning null, which SynonymFilter cannot work with
- throw new IllegalStateException("Failed to build the synonym map", e);
- }
- }
- // Test method
- public static void testPorterStemmingAnalyzer() throws IOException {
- Analyzer analyzer = new PorterStemStopWordAnalyzer();
- String text = "Collective intelligence and Web2.0, fast and rapid";
- Reader reader = new StringReader(text);
- TokenStream ts = analyzer.tokenStream(null, reader);
- try {
- // Look up the term attribute once; its value is updated in place by each incrementToken() call
- CharTermAttribute ta = ts.addAttribute(CharTermAttribute.class);
- ts.reset();
- while (ts.incrementToken()) {
- System.out.println(ta.toString());
- }
- // Signal end of stream, then release resources
- ts.end();
- } finally {
- ts.close();
- }
- }
- public static void main(String[] args) throws IOException {
- testPorterStemmingAnalyzer();
- }
- }
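(2) A common way to read the individual tokens out of a TokenStream is through CharTermAttribute.
Besides CharTermAttribute there are other Attributes, for example FlagsAttribute ...
(3) For the libraries used here, see the previous post: http://rangerwolf.iteye.com/admin/blogs/2011535
(4) createComponents uses a synonym filter; the SynonymMap passed to that filter's constructor is built by the getSynonymMap method.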
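To make the order of the filters in createComponents easier to follow, here is a rough plain-Java model of the first three steps (lowercasing, synonym expansion, stop-word removal). It is only a sketch and uses no Lucene classes: the class name FilterChainSketch and the method analyze are made up for illustration, splitting on whitespace and commas is an assumed stand-in for StandardTokenizer, and the Porter stemming step is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the filter order used above; none of this is Lucene API.
final class FilterChainSketch {
    // Same stop words as the analyzer in this post
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "and", "of", "the", "to", "is", "their", "can", "all"));

    // One-way mappings, mirroring getSynonymMap(): fast -> rapid, slow -> sluggish
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("fast", "rapid");
        SYNONYMS.put("slow", "sluggish");
    }

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        // Assumption: splitting on whitespace and commas stands in for StandardTokenizer
        for (String raw : text.split("[\\s,]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercasing step
            String term = raw.toLowerCase(Locale.ROOT);
            // Synonym step: the original term is kept (includeOrig = true) and the synonym is added
            List<String> expanded = new ArrayList<>();
            expanded.add(term);
            String synonym = SYNONYMS.get(term);
            if (synonym != null) {
                expanded.add(synonym);
            }
            // Stop-word step: drop any term found in the stop list
            for (String candidate : expanded) {
                if (!STOP_WORDS.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [collective, intelligence, web2.0, fast, rapid, rapid]
        System.out.println(analyze("Collective intelligence and Web2.0, fast and rapid"));
    }
}
```

In this model "rapid" shows up twice: once as the synonym added for "fast" and once from the input itself. The mapping is one-way, so "rapid" does not bring in "fast". The real analyzer then passes these terms through PorterStemFilter, so the terms it actually emits are stemmed forms rather than the strings shown here.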
Another Custom Analyzer http://blog.csdn.net/wildcatlele/article/details/7526586