[Lucene] Creating a Custom Stemming Analyzer with Lucene



import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class PorterStemStopWordAnalyzer extends Analyzer {

    // Custom stop words
    private static final String[] stopWords = {"and", "of", "the", "to", "is", "their", "can", "all"};

    public PorterStemStopWordAnalyzer() {
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Create the tokenizer
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_46, reader);

        // Chain the token filters: lowercase -> synonyms -> stop words -> Porter stemming
        TokenFilter lowerCaseFilter = new LowerCaseFilter(Version.LUCENE_46, tokenizer);
        TokenFilter synonymFilter = new SynonymFilter(lowerCaseFilter, getSynonymMap(), true);
        TokenFilter stopFilter = new StopFilter(Version.LUCENE_46, synonymFilter, buildCharArraySetFromArray(stopWords));
        TokenFilter stemFilter = new PorterStemFilter(stopFilter);

        // TokenStreamComponents wraps the TokenStream; in Lucene 2.2 this was a plain TokenStream
        return new TokenStreamComponents(tokenizer, stemFilter);
    }

    // Convert the array into a CharArraySet, Lucene's counterpart of java.util.Set
    private CharArraySet buildCharArraySetFromArray(String[] array) {
        CharArraySet set = new CharArraySet(Version.LUCENE_46, array.length, true);
        for (String value : array) {
            set.add(value);
        }
        return set;
    }

    // Build the synonym table
    private SynonymMap getSynonymMap() {
        String base1 = "fast";
        String syn1 = "rapid";

        String base2 = "slow";
        String syn2 = "sluggish";

        SynonymMap.Builder sb = new SynonymMap.Builder(true);
        // The last argument (includeOrig = true) keeps the original term next to its synonym
        sb.add(new CharsRef(base1), new CharsRef(syn1), true);
        sb.add(new CharsRef(base2), new CharsRef(syn2), true);
        try {
            return sb.build();
        } catch (IOException e) {
            // Fail fast instead of handing a null map to SynonymFilter
            throw new IllegalStateException("Failed to build the synonym map", e);
        }
    }

    // Test method
    public static void testPorterStemmingAnalyzer() throws IOException {
        Analyzer analyzer = new PorterStemStopWordAnalyzer();
        String text = "Collective intelligence and Web2.0, fast and rapid";
        Reader reader = new StringReader(text);
        TokenStream ts = analyzer.tokenStream(null, reader);
        try {
            // Fetch the attribute once; its value is updated on every incrementToken()
            CharTermAttribute ta = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // must be called before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(ta.toString());
            }
            ts.end();
        } finally {
            ts.close();
        }
    }

    public static void main(String[] args) throws IOException {
        testPorterStemmingAnalyzer();
    }
}

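(1) A TokenStream must be reset() once after it is obtained and before the first call to incrementToken(); otherwise an exception is thrown.
(2) A common way to read the tokens out of a TokenStream is through CharTermAttribute.
Besides CharTermAttribute there are other Attributes, for example FlagsAttribute ...
(3) For the libraries used here, see the previous post: http://rangerwolf.iteye.com/admin/blogs/2011535
(4) createComponents uses a synonym filter; the SynonymMap passed to its constructor is built by the getSynonymMap method.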
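
To make the order of the filter chain in createComponents easier to see, here is a minimal sketch in plain Java with no Lucene dependency. It only mirrors the sequence lowercase, synonyms, stop words, stemming. The class FilterChainSketch, the method analyze, and the helper naiveStem are hypothetical names made up for this illustration. The whitespace/punctuation split is not StandardTokenizer, and naiveStem only strips a trailing "s", so it is not the Porter algorithm. Unlike SynonymFilter, which places a synonym at the same position as the original term, this sketch simply appends it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the analyzer's filter chain; none of this is Lucene API.
class FilterChainSketch {

    // Same stop words as in the analyzer above
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "and", "of", "the", "to", "is", "their", "can", "all"));

    // Same synonym pairs as in getSynonymMap()
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("fast", "rapid");
        SYNONYMS.put("slow", "sluggish");
    }

    // Hypothetical placeholder for a stemmer: drops one trailing "s". NOT Porter stemming.
    static String naiveStem(String term) {
        boolean plural = term.length() > 3 && term.endsWith("s") && !term.endsWith("ss");
        return plural ? term.substring(0, term.length() - 1) : term;
    }

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        // Crude tokenizer (assumption): split on anything that is not a letter or digit
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // 1. Lowercase
            String term = raw.toLowerCase(Locale.ROOT);

            // 2. Synonyms: keep the original term and add its synonym, if any
            List<String> expanded = new ArrayList<>();
            expanded.add(term);
            String synonym = SYNONYMS.get(term);
            if (synonym != null) {
                expanded.add(synonym);
            }

            for (String candidate : expanded) {
                // 3. Stop words are dropped before stemming, so they are matched unstemmed
                if (STOP_WORDS.contains(candidate)) {
                    continue;
                }
                // 4. Stemming comes last
                out.add(naiveStem(candidate));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [fast, rapid, car, slow, sluggish, train]
        System.out.println(analyze("Fast cars and the slow trains"));
    }
}
```

Because the stop filter runs after the synonym step, a synonym that happens to be a stop word would also be removed. Because stemming runs last, the stop-word list has to contain the unstemmed forms.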

Another custom analyzer example: http://blog.csdn.net/wildcatlele/article/details/7526586