Lucene (v4.5) In-Depth Study (Part 9): Analyzers



Since I will very likely end up extending my own tokenizer, and since tokenization is part of the indexing process, this post gives a brief analysis of Lucene's analyzers.

1 TokenStream

An analyzer operates on a TokenStream. The process of obtaining a TokenStream is shown in the figure below:
[Figure: obtaining the TokenStream]
public final TokenStream tokenStream(final String fieldName, final String text) throws IOException {
  // Ask the ReuseStrategy whether components for this field were cached earlier
  TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
  @SuppressWarnings("resource") final ReusableStringReader strReader =
      (components == null || components.reusableStringReader == null) ?
      new ReusableStringReader() : components.reusableStringReader;
  strReader.setValue(text);
  final Reader r = initReader(fieldName, strReader);
  if (components == null) {
    // First use for this field: build the chain and hand it to the ReuseStrategy
    components = createComponents(fieldName, r);
    reuseStrategy.setReusableComponents(this, fieldName, components);
  } else {
    // Otherwise just point the cached chain at the new reader
    components.setReader(r);
  }
  components.reusableStringReader = strReader;
  return components.getTokenStream();
}

2 ReuseStrategy

In tokenStream above, a ReuseStrategy is consulted. The ReuseStrategy decides how TokenStreamComponents are reused. Analyzer ships with two default ReuseStrategy implementations, GLOBAL_REUSE_STRATEGY and PER_FIELD_REUSE_STRATEGY. GLOBAL_REUSE_STRATEGY reuses the same TokenStreamComponents for all fields, while PER_FIELD_REUSE_STRATEGY keeps a separate TokenStreamComponents per field; for PER_FIELD_REUSE_STRATEGY this requires a Map associating each fieldName with its TokenStreamComponents.
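
A minimal sketch of choosing the strategy (the class name PerFieldReuseAnalyzer and the whitespace tokenizer are illustrative, not from the original post): an Analyzer subclass selects its ReuseStrategy by passing it to the superclass constructor.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer that opts into per-field reuse: each field name
// gets its own cached TokenStreamComponents instead of one shared set
// (GLOBAL_REUSE_STRATEGY, the default of the no-arg Analyzer constructor).
public class PerFieldReuseAnalyzer extends Analyzer {
  public PerFieldReuseAnalyzer() {
    super(Analyzer.PER_FIELD_REUSE_STRATEGY);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_45, reader);
    return new TokenStreamComponents(source);
  }
}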

3 TokenStreamComponents

TokenStreamComponents holds the Tokenizer the analyzer needs. The objects of the required classes are created inside the createComponents method; the Tokenizer chain described next is also built through this method.

4 Tokenizer chain

This is the chain that turns the input text into Tokens. It follows the Decorator pattern: a series of classes share the same incrementToken() method inherited from TokenStream, and by nesting them layer on layer a complex pipeline is assembled. The code below shows how StandardAnalyzer builds this chain.

protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
  // The StandardTokenizer is the source of the chain
  final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
  src.setMaxTokenLength(maxTokenLength);
  // Each TokenFilter decorates the stream produced by the previous stage
  TokenStream tok = new StandardFilter(matchVersion, src);
  tok = new LowerCaseFilter(matchVersion, tok);
  tok = new StopFilter(matchVersion, tok, stopwords);
  return new TokenStreamComponents(src, tok) {
    @Override
    protected void setReader(final Reader reader) throws IOException {
      // Re-apply the analyzer's current max token length on every reuse
      src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
      super.setReader(reader);
    }
  };
}
To build your own Tokenizer chain, follow the same pattern, as sketched below. Other Tokenizers and TokenFilters shipped with Lucene are listed at http://blog.csdn.net/liweisnake/article/details/12568209.
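
A minimal sketch of such a custom chain (the class name and the particular tokenizer/filter choices are illustrative, not from the original post):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: split on whitespace, lowercase, drop English stop words.
public class SimpleChainAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // The Tokenizer is the source of the chain
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_45, reader);
    // TokenFilters decorate the stream, each wrapping the previous stage
    TokenStream chain = new LowerCaseFilter(Version.LUCENE_45, source);
    chain = new StopFilter(Version.LUCENE_45, chain, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new TokenStreamComponents(source, chain);
  }
}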

5 AttributeSource

TokenStream extends AttributeSource, and Tokenizer and TokenFilter in turn extend TokenStream. AttributeSource has two important member variables, attributes and attributeImpls: attributes is keyed by the attribute interfaces, while attributeImpls is keyed by the classes implementing those interfaces. Together they record and manipulate the low-level data produced during tokenization.
Some of these token attributes are shown below:
[Figure: some token attributes]
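
A minimal sketch of how a caller consumes these attributes (the field name and sample text are made up): register the attribute interfaces you care about with addAttribute, then advance the stream with incrementToken and read the current token's data from the attribute instances, which are updated in place.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class PrintTokens {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox");
    // addAttribute returns the AttributeImpl registered for each interface
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
    stream.reset();                   // required before the first incrementToken()
    while (stream.incrementToken()) { // each call fills the attributes with the next token
      System.out.println(term + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
    }
    stream.end();
    stream.close();
    analyzer.close();
  }
}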