Lucene 4.7 Analyzers (Part 3): Special-Purpose Analyzers




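import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;

// Analyzer that splits text on a caller-supplied regular expression.
public class PatternAnalyzer extends Analyzer {

    private final String regex; // regular expression used to split the text

    public PatternAnalyzer(String regex) {
        this.regex = regex;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // group -1 makes PatternTokenizer split on the pattern, like String.split
        return new TokenStreamComponents(new PatternTokenizer(reader, Pattern.compile(regex), -1));
    }
}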
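
To make the splitting behaviour of PatternAnalyzer easier to picture, here is a minimal plain-JDK sketch that approximates it without Lucene. It assumes that a group of -1 behaves like a regex split with empty pieces dropped. The class name PatternSplitSketch and the helper split are hypothetical and are not part of Lucene or the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: approximates split-mode tokenization with the JDK only.
class PatternSplitSketch {

    // Split text on the regex and keep only non-empty pieces.
    static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String part : Pattern.compile(regex).split(text)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Commas and semicolons act as delimiters.
        System.out.println(split("a,b;;c", "[,;]")); // prints [a, b, c]
    }
}
```

With the real analyzer, the same delimiter expression would be passed to the constructor, for example new PatternAnalyzer("[,;]").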

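import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Tokenizer that emits every digit, cased letter, and OTHER_LETTER character (e.g. CJK) as its own token.
public class China extends Tokenizer {

        public China(Reader in) {
          super(in);
        }

        public China(AttributeFactory factory, Reader in) {
          super(factory, in);
        }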
            
        private int offset = 0, bufferIndex=0, dataLen=0;
        private final static int MAX_WORD_LEN = 255;
        private final static int IO_BUFFER_SIZE = 1024;
        private final char[] buffer = new char[MAX_WORD_LEN];
        private final char[] ioBuffer = new char[IO_BUFFER_SIZE];
 
 
        private int length;
        private int start;
 
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
         
        private final void push(char c) {
 
            if (length == 0) start = offset-1;            // start of token
            buffer[length++] = Character.toLowerCase(c);  // buffer it
 
        }
 
        private final boolean flush() {
 
            if (length>0) {
                //System.out.println(new String(buffer, 0,
                //length));
              termAtt.copyBuffer(buffer, 0, length);
              offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));
              return true;
            }
            else
                return false;
        }
 
        @Override
        public boolean incrementToken() throws IOException {
            clearAttributes();
 
            length = 0;
            start = offset;
 
 
            while (true) {
 
                final char c;
                offset++;
 
                if (bufferIndex >= dataLen) {
                    dataLen = input.read(ioBuffer);
                    bufferIndex = 0;
                }
 
                if (dataLen == -1) {
                  offset--;
                  return flush();
                } else
                    c = ioBuffer[bufferIndex++];
 
 
                switch(Character.getType(c)) {
 
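                case Character.DECIMAL_DIGIT_NUMBER: // Note: digits and letters are not filtered out here;
                case Character.LOWERCASE_LETTER:     // these cases fall through to OTHER_LETTER,
                case Character.UPPERCASE_LETTER:     // so each digit or letter becomes its own token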
//                    push(c);
//                    if (length == MAX_WORD_LEN) return flush();
//                    break;
              
                case Character.OTHER_LETTER:
                    if (length>0) {
                        bufferIndex--;
                        offset--;
                        return flush();
                    }
                    push(c);
                    return flush();
 
                default:
                    if (length > 0) return flush();
                    break;

                }
            }
        }
         
        @Override
        public final void end() {
          // set final offset
          final int finalOffset = correctOffset(offset);
          this.offsetAtt.setOffset(finalOffset, finalOffset);
        }
 
        @Override
        public void reset() throws IOException {
          super.reset();
          offset = bufferIndex = dataLen = 0;
        }
 
}
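
The effect of the switch statement in China can be sketched without Lucene. The following is a minimal illustration, assuming the fall-through behaviour described in the code above: every decimal digit, upper- or lower-case letter, and OTHER_LETTER character (which includes CJK ideographs) becomes a single lower-cased token, and every other character acts as a separator. The class name SingleCharSplitSketch and the method tokenize are hypothetical, and the sketch ignores offsets and buffering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mirrors the character classification used by the tokenizer.
class SingleCharSplitSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (Character.getType(c)) {
                case Character.DECIMAL_DIGIT_NUMBER:
                case Character.LOWERCASE_LETTER:
                case Character.UPPERCASE_LETTER:
                case Character.OTHER_LETTER:
                    // Each matching character is emitted alone, lower-cased.
                    tokens.add(String.valueOf(Character.toLowerCase(c)));
                    break;
                default:
                    // Punctuation, whitespace, and so on only separate tokens.
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Ab1, 分词")); // prints [a, b, 1, 分, 词]
    }
}
```

Re-enabling the commented-out lines in the tokenizer would instead accumulate runs of digits and letters into multi-character tokens, while still emitting each CJK character on its own.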
