Lucene 4.3 Advanced Development (Part 18)
Lucene is an excellent full-text search toolkit, and it ships with a number of other useful capabilities. In text mining, for example, we often need the term frequency (TF) or inverse document frequency (IDF) of words and phrases in order to weight terms and surface the most important keywords in a news article or paper, or we may want the position information of terms in the index.

First, let's look at how to get the position information of phrases after analysis. This depends mainly on the analyzer, which records positions, position increments, payloads, and so on during tokenization. Here we focus on retrieving offsets:
    public void position(String word) throws Exception {
        Analyzer analyzer = new IKAnalyzer(); // IK analyzer
        TokenStream token = analyzer.tokenStream("a", new StringReader(word));
        token.reset();
        CharTermAttribute term = token.addAttribute(CharTermAttribute.class); // term text
        OffsetAttribute offset = token.addAttribute(OffsetAttribute.class);   // character offsets
        while (token.incrementToken()) {
            System.out.println(term + "   " + offset.startOffset() + "   " + offset.endOffset());
        }
        token.end();
        token.close();
    }
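To make the offset semantics concrete, here is a toy whitespace tokenizer in plain Java — no Lucene involved, purely illustrative — that produces the same kind of start/end character offsets that OffsetAttribute reports:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    // Split on spaces and record each token's [start, end) character offsets,
    // the same convention used by startOffset()/endOffset() above.
    static List<int[]> offsets(String text) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++; // skip spaces
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++; // consume token
            if (i > start) out.add(new int[]{start, i});
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] o : offsets("hello lucene world")) {
            System.out.println(o[0] + " " + o[1]); // e.g. "hello" spans [0, 5)
        }
    }
}
```

A real analyzer may normalize, split, or drop characters, so its offsets will not always match a naive split like this; the sketch only shows what the numbers mean.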
Second, let's see how to use Lucene to get the term frequency of every phrase in a document. The data must first be indexed with term-vector storage enabled; we can then read the frequencies back from the index and, with a little post-processing, print them in descending order of frequency for a more intuitive view.
    FieldType ft = new FieldType();
    ft.setIndexed(true);                  // index the field
    ft.setStored(true);                   // store the field
    ft.setStoreTermVectors(true);         // store term vectors
    ft.setTokenized(true);
    ft.setStoreTermVectorPositions(true); // store positions
    ft.setStoreTermVectorOffsets(true);   // store offsets
    Document doc = new Document();
    doc.add(new Field("name", word, ft));
    writer.addDocument(doc);
    Directory directory = FSDirectory.open(new File("D:\\lucene测试索引\\2014311测试"));
    IndexReader reader = DirectoryReader.open(directory);
    for (int i = 0; i < reader.numDocs(); i++) {
        int docId = i;
        System.out.println("Document " + (i + 1) + ":");
        Terms terms = reader.getTermVector(docId, "name");
        if (terms == null)
            continue;
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef thisTerm = null;
        while ((thisTerm = termsEnum.next()) != null) {
            String termText = thisTerm.utf8ToString();
            DocsEnum docsEnum = termsEnum.docs(null, null);
            while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                System.out.println("termText: " + termText + "  TF: " + docsEnum.freq());
            }
        }
    }
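The loop above only prints each term's frequency as it is encountered; to output terms in descending order of frequency as described, you can collect them into a map first and then sort. A minimal sketch in plain Java — the map contents here are hypothetical stand-ins for what the term-vector loop would collect:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFreqSort {
    // Sort term -> frequency entries in descending order of frequency.
    static List<Map.Entry<String, Integer>> sortByFreq(Map<String, Integer> tf) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(tf.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // larger counts first
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 5); // hypothetical counts, as if filled by the loop above
        tf.put("search", 3);
        tf.put("index", 8);
        for (Map.Entry<String, Integer> e : sortByFreq(tf)) {
            System.out.println(e.getKey() + "  TF: " + e.getValue());
        }
    }
}
```

In the real code you would call `tf.put(termText, docsEnum.freq())` inside the term-vector loop and sort once per document.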

Finally, let's look at how to get the document frequency that IDF is computed from:
    Directory directory = FSDirectory.open(new File("D:\\lucene测试索引\\2014311测试"));
    IndexReader reader = DirectoryReader.open(directory);
    List<AtomicReaderContext> list = reader.leaves();
    for (AtomicReaderContext ar : list) {
        String field = "name";
        AtomicReader areader = ar.reader();
        Terms term = areader.terms(field);
        TermsEnum tn = term.iterator(null);
        BytesRef text;
        while ((text = tn.next()) != null) {
            System.out.println("field=" + field + "; text=" + text.utf8ToString()
                    + "   DF: " + tn.docFreq()
                    // + "   total term frequency: " + tn.totalTermFreq()
                    );
        }
    }
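Note that `tn.docFreq()` returns the document frequency (DF), not IDF itself. Lucene 4.x's DefaultSimilarity derives IDF from it as 1 + ln(numDocs / (docFreq + 1)), where numDocs comes from `reader.numDocs()`. A standalone sketch of that formula, with no Lucene dependency:

```java
public class IdfSketch {
    // Mirrors the idf() formula of Lucene 4.x DefaultSimilarity:
    //   idf = 1 + ln(numDocs / (docFreq + 1))
    // Rarer terms (smaller docFreq) get a larger IDF weight.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // A term appearing in 9 of 100 documents: 1 + ln(100 / 10) = 1 + ln(10)
        System.out.println(idf(9, 100));
    }
}
```

In practice you would plug each term's `tn.docFreq()` into this formula, or let Lucene's Similarity do it for you during scoring.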