How to Write a Lucene Index to Hadoop?

Hadoop began life as a Lucene subproject and has since taken off on its own. So how can we harness Hadoop's distributed processing to speed up Lucene index building, and thereby enjoy all the benefits of HDFS? The well-known catch is that HDFS handles random reads poorly, while a full-text search framework like Lucene depends on random access for almost every retrieval operation. So how do we get Lucene and Hadoop to work together smoothly? Older Hadoop releases actually shipped Lucene indexing utility classes in a contrib package, but few people seem to use them; I have never tried them myself, so I won't comment on them here. The two snippets below show my own approach: a factory method that opens an IndexWriter on HDFS, followed by a method that queries an index stored there.
    // Imports shared by the snippets in this post (Lucene 4.6, Hadoop, Solr 4.x HDFS support).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;
    import org.apache.solr.store.hdfs.HdfsDirectory; // the class this post says to copy into your project

    // Returns an IndexWriter whose index files live on HDFS via Solr's HdfsDirectory.
    public static IndexWriter getIndexWriter() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        Configuration conf = new Configuration();
        //Path p1 = new Path("hdfs://10.2.143.5:9090/root/myfile/my.txt");
        //Path path = new Path("hdfs://10.2.143.5:9090/root/myfile");
        Path path = new Path("hdfs://192.168.75.130:9000/root/index");
        HdfsDirectory directory = new HdfsDirectory(path, conf);
        return new IndexWriter(directory, config);
    }
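
With the writer in hand, indexing works exactly as it does against a local directory. Here is a minimal usage sketch, assuming hypothetical field names "id" and "city"; "city" is chosen to match the field the query method below searches, and is not from the original code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // Hypothetical usage sketch: indexes a single document on HDFS.
    public static void indexOneDoc() throws Exception {
        IndexWriter writer = getIndexWriter();
        try {
            Document doc = new Document();
            // "id" and "city" are illustrative field names, not from the original post.
            doc.add(new StringField("id", "1", Field.Store.YES));
            doc.add(new TextField("city", "北京", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit(); // flushes the new segment files to HDFS
        } finally {
            writer.close();
        }
    }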

    // Queries an index stored on HDFS, timing the same search twice
    // (the second run shows the effect of caching).
    public static void query(String queryTerm) throws Exception {
        System.out.println("Query term: " + queryTerm);
        Configuration conf = new Configuration();
        //Path p1 = new Path("hdfs://10.2.143.5:9090/root/myfile/my.txt");
        //Path path = new Path("hdfs://192.168.75.130:9000/root/index");
        Path path = new Path("hdfs://192.168.75.130:9000/root/output/map1");
        Directory directory = new HdfsDirectory(path, conf);
        IndexReader reader = DirectoryReader.open(directory);
        System.out.println("Total docs: " + reader.numDocs());
        long a = System.currentTimeMillis();
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parse = new QueryParser(Version.LUCENE_46, "city", new SmartChineseAnalyzer(Version.LUCENE_46));
        Query query = parse.parse(queryTerm);
        TopDocs docs = searcher.search(query, 100);
        System.out.println("Hits: " + docs.totalHits);
        long b = System.currentTimeMillis();
        System.out.println("First query took: " + (b - a) + " ms");
        System.out.println("============================================");
        long c = System.currentTimeMillis();
        query = parse.parse(queryTerm);
        docs = searcher.search(query, 100);
        System.out.println("Hits: " + docs.totalHits);
        long d = System.currentTimeMillis();
        System.out.println("Second query took: " + (d - c) + " ms");
        reader.close();
        directory.close();
    }
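
A tiny driver ties the two methods together; the query term here is just an assumed example:

    public static void main(String[] args) throws Exception {
        // "北京" is an illustrative term; it searches the "city" field above.
        query("北京");
    }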

Solr 4.4 and later already bundle the jars for writing an index to HDFS, so if you are running Solr it is easy to build the index on HDFS: just configure the Directory implementation in solrconfig.xml to be HdfsDirectory. However, the Solr 4.4 jars only support the newer Hadoop line, i.e. 2.x and later; using them directly on a 1.x Hadoop cluster throws exceptions, because the Hadoop API changed between 1.x and 2.x. After modifying part of the source, I was able to index and query against Hadoop 1.x as well. At the end of this article I will upload those classes; to use them, just import them into your project.
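
For reference, the solrconfig.xml change is roughly the following sketch. The class name solr.HdfsDirectoryFactory and the hdfs lock type come from Solr's HDFS support; the HDFS path and confdir values below are placeholders, not values from this post:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://192.168.75.130:9000/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>

    <!-- HDFS does not support native file locks -->
    <lockType>${solr.lock.type:hdfs}</lockType>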


The code above is my test example. In my tests, adds, deletes, updates, and queries against a Lucene index on HDFS all work. One caveat: combining Lucene with Hadoop really can speed up index building dramatically, but it offers no advantage at all for retrieval. Searching still works, but it is slow. The current storage implementation relies on a block cache, which keeps retrieval performance barely acceptable; once the data volume grows large, query performance becomes very poor, and so far there is no good solution for this, short of someday adding an HBase-like data structure to Lucene or Solr, which might make retrieval considerably better.
