How to Write a Lucene Index to Hadoop?

Hadoop began life as a Lucene subproject and has since taken off on its own. So how can we harness Hadoop's distributed processing to speed up Lucene index building, and thereby enjoy all the benefits of HDFS? The well-known catch is that HDFS handles random reads poorly, while a full-text search framework like Lucene depends on random access for almost every retrieval operation. So how do we get Lucene and Hadoop to work together smoothly? Older Hadoop releases actually shipped Lucene indexing utility classes in a contrib package, but few people seem to use them; I have never tried them myself, so I won't comment on them here. The two snippets below show my own approach: a factory method that opens an IndexWriter on HDFS, followed by a method that queries an index stored there.
    // Imports shared by the snippets in this post (Lucene 4.6, Hadoop, Solr 4.x HDFS support).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;
    import org.apache.solr.store.hdfs.HdfsDirectory; // the class this post says to copy into your project

    // Returns an IndexWriter whose index files live on HDFS via Solr's HdfsDirectory.
    public static IndexWriter getIndexWriter() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        Configuration conf = new Configuration();
        //Path p1 = new Path("hdfs://10.2.143.5:9090/root/myfile/my.txt");
        //Path path = new Path("hdfs://10.2.143.5:9090/root/myfile");
        Path path = new Path("hdfs://192.168.75.130:9000/root/index");
        HdfsDirectory directory = new HdfsDirectory(path, conf);
        return new IndexWriter(directory, config);
    }
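
With the writer in hand, indexing works exactly as it does against a local directory. Here is a minimal usage sketch, assuming hypothetical field names "id" and "city"; "city" is chosen to match the field the query method below searches, and is not from the original code:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // Hypothetical usage sketch: indexes a single document on HDFS.
    public static void indexOneDoc() throws Exception {
        IndexWriter writer = getIndexWriter();
        try {
            Document doc = new Document();
            // "id" and "city" are illustrative field names, not from the original post.
            doc.add(new StringField("id", "1", Field.Store.YES));
            doc.add(new TextField("city", "北京", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit(); // flushes the new segment files to HDFS
        } finally {
            writer.close();
        }
    }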

    // Queries an index stored on HDFS, timing the same search twice
    // (the second run shows the effect of caching).
    public static void query(String queryTerm) throws Exception {
        System.out.println("Query term: " + queryTerm);
        Configuration conf = new Configuration();
        //Path p1 = new Path("hdfs://10.2.143.5:9090/root/myfile/my.txt");
        //Path path = new Path("hdfs://192.168.75.130:9000/root/index");
        Path path = new Path("hdfs://192.168.75.130:9000/root/output/map1");
        Directory directory = new HdfsDirectory(path, conf);
        IndexReader reader = DirectoryReader.open(directory);
        System.out.println("Total docs: " + reader.numDocs());
        long a = System.currentTimeMillis();
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parse = new QueryParser(Version.LUCENE_46, "city", new SmartChineseAnalyzer(Version.LUCENE_46));
        Query query = parse.parse(queryTerm);
        TopDocs docs = searcher.search(query, 100);
        System.out.println("Hits: " + docs.totalHits);
        long b = System.currentTimeMillis();
        System.out.println("First query took: " + (b - a) + " ms");
        System.out.println("============================================");
        long c = System.currentTimeMillis();
        query = parse.parse(queryTerm);
        docs = searcher.search(query, 100);
        System.out.println("Hits: " + docs.totalHits);
        long d = System.currentTimeMillis();
        System.out.println("Second query took: " + (d - c) + " ms");
        reader.close();
        directory.close();
    }
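
A tiny driver ties the two methods together; the query term here is just an assumed example:

    public static void main(String[] args) throws Exception {
        // "北京" is an illustrative term; it searches the "city" field above.
        query("北京");
    }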

Solr 4.4 and later already bundle the jars for writing an index to HDFS, so if you are running Solr it is easy to build the index on HDFS: just configure the Directory implementation in solrconfig.xml to be HdfsDirectory. However, the Solr 4.4 jars only support the newer Hadoop line, i.e. 2.x and later; using them directly on a 1.x Hadoop cluster throws exceptions, because the Hadoop API changed between 1.x and 2.x. After modifying part of the source, I was able to index and query against Hadoop 1.x as well. At the end of this article I will upload those classes; to use them, just import them into your project.
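
For reference, the solrconfig.xml change is roughly the following sketch. The class name solr.HdfsDirectoryFactory and the hdfs lock type come from Solr's HDFS support; the HDFS path and confdir values below are placeholders, not values from this post:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://192.168.75.130:9000/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>

    <!-- HDFS does not support native file locks -->
    <lockType>${solr.lock.type:hdfs}</lockType>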


The code above is my test example. In my tests, adds, deletes, updates, and queries against a Lucene index on HDFS all work. One caveat: combining Lucene with Hadoop really can speed up index building dramatically, but it offers no advantage at all for retrieval. Searching still works, but it is slow. The current storage implementation relies on a block cache, which keeps retrieval performance barely acceptable; once the data volume grows large, query performance becomes very poor, and so far there is no good solution for this, short of someday adding an HBase-like data structure to Lucene or Solr, which might make retrieval considerably better.
