如何将Lucene索引写入Hadoop2.x?



如何将Lucene索引写入Hadoop2.x?
写入2.x的Hadoop相对1.x的Hadoop来说要简单的说了,因为默认solr(4.4之后的版本)里面自带的HDFSDirectory就是支持2.x的而不支持1.x的,使用2.x的Hadoop平台,可以直接把solr的corejar包拷贝到工程里面,即可使用建索引

散仙,是在eclipse上使用eclipse插件来运行hadoop程序,具体要用到的jar包,除了需要用到hadoop2.2的所有jar包外,还需增加lucene和solr的部分jar包.

检索数据时,第一次检索往往比较慢,第一次之后因为有了Block Cache,所以第二次,检索的速度非常快,当然这也跟你机器的配置有关系.

为什么要使用Hadoop建索引? 使用Hadoop建索引可以利用MapReduce分布式计算能力从而大大提升建索引的速度,这一点优势很明显,但美中不足的是在Hadoop上做检索,性能却不怎么好,虽然有了块缓存,但是如果索引被按64M的块被切分到不同的节点上,那么检索的时候,就需要跨机器从各个块上扫描,拉取命中数据,这一点是很耗时的,目前,据散仙所知,还没有比较好的部署在Hadoop上的分布式检索方案,但毫无疑问的是建索引的能力,确实很给力,后面散仙会写如何使用MapReduce来并行构建Lucene索引
Please read full article from 如何将Lucene索引写入Hadoop2.x?

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts