Lucene 4.3 Development, an Interlude: The Stars Shift



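Paging in Lucene generally takes one of two forms, which I summarize in the table below. (If anything here is inaccurate, corrections are welcome!)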


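No. | Method | Advantages | Disadvantages
1 | Page within ScoreDocs | No need to query the index again, so it is very fast | Runs out of memory on massive data sets
2 | Use searchAfter and query again for each page | Suitable for paging through large volumes of data | Queries again for each page, so it is somewhat slower, though caching can make up for this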


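From the table above we can see that paging within ScoreDocs suits scenarios where the data volume is not very large, while searchAfter suits both small and large volumes. So we should choose the paging method that fits our own business requirements.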
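To make the first method concrete, here is a minimal sketch of the slicing arithmetic behind paging over hits that are already in memory. It is not Lucene code: the `int[]` array stands in for the `scoreDocs` array returned by a single search, and the class and method names (`InMemoryPager`, `page`) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating paging over hits already held in memory. */
final class InMemoryPager {

    /**
     * Returns the slice of allHits for a 1-based page number.
     * allHits stands in for the scoreDocs array collected by one search.
     */
    static int[] page(int[] allHits, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        // Use long arithmetic so a large page number cannot overflow int.
        long from = (long) (pageNumber - 1) * pageSize;
        if (from >= allHits.length) {
            return new int[0]; // past the last page
        }
        int to = (int) Math.min(from + pageSize, allHits.length);
        return Arrays.copyOfRange(allHits, (int) from, to);
    }
}
```

With seven hits and a page size of 3, `InMemoryPager.page(hits, 2, 3)` returns the fourth through sixth hits. No second query is needed, which is why this method is fast, but every hit has to stay in memory, which is why it breaks down on very large result sets.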

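Now that we know the pros and cons of these two paging techniques, let us return to the earlier task of reading 200 million records and writing them to a txt file. Paging within ScoreDocs does not suit this scenario, although you can try it if you have enough memory. Reading in batches through paging improves write efficiency and is much faster than reading records one at a time. Although ScoreDocs paging does not fit this requirement, I give paging code below as an example. Note that the loop itself pages with searchAfter, using the last ScoreDoc of each page as the starting point for the next: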

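    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Method body; directory, indexReadPath and page(...) are not shown in this excerpt.
    try {
        directory = FSDirectory.open(new File(indexReadPath)); // open the index directory
        IndexReader reader = DirectoryReader.open(directory);  // open a reader on the index
        IndexSearcher search = new IndexSearcher(reader);      // create the searcher

        int pageStart = 0;          // number of hits processed so far
        ScoreDoc lastBottom = null; // last hit of the previous page; null requests the first page
        // The loop stops once pageStart reaches the limit. With a limit of 10 and a page size
        // of 30 it runs only once; the limit must exceed the page size to fetch a second page.
        while (pageStart < 10) {
            // Fetch the next 30 hits after lastBottom (the first 30 on the first pass)
            TopDocs paged = search.searchAfter(lastBottom, new MatchAllDocsQuery(), null, 30);
            if (paged.scoreDocs.length == 0) {
                break; // no more hits, so stop
            }
            page(search, paged); // hand this page to the method that processes the data

            pageStart += paged.scoreDocs.length; // advance by the number of hits just processed
            lastBottom = paged.scoreDocs[paged.scoreDocs.length - 1]; // last hit becomes the cursor for the next page
        }
        reader.close();    // release the reader
        directory.close(); // release the directory
    } catch (Exception e) {
        e.printStackTrace();
    }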
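The point about batched reads improving write efficiency can be sketched as follows. This is only an illustration under stated assumptions: the `page(search, paged)` method from the code above is not shown in this excerpt, so the hypothetical `BatchTextExporter.export` below takes records that have already been extracted as strings, and an in-memory list stands in for the pages a searchAfter loop would deliver.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: writes records to a text file one page at a time. */
final class BatchTextExporter {

    /**
     * Writes all records to out, pageSize records per batch, and returns
     * the number of batches written. The list stands in for the pages that
     * a paging loop would hand over for processing.
     */
    static int export(List<String> records, int pageSize, Path out) throws IOException {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int batches = 0;
        // One buffered writer for the whole export, so individual records
        // do not each trigger their own file open and close.
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (int from = 0; from < records.size(); from += pageSize) {
                int to = Math.min(from + pageSize, records.size());
                for (String record : records.subList(from, to)) {
                    writer.write(record);
                    writer.newLine();
                }
                batches++;
            }
        }
        return batches;
    }
}
```

In a real export the inner loop would read stored fields from each hit of the current page rather than from a list, but the shape is the same: fetch a page, write it through the shared buffered writer, then move on to the next page.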
