The dark sides of Lucene - Ayende @ Rahien



Probably the part that caused us the most pain with Lucene was the fact that it isn’t transactional. That is, it is quite easy to get into situations where the indexes are corrupted. That makes it… challenging to use in a database that needs to ensure consistency. The problem is that this is really not a use case that Lucene is well suited for. In order to ensure that data is saved, we have to commit often. But in order to ensure good performance, we want to commit less often, and then we will lose the changes if we crash. For that matter, Lucene doesn’t make any attempt to actually flush the data properly, relying on the OS to do that, so a system crash can cause you to lose data even though you “committed” it.
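To make the difference between "written" and "flushed" concrete, here is a minimal sketch using only the Java standard library. It is not Lucene code; the class and method names (DurableAppend, appendDurably, writeAndReadBack) are made up for illustration. The point is the explicit FileChannel.force(true) call, which asks the OS to push the data to the storage device instead of leaving it in the page cache.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene: appends records and forces each one to disk.
final class DurableAppend {

    // Appends one record, then asks the OS to flush file content and metadata
    // to the storage device before returning.
    static void appendDurably(Path file, String record) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Without this call the bytes may sit in OS buffers and be lost on a system crash.
            channel.force(true);
        }
    }

    // Demo helper: writes the records to a temp file durably and returns the file content.
    static String writeAndReadBack(String... records) {
        try {
            Path file = Files.createTempFile("commit-log", ".txt");
            try {
                for (String record : records) {
                    appendDurably(file, record);
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(writeAndReadBack("doc-1 indexed\n", "doc-2 indexed\n"));
    }
}
```

Forcing on every write is exactly the cost that pushes you toward committing less often, which is the tension described above.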
Next, we have the issue of what Lucene calls updates. Updates in Lucene are actually just delete/add, and they don’t maintain the same document id (more on that later). Because of that, you usually have to have an additional field in the index that serves as your primary key, and you handle updates by first deleting and then adding. That is quite strange, to be fair, and it means that you can’t “extend” an index entry; you have to build it from scratch every time.
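The delete/add behavior can be sketched with a toy in-memory model. This is not Lucene's implementation, and PrimaryKeyIndex and all of its methods are hypothetical names. It only shows the shape of the problem: the caller supplies a primary key, an update marks the old entry deleted and appends a complete new one, and the internal document id changes as a result.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model (hypothetical, not Lucene): documents are {key, body} pairs,
// and the position in the list plays the role of the internal document id.
final class PrimaryKeyIndex {
    private final List<String[]> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    // Adds a document and returns its internal id.
    int add(String key, String body) {
        docs.add(new String[] {key, body});
        return docs.size() - 1;
    }

    // Marks every live document with this key as deleted; returns how many were marked.
    int deleteByKey(String key) {
        int removed = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i)[0].equals(key)) {
                deleted.set(i);
                removed++;
            }
        }
        return removed;
    }

    // An "update" is a delete followed by an add of the complete new document.
    int update(String key, String body) {
        deleteByKey(key);
        return add(key, body);
    }

    // Returns the internal id of the live document with this key, or -1.
    int find(String key) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i)[0].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    String body(int docId) {
        return docs.get(docId)[1];
    }

    public static void main(String[] args) {
        PrimaryKeyIndex index = new PrimaryKeyIndex();
        int before = index.add("users/1", "Oren");
        int after = index.update("users/1", "Ayende");
        System.out.println(before + " -> " + after); // prints "0 -> 1": the id did not survive the update
    }
}
```

In Lucene itself, IndexWriter.updateDocument(Term, Document) packages the same delete-then-add sequence, keyed on a term you choose.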
Speaking of this, let us talk a bit about deletes. Ignoring for the moment the absolutely horrendous decision to do deletes through the reader, let us talk about how they are actually done. Deletes are recorded in a separate file, and that means that the moment you have any deletes (or, as I mentioned, updates), all the internal statistics are wrong. We run into this quite often with RavenDB when we are doing things like facets or suggestions. For example, if you request a suggestion for a user name, it will happily give you suggestions for deleted users, even though we deleted them in Lucene.
The stale data will go away eventually, when Lucene is ready to optimize the index by merging all the files, but in the meantime, it makes for interesting bug reports. Speaking of merging, that is another common issue that you have to deal with. In order to ensure optimal performance, you have to be on top of the merge policy. This results in some interesting issues. For RavenDB’s purposes, we do a writer commit after every indexing batch. That means that if you are writing to RavenDB slowly enough, we do a commit after every document write. That results in a lot of segments, and the merge policy has to do a lot of merges. The problem here is that merges have two distinct costs associated with them.
First, and obviously, you are going to need to write (again) all of the documents in all of the segments you are merging. That is very similar to doing merges in LevelDB (indeed, in general Lucene’s file format is remarkably similar to SST). Next, and arguably more interesting / problematic from our point of view, is the fact that a merge also kills all of the caches. Let me try to explain. Lucene uses a lot of caches to speed things up; in fact, most of the sorting is done by using the caches. That works really well when we are querying normally, because segments are immutable, which makes for great caching. But on a merge, not only have we just invalidated all of our caches, we now need to read, again, all of the data that we just wrote, so that we are able to use it. That can be… costly. And both things can introduce stalls into the system.
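Here is a rough sketch of why immutable segments cache so well and why a merge throws that work away. It is a toy model, not Lucene's cache code, and SegmentCacheDemo, Segment, sortedValues and merge are hypothetical names. The cache is keyed by the segment object itself, so a merged segment is a brand new key and everything has to be loaded and sorted again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Toy model (hypothetical, not Lucene): a sort cache keyed by immutable segment objects.
final class SegmentCacheDemo {

    // An immutable segment; identity-based equals/hashCode make each instance a distinct cache key.
    static final class Segment {
        final List<String> values;

        Segment(List<String> values) {
            this.values = Collections.unmodifiableList(new ArrayList<>(values));
        }
    }

    // Entries disappear once a segment is no longer referenced anywhere else.
    private final Map<Segment, String[]> sortedBySegment = new WeakHashMap<>();

    // Counts how many times we had to read and sort a segment from scratch.
    int loads = 0;

    String[] sortedValues(Segment segment) {
        String[] cached = sortedBySegment.get(segment);
        if (cached == null) {
            loads++; // cache miss: pay the full cost of reading and sorting
            cached = segment.values.toArray(new String[0]);
            Arrays.sort(cached);
            sortedBySegment.put(segment, cached);
        }
        return cached;
    }

    // Merging produces a new segment object, so no existing cache entry applies to it.
    static Segment merge(Segment a, Segment b) {
        List<String> all = new ArrayList<>(a.values);
        all.addAll(b.values);
        return new Segment(all);
    }

    public static void main(String[] args) {
        SegmentCacheDemo cache = new SegmentCacheDemo();
        Segment s1 = new Segment(Arrays.asList("carol", "alice"));
        Segment s2 = new Segment(Arrays.asList("bob"));
        cache.sortedValues(s1);
        cache.sortedValues(s1); // served from the cache
        cache.sortedValues(s2);
        System.out.println(cache.loads); // prints 2
        Segment merged = merge(s1, s2);
        System.out.println(Arrays.toString(cache.sortedValues(merged))); // prints [alice, bob, carol]
        System.out.println(cache.loads); // prints 3: the merged segment started cold
    }
}
```

With many small segments and frequent merges, that cold start happens over and over, which is where the stalls come from.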
The major problem externally with merges is that document ids change, and that means that you cannot rely on them. It would be much easier if you could send an id out into the world, and get it back later and do something with it, but that isn’t possible with Lucene.
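Both effects, statistics that ignore deletes and ids that shift on merge, show up in a small toy model. This is a sketch under simplifying assumptions, not Lucene's on-disk format: one list stands in for the whole index, a BitSet stands in for the separate deletes file, and TombstoneIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model (hypothetical, not Lucene): deletes are recorded next to the data, not applied to it.
final class TombstoneIndex {
    private List<String> names = new ArrayList<>(); // position = document id
    private BitSet deleted = new BitSet();          // stands in for the separate deletes file

    int add(String name) {
        names.add(name);
        return names.size() - 1;
    }

    // Deleting only records a marker; the stored data is untouched.
    void delete(int docId) {
        deleted.set(docId);
    }

    // A statistic computed from the stored data alone, so it still counts deleted documents.
    int docFreq(String name) {
        int count = 0;
        for (String stored : names) {
            if (stored.equals(name)) {
                count++;
            }
        }
        return count;
    }

    // Returns the id of the first live document with this name, or -1.
    int idOf(String name) {
        for (int i = 0; i < names.size(); i++) {
            if (!deleted.get(i) && names.get(i).equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Rewrites the data without the deleted documents; surviving documents get new ids.
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            if (!deleted.get(i)) {
                live.add(names.get(i));
            }
        }
        names = live;
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        TombstoneIndex index = new TombstoneIndex();
        index.add("alice");
        int bob = index.add("bob");
        index.add("carol");
        index.delete(bob);
        System.out.println(index.docFreq("bob")); // prints 1: the deleted user still shows up in the statistics
        System.out.println(index.idOf("carol"));  // prints 2
        index.merge();
        System.out.println(index.docFreq("bob")); // prints 0
        System.out.println(index.idOf("carol"));  // prints 1: the id changed under us
    }
}
```

Anything that handed out the id 2 for carol before the merge now points at nothing, which is why an external primary key field is needed.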
As a good example, let us take the simplest possible query, TermQuery. Go ahead, try to find where it is actually doing the query for matching terms in this code: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermQuery.java
That actually happens here: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermScorer.java#L79, and it is effectively a side effect of calling reader.termDocs(term).
Read full article from The dark sides of Lucene - Ayende @ Rahien
