Too Many Words Again! | HathiTrust Digital Library



Every indexed term that's loaded into RAM creates 4 objects (TermInfo,
Term, String, char[]), as you see in your profiler output.  And each
object has a number of fields, the header required by the JRE, GC
cost, etc. [2]

Even though the tii files took only about 2.2 GB on disk (about 750 MB per index), once they are read into memory they occupy about 18 GB.
In Solr 1.4 and above there is a feature that lets you configure an “index divisor” for Solr. If you set this to 2, Solr will load only every other entry from the tii file into memory, halving the memory used by the in-memory representation of the tii file. The downside is that once you have a file pointer into the tis file and seek to it, in the worst case you have to scan twice as many entries. [3]
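The trade-off described above can be illustrated with a toy model. This is a minimal sketch and not Lucene's actual implementation: the class and method names are hypothetical, the term dictionary is reduced to a list of positions, and the interval of 128 is assumed here as the term index interval (believed to be Lucene's default). It shows that a divisor of 2 halves the number of index entries held in memory while roughly doubling the worst-case number of entries scanned after the seek.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of a term index with a divisor; hypothetical names, not Lucene code. */
final class IndexDivisorSketch {

    /** Positions in the sorted term dictionary (tis) whose index entries are kept in memory. */
    static List<Integer> loadedPositions(int termCount, int indexInterval, int divisor) {
        List<Integer> positions = new ArrayList<>();
        int step = indexInterval * divisor; // effective spacing between in-memory entries
        for (int pos = 0; pos < termCount; pos += step) {
            positions.add(pos);
        }
        return positions;
    }

    /** Entries scanned past the seek point before reaching the term at targetPos. */
    static int entriesScanned(List<Integer> loaded, int targetPos) {
        int idx = Collections.binarySearch(loaded, targetPos);
        if (idx < 0) {
            idx = -idx - 2; // last in-memory position that is <= targetPos
        }
        return targetPos - loaded.get(idx);
    }

    public static void main(String[] args) {
        int termCount = 1024;  // illustrative size only
        int interval = 128;    // assumed term index interval
        for (int divisor : new int[] {1, 2}) {
            List<Integer> loaded = loadedPositions(termCount, interval, divisor);
            int worstTarget = interval * divisor - 1; // term just before the second loaded entry
            System.out.println("divisor=" + divisor
                    + " inMemoryEntries=" + loaded.size()
                    + " worstCaseScan=" + entriesScanned(loaded, worstTarget));
        }
    }
}
```

In this model the in-memory entry count drops from 8 to 4 when the divisor goes from 1 to 2, and the worst-case scan grows from 127 to 255 entries.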
Here is how Solr is configured to set it to 2:
<!-- To set the termInfosIndexDivisor, do this: -->
<indexReaderFactory class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">2</int>
</indexReaderFactory>
We upgraded the Solr on our test server to Solr 1.4.1 (in production we are currently using a pre-1.4 development release) and ran some tests with different settings.
We set the termInfosIndexDivisor to 2 and then to 4, and ran a query against all 3 shards to cause the tii files to be loaded into memory. We then ran jmap to get a histogram dump of the heap. The table below shows the total memory use for the top 20 classes for each configuration, including a base configuration where we don’t set the index divisor.
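A total for the top 20 classes can be computed from a class histogram with a few lines of code. The sketch below is an illustration, not the script we used: it assumes the text layout printed by HotSpot's `jmap -histo <pid>` (a rank followed by a colon, the instance count, the byte count, then the class name, sorted by bytes in descending order), and the class name `HistoTopN` and the sample rows in `main` are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sums the #bytes column of the first n class rows in jmap -histo style text. */
final class HistoTopN {

    // Assumed row layout: "   1:       1234567      123456789  [C"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+\\d+\\s+(\\d+)\\s+\\S.*$");

    static long topBytes(String histo, int n) {
        long total = 0;
        int rows = 0;
        for (String line : histo.split("\\R")) {
            if (rows == n) {
                break; // only the first n class rows are counted
            }
            Matcher m = ROW.matcher(line);
            if (m.matches()) { // header, separator and "Total" lines do not match
                total += Long.parseLong(m.group(1));
                rows++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String sample =
                " num     #instances         #bytes  class name\n"
              + "----------------------------------------------\n"
              + "   1:           100           4000  [C\n"
              + "   2:           100           2400  java.lang.String\n"
              + "   3:            50           1200  org.apache.lucene.index.Term\n"
              + "Total           250           7600\n";
        System.out.println(topBytes(sample, 20));
    }
}
```

The result is in bytes; whether to divide by 10^9 or 2^30 to report GB depends on the convention in use.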

Configuration                       Total memory use for top 20 classes (GB)
Base (current production config)    17.9
Index divisor = 2                   9.6
Index divisor = 4                   6.1
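The relative savings implied by these figures can be worked out directly. The snippet below is plain arithmetic on the three measured totals; the class and method names are hypothetical.

```java
import java.util.Locale;

/** Percentage of memory saved relative to the base configuration. */
final class DivisorSavings {

    static double percentSaved(double baseGb, double newGb) {
        return 100.0 * (baseGb - newGb) / baseGb;
    }

    public static void main(String[] args) {
        double base = 17.9; // GB, base configuration
        double div2 = 9.6;  // GB, index divisor = 2
        double div4 = 6.1;  // GB, index divisor = 4
        System.out.println(String.format(Locale.ROOT,
                "divisor 2 saves %.1f%%, divisor 4 saves %.1f%%",
                percentSaved(base, div2), percentSaved(base, div4)));
    }
}
```

The savings come to roughly 46% for a divisor of 2 and 66% for a divisor of 4, somewhat less than the 50% and 75% that exact halving and quartering of the total would give.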
We ran some preliminary tests and so far have seen no significant impact on response time for index divisors of 2, 4, 8, and 16, with base memory use dropping as low as a little over 1 GB [4]. We plan to run a few more tests to decide which divisor to use and then work on JVM tuning; we should be able to eliminate long stop-the-world collections with the proper settings. Once that is done, we plan to upgrade our production Solrs to 1.4.1 and reduce the memory allocated to the JVM from 32 GB to some lower level, possibly as low as 8 GB. That will leave even more memory for the OS disk cache. When we finish the tests and settle on a configuration and JVM settings, we will report them on this blog.
Next up: "Adventures in Garbage Collection" and "Why are there So Many Words?"
Read full article from Too Many Words Again! | HathiTrust Digital Library
