How does Solr's cursorMark solve deep pagination while being stateless? - Stack Overflow



How does Solr's cursorMark solve deep pagination while being stateless? - Stack Overflow

cursorMark doesn't affect search - the search is still performed as it always is. cursorMark isn't an index or relevant for how the actual search is performed, but it's a strategy to allow efficient pagination through large data sets. This also means that your second question becomes moot, as it doesn't change anything about how the actual search is performed.

The reason why cursorMark solves deep pagination becomes apparent when you consider the case for a cluster of Solr servers, such as when running in SolrCloud mode.

Let's say you have four servers, A, B, C, and D, and want to retrieve 10 documents starting from row number 400 (we'll assume that one server == one shard of a larger collection to make this easier).

In the regular case, you'll have to start by retrieving (in sorted order, as each node will sort its result set according to your query - this isn't any different from any regular query as it will need to be sorted locally anyway), and then merging:

  • 410 documents from server A
  • 410 documents from server B
  • 410 documents from server C
  • 410 documents from server D

You now have to go through 1640 documents to find out what your actual result set will be, as it could just be that the 10 documents you're looking for, all lives on server C. Or maybe 350 on server B and the rest on server D. It's impossible to say without actually retrieving 410 documents form each server. The result set will be merged and sorted until 400 documents have been skipped and 10 documents has been found.

Now say you want 10 documents starting from row 1 000 000 - you'll have to retrieve 1 000 010 documents form each server, and merge and sort through a result set of 4 000 040 documents. You can see this becoming more and more expensive as the number of servers and documents increase, just to increase the starting point by 10 documents.

Instead, let's assume that you know what the global sort order (meaning the lexical sort value of the last document returned) is. The first query, without a cursorMark, will be the same as for regular pagination - get the first 10 documents from each server (since we're starting at the start of the result set (and not from position 400 as in the first example), we only need 10 from each server).

We process these 40 documents (a very manageable size), sort them and retrieve the 10 first documents, and then we include the global sort key (the cursorMark) of the last document. The client then includes this "global sort key" in the request, which allows us to say "OK, we're not interested in any entries that would be sorted in front of this document, as we've already shown those". The next query would then do:

  • 10 documents from server A, that would sort after cursorMark
  • 10 documents from server B, that would sort after cursorMark
  • 10 documents from server C, that would sort after cursorMark
  • 10 documents from server D, that would sort after cursorMark


Read full article from How does Solr's cursorMark solve deep pagination while being stateless? - Stack Overflow


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts