How does Solr's cursorMark solve deep pagination while being stateless? - Stack Overflow
cursorMark doesn't affect search - the search is still performed as it always is. cursorMark isn't an index and has no bearing on how the actual search is performed; it's a strategy that allows efficient pagination through large data sets. This also means that your second question becomes moot, as it doesn't change anything about how the actual search is performed.
The reason why cursorMark solves deep pagination becomes apparent when you consider the case for a cluster of Solr servers, such as when running in SolrCloud mode.
Let's say you have four servers, A, B, C, and D, and want to retrieve 10 documents starting from row number 400 (we'll assume that one server == one shard of a larger collection to make this easier).
In the regular case, you'll have to start by retrieving the following, in sorted order (each node sorts its result set according to your query, just as it would for any regular query, since it has to be sorted locally anyway), and then merging them:
- 410 documents from server A
- 410 documents from server B
- 410 documents from server C
- 410 documents from server D
You now have to go through 1640 documents to find out what your actual result set will be, as it could just be that the 10 documents you're looking for all live on server C. Or maybe 350 on server B and the rest on server D. It's impossible to say without actually retrieving 410 documents from each server. The result set will be merged and sorted until 400 documents have been skipped and 10 documents have been found.
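To make the merge step concrete, here is a minimal sketch in Python. It is a toy model and not Solr's actual implementation: the shards are plain sorted lists of integers, and the name `offset_page` and the way documents are spread across shards are assumptions made purely for illustration.

```python
import heapq
from itertools import islice

def offset_page(shards, start, rows):
    """Return (page, docs_fetched) for classic start/rows paging.

    `shards` is a list of lists, each already sorted locally
    (hypothetical stand-ins for the servers A, B, C and D).
    """
    # Every shard has to return its top start+rows documents, since the
    # coordinator cannot know in advance which shard holds the wanted rows.
    per_shard = [shard[:start + rows] for shard in shards]
    fetched = sum(len(part) for part in per_shard)
    # Merge the sorted partial lists, skip `start` documents, keep `rows`.
    merged = heapq.merge(*per_shard)
    page = list(islice(merged, start, start + rows))
    return page, fetched

# Four toy shards with 1000 documents each; document i lives on shard i % 4.
shards = [list(range(s, 4000, 4)) for s in range(4)]
page, fetched = offset_page(shards, start=400, rows=10)
print(page)     # the documents at global rows 400..409
print(fetched)  # 1640 documents pulled from the shards for a 10-row page
```

In this toy data the wanted rows happen to be spread evenly, but the coordinator still has to ask for 410 documents per shard because it cannot know that beforehand.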
Now say you want 10 documents starting from row 1 000 000 - you'll have to retrieve 1 000 010 documents from each server, and merge and sort through a result set of 4 000 040 documents. You can see this becoming more and more expensive as the number of servers and documents increases, just to move the starting point forward by 10 documents.
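The arithmetic behind those numbers is simply "servers times (start plus rows)". A tiny sketch, where `docs_to_merge` is a hypothetical helper name and one shard per server is assumed as above:

```python
def docs_to_merge(servers, start, rows):
    """Documents the coordinator must merge for one start/rows page."""
    # Each server returns its top start+rows documents.
    return servers * (start + rows)

print(docs_to_merge(4, 400, 10))        # 1640
print(docs_to_merge(4, 1_000_000, 10))  # 4000040
```

The cost grows linearly with the starting offset, even though the page itself never gets bigger than 10 documents.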
Instead, let's assume that you know where you are in the global sort order (meaning you know the lexical sort value of the last document returned). The first query, without a cursorMark, will be the same as for regular pagination - get the first 10 documents from each server. Since we're starting at the beginning of the result set, and not from position 400 as in the first example, we only need 10 from each server.
We process these 40 documents (a very manageable size), sort them and retrieve the first 10 documents, and then we include the global sort key (the cursorMark) of the last document in the response. The client then includes this "global sort key" in the next request, which allows us to say "OK, we're not interested in any entries that would be sorted in front of this document, as we've already shown those". The next query would then fetch:
- 10 documents from server A that would sort after cursorMark
- 10 documents from server B that would sort after cursorMark
- 10 documents from server C that would sort after cursorMark
- 10 documents from server D that would sort after cursorMark
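The same toy model can illustrate the cursor-based approach. Again, this is a sketch under stated assumptions and not how Solr is implemented: shards are sorted lists of unique integers, the cursor is simply the sort value of the last document returned (in real Solr the cursorMark is an opaque token, and the sort has to include the uniqueKey field so that ties can be broken), and `cursor_page` is a hypothetical name.

```python
import heapq
from bisect import bisect_right
from itertools import islice

def cursor_page(shards, cursor, rows):
    """Return (page, next_cursor, docs_fetched) for one cursor-based page.

    `cursor` is the sort value of the last document already returned,
    or None for the first request. Sort values are assumed to be unique.
    """
    per_shard = []
    for shard in shards:
        # Each shard skips everything that sorts at or before the cursor...
        lo = 0 if cursor is None else bisect_right(shard, cursor)
        # ...and returns only `rows` documents, no matter how deep we are.
        per_shard.append(shard[lo:lo + rows])
    fetched = sum(len(part) for part in per_shard)
    page = list(islice(heapq.merge(*per_shard), rows))
    next_cursor = page[-1] if page else cursor
    return page, next_cursor, fetched

# Four toy shards with 1000 documents each; document i lives on shard i % 4.
shards = [list(range(s, 4000, 4)) for s in range(4)]
cursor = None
for _ in range(41):  # walk 41 pages; the last one covers global rows 400..409
    page, cursor, fetched = cursor_page(shards, cursor, rows=10)
print(page)     # the documents at global rows 400..409
print(fetched)  # 40 documents merged for this page, same as for the first page
```

The server keeps no state between requests: everything it needs to continue is carried in the cursor value that the client sends back. The trade-off is that the client has to walk the pages in order instead of jumping straight to an arbitrary offset.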