Grouping and Joining in Lucene/Solr



Joining: the parent and each of its children are stored as separate documents.
‣ Two types:
‣ Index time join
‣ Query time join

Two block join queries:
‣ ToParentBlockJoinQuery
‣ ToChildBlockJoinQuery
‣ One Lucene collector:
‣ ToParentBlockJoinCollector
‣ Index time join requires block indexing.

Block indexing
‣ Atomically adding documents.
‣ A block of documents.
‣ Each document in the block gets a sequentially assigned Lucene document id.
‣ IndexWriter#addDocuments(docs);

‣ Index doesn't record blocks.
‣ Segment merging doesn’t re-order documents in a segment.
‣ App is responsible for identifying block documents.
‣ Marking the last document in a block.
‣ Adding a document to a block requires you to reindex the whole block.
‣ Removing a document from a block doesn’t require reindexing the block.

‣ Parent is the last document in a block.
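Because the parent is always the last document of its block and document ids inside a block are sequential, block boundaries can be recovered from nothing more than a set of "this document is a parent" marker bits. The following is a minimal sketch of that idea using only `java.util.BitSet`; the class and method names are hypothetical and this is not Lucene's actual implementation.

```java
import java.util.BitSet;

/** Hypothetical sketch: resolves block boundaries from parent marker bits. */
final class BlockLayoutSketch {

    /**
     * Returns the doc id of the parent that closes the block containing docId.
     * A parent maps to itself; -1 means no parent follows docId.
     */
    static int parentOf(BitSet parents, int docId) {
        return parents.nextSetBit(docId);
    }

    /**
     * Returns the doc id of the first child in the block closed by parentDocId.
     * If the result equals parentDocId, the block has no children.
     */
    static int firstChildOf(BitSet parents, int parentDocId) {
        // The block starts right after the previous parent.
        // previousSetBit(-1) returns -1, so the first block starts at doc 0.
        return parents.previousSetBit(parentDocId - 1) + 1;
    }
}
```

For example, with parents marked at doc ids 2 and 5, documents 0 and 1 belong to the block closed by 2, and documents 3 and 4 belong to the block closed by 5.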

Query time joining
‣ Query time joining is executed in two phases and is field based:
‣ fromField
‣ toField
‣ Doesn’t require block indexing.
‣ The first phase collects all the terms in the fromField for the documents that match the original query.
‣ Currently doesn’t take the score of the original query into account.
‣ The second phase returns the documents whose toField matches one of the terms collected in the first phase.
‣ Two different implementations:
‣ JoinUtil - Lucene (≥ 3.6)
‣ Join query parser - Solr (trunk)
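The two phases can be illustrated with a small in-memory sketch. It models documents as plain field-to-value maps and the original query as a predicate; these are assumptions made for illustration and do not reflect how JoinUtil or the Solr join query parser are implemented. The class name and the field names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical in-memory sketch of a two-phase, field-based join. */
final class QueryTimeJoinSketch {

    static List<Map<String, String>> join(List<Map<String, String>> docs,
                                          Predicate<Map<String, String>> fromQuery,
                                          String fromField,
                                          String toField) {
        // Phase 1: collect the fromField terms of all documents
        // that match the original query. Scores are not tracked.
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String term = doc.get(fromField);
            if (term != null && fromQuery.test(doc)) {
                terms.add(term);
            }
        }

        // Phase 2: return the documents whose toField value
        // is one of the collected terms.
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String term = doc.get(toField);
            if (term != null && terms.contains(term)) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

With item documents carrying a `product_id` field and product documents carrying an `id` field, calling `join(docs, d -> "red".equals(d.get("color")), "product_id", "id")` returns the products that have at least one red item.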

The joining module offers good solutions for modeling parent-child relations.
‣ Use block join if you care about scoring.
‣ Frequent updates can be problematic.
‣ Use query time join for parent child filtering.
‣ Query time join is slower than index time join.

Result grouping
‣ Group matching documents that share a common property.
‣ Search hit represents a group.
‣ Facet counts & total hit count represent groups.
‣ Per group collect information
‣ Most relevant document.
‣ Top three documents.
‣ Aggregated counts

‣ Group documents by a shared property
‣ Product-item by product id (Parent child)
‣ Collapse similar looking documents
‣ E.g. all results from the Wikipedia domains.
‣ Remove duplicates from the search result.
‣ Based on a field that contains a hash
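Duplicate removal based on a hash field amounts to grouping by that field and keeping only the best hit per group. A minimal sketch, assuming the hits arrive already sorted by relevance and that each document's hash is available in a lookup map (both are assumptions for illustration, as are all names):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: collapses ranked hits that share the same hash value. */
final class HashCollapseSketch {

    /** Keeps the first (best ranked) document id for every distinct hash. */
    static List<String> collapse(List<String> rankedDocIds,
                                 Map<String, String> hashByDocId) {
        Set<String> seenHashes = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String docId : rankedDocIds) {
            String hash = hashByDocId.get(docId);
            // Documents without a hash are never treated as duplicates.
            if (hash == null || seenHashes.add(hash)) {
                unique.add(docId);
            }
        }
        return unique;
    }
}
```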

Two pass result grouping.
‣ Grouping by indexed field, function or doc values.
‣ Single pass result grouping.
‣ Requires block indexing.

First pass collects the top N groups.
‣ Per group: group value + sort value
‣ Second pass collects data for each top group.
‣ The top N documents per group.
‣ Possible other aggregated information.
‣ Second pass search ignores all documents outside topN groups.
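The two passes can be sketched over an in-memory list of scored hits. This is a simplified illustration, not the Lucene grouping module: it ranks groups by their best score, sorts whole lists instead of using bounded priority queues, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch of two-pass result grouping. */
final class TwoPassGroupingSketch {

    static final class Hit {
        final String group;
        final int docId;
        final double score;

        Hit(String group, int docId, double score) {
            this.group = group;
            this.docId = docId;
            this.score = score;
        }
    }

    /** First pass: the top N group values, ranked by each group's best score. */
    static List<String> topGroups(List<Hit> hits, int topN) {
        // Per group: group value + sort value (here, the best score).
        Map<String, Double> best = new HashMap<>();
        for (Hit hit : hits) {
            best.merge(hit.group, hit.score, Math::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> {
            int byScore = Double.compare(best.get(b), best.get(a)); // descending
            return byScore != 0 ? byScore : a.compareTo(b);         // stable tie-break
        });
        return new ArrayList<>(groups.subList(0, Math.min(topN, groups.size())));
    }

    /** Second pass: the top documents for each of the previously selected groups. */
    static Map<String, List<Hit>> topDocsPerGroup(List<Hit> hits,
                                                  List<String> groups,
                                                  int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String group : groups) {
            result.put(group, new ArrayList<>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = result.get(hit.group);
            if (bucket != null) { // documents outside the top groups are ignored
                bucket.add(hit);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Double.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }
}
```

The second pass only needs the group values returned by the first pass, which is why it can skip every document that belongs to another group.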

Parent child result
Result grouping
‣ TopGroups - Equivalent to TopDocs.
‣ Hit count
‣ Group count
‣ Groups
‣ Top documents
‣ Facet and total count can represent groups instead of documents.
‣ But requires more query time.

Compare the parent child solutions
Conclusion
‣ Result grouping
‣ + Distributed support & Parent child relation as hit.
‣ - Parent data duplication
‣ - Impact on query time
‣ Joining
‣ + Fast & no data duplication
‣ - Index time join not optimal for updates
‣ - Query time join is limited.

‣ Compound documents
‣ + Fast and works out-of-the box with all features.
‣ - Not flexible when it comes to updates.
‣ - Document granularity is set in stone.
Please read the full article: Grouping and Joining in Lucene/Solr
