All About Programming: Grouping and Joining in Lucene/Solr

Grouping and Joining in Lucene/Solr
The parent and each child are stored as separate documents.
‣ Two types:
‣ Index time join
‣ Query time join

Two block join queries:
‣ ToParentBlockJoinQuery
‣ ToChildBlockJoinQuery
‣ One Lucene collector:
‣ ToParentBlockJoinCollector
‣ Index time join requires block indexing.

Block indexing
‣ Atomically adding documents.
‣ A block of documents.
‣ Each document gets a sequentially assigned Lucene document id.
‣ IndexWriter#addDocuments(docs);

‣ The index doesn't record blocks.
‣ Segment merging doesn’t re-order documents in a segment.
‣ App is responsible for identifying block documents.
‣ Marking the last document in a block.
‣ Adding a document to a block requires you to reindex the whole block.
‣ Removing a document from a block doesn’t require reindexing the block.

‣ Parent is the last document in a block.
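
The block layout above can be illustrated with a small plain-Java model. This is only a sketch and not Lucene code: the class name `BlockJoinSketch`, the sample doc ids, and the helper methods are hypothetical. It assumes, as the bullets describe, that the documents of a block have consecutive ids and that the application has marked the last document of each block as the parent. Under that assumption a join becomes a scan over a bit set of parent ids.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Plain-Java model of a block-indexed segment; this is not Lucene code.
class BlockJoinSketch {

    // Doc ids 0..6 hold two blocks: [child, child, PARENT] and
    // [child, child, child, PARENT]. The application marks the last
    // document of each block as the parent (hypothetical sample data).
    static BitSet parentBits() {
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(6);
        return parents;
    }

    // Child-to-parent join: the parent is the first marked doc id
    // at or after the child's doc id.
    static int parentOf(int childDocId, BitSet parents) {
        return parents.nextSetBit(childDocId);
    }

    // Parent-to-child join: the children sit between the previous
    // parent (exclusive) and this parent (exclusive).
    static List<Integer> childrenOf(int parentDocId, BitSet parents) {
        int firstChild = parents.previousSetBit(parentDocId - 1) + 1;
        List<Integer> children = new ArrayList<>();
        for (int doc = firstChild; doc < parentDocId; doc++) {
            children.add(doc);
        }
        return children;
    }

    public static void main(String[] args) {
        BitSet parents = parentBits();
        System.out.println("parent of doc 4: " + parentOf(4, parents));
        System.out.println("children of doc 6: " + childrenOf(6, parents));
    }
}
```

The sketch also shows why adding a document to a block means reindexing the block: a new child has to land inside the consecutive id range that ends at the parent.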

Query time joining
‣ Query time joining is executed in two phases and is field based:
‣ fromField
‣ toField
‣ Doesn’t require block indexing.
‣ The first phase collects all the terms in the fromField for the documents that match the original query.
‣ Currently doesn’t take the score from the original query into account.
‣ The second phase returns the documents whose toField matches the terms collected in the first phase.
‣ Two different implementations:
‣ JoinUtil - Lucene (≥ 3.6)
‣ Join query parser - Solr (trunk)
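
The two phases can be sketched in plain Java over in-memory field maps. This is an illustration of the idea only, not the JoinUtil or Solr join parser API: the class `QueryTimeJoinSketch`, its methods, and the sample product/item documents are all hypothetical. Note that the result is a plain match list with no scores, in line with the limitation mentioned above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Plain-Java model of the two-phase join; documents are simple
// field maps, not Lucene documents.
class QueryTimeJoinSketch {

    // Phase 1: collect the fromField terms of every document
    // that matches the original query.
    static Set<String> collectTerms(List<Map<String, String>> docs,
                                    Predicate<Map<String, String>> query,
                                    String fromField) {
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String term = doc.get(fromField);
            if (term != null && query.test(doc)) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Phase 2: return the documents whose toField holds one of
    // the terms collected in phase 1.
    static List<Map<String, String>> join(List<Map<String, String>> docs,
                                          Predicate<Map<String, String>> query,
                                          String fromField,
                                          String toField) {
        Set<String> terms = collectTerms(docs, query, fromField);
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(toField);
            if (value != null && terms.contains(value)) {
                result.add(doc);
            }
        }
        return result;
    }

    // Hypothetical sample data: two products and three items.
    static List<Map<String, String>> sampleDocs() {
        return List.of(
            Map.of("id", "p1", "type", "product", "name", "shirt"),
            Map.of("id", "p2", "type", "product", "name", "jeans"),
            Map.of("id", "i1", "type", "item", "productId", "p1", "color", "red"),
            Map.of("id", "i2", "type", "item", "productId", "p2", "color", "blue"),
            Map.of("id", "i3", "type", "item", "productId", "p1", "color", "blue"));
    }
}
```

For example, `join(sampleDocs(), d -> "red".equals(d.get("color")), "productId", "id")` first collects the product ids of the red items and then returns the products with those ids. No block indexing is involved, since the link is carried entirely by field values.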

The joining module offers good solutions for modeling parent-child relations.
‣ Use block join if you care about scoring.
‣ Frequent updates can be problematic.
‣ Use query time join for parent-child filtering.
‣ Query time join is slower than index time join.

Result grouping
‣ Group matching documents that share a common property.
‣ Search hit represents a group.
‣ Facet counts & total hit count represent groups.
‣ Collect information per group:
‣ Most relevant document.
‣ Top three documents.
‣ Aggregated counts

‣ Group documents by a shared property
‣ Product-item by product id (Parent child)
‣ Collapse similar looking documents
‣ E.g. all results from the Wikipedia domains.
‣ Remove duplicates from the search result.
‣ Based on a field that contains a hash

Two pass result grouping.
‣ Grouping by indexed field, function or doc values.
‣ Single pass result grouping.
‣ Requires block indexing.

First pass collects the top N groups.
‣ Per group: group value + sort value
‣ Second pass collects data for each top group.
‣ The top N documents per group.
‣ Possible other aggregated information.
‣ The second pass search ignores all documents outside the top N groups.
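
The two passes can be sketched in plain Java as follows. This is a simplified model and not the Lucene grouping module: the class `TwoPassGroupingSketch`, the `Hit` type, and the sample scores are hypothetical, and the sort value of a group is assumed to be the best score among its documents. A real collector would likely use bounded priority queues rather than sorting whole buckets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java model of two-pass grouping; this is not Lucene code.
class TwoPassGroupingSketch {

    // A matching document: id, value of the group field, relevance score.
    static final class Hit {
        final String id;
        final String group;
        final double score;

        Hit(String id, String group, double score) {
            this.id = id;
            this.group = group;
            this.score = score;
        }
    }

    // First pass: track group value + sort value (here: best score)
    // and return the values of the top N groups.
    static List<String> topGroups(List<Hit> hits, int topN) {
        Map<String, Double> bestScore = new LinkedHashMap<>();
        for (Hit hit : hits) {
            bestScore.merge(hit.group, hit.score, Double::max);
        }
        return bestScore.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topN)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Second pass: collect the top documents for each top group.
    // Documents outside the top groups are ignored.
    static Map<String, List<Hit>> topDocsPerGroup(List<Hit> hits,
                                                  List<String> groups,
                                                  int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String group : groups) {
            result.put(group, new ArrayList<>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = result.get(hit.group);
            if (bucket != null) {
                bucket.add(hit);
            }
        }
        for (Map.Entry<String, List<Hit>> entry : result.entrySet()) {
            List<Hit> bucket = entry.getValue();
            bucket.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            int keep = Math.min(docsPerGroup, bucket.size());
            entry.setValue(new ArrayList<>(bucket.subList(0, keep)));
        }
        return result;
    }

    // Hypothetical sample data: six hits spread over groups A, B and C.
    static List<Hit> sampleHits() {
        return List.of(
            new Hit("a1", "A", 0.9), new Hit("b1", "B", 0.8),
            new Hit("a2", "A", 0.7), new Hit("c1", "C", 0.3),
            new Hit("b2", "B", 0.2), new Hit("a3", "A", 0.1));
    }
}
```

With the sample data, asking for the top two groups and two documents per group keeps groups A and B, while the hit in group C is never looked at in the second pass.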

Parent child result
Result grouping
‣ TopGroups - Equivalent to TopDocs.
‣ Hit count
‣ Group count
‣ Groups
‣ Top documents
‣ Facet and total count can represent groups instead of documents.
‣ But requires more query time.
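
The difference between hit count and group count can be shown with a tiny plain-Java sketch. The class `GroupCountSketch` and its method are hypothetical and this is not how Lucene or Solr computes the numbers internally. It only illustrates that a document count is a simple counter, while a group count has to keep track of every distinct group value seen, which is one way to picture the extra work.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of hit count versus group count.
class GroupCountSketch {

    // Takes the group value of each matching document and returns
    // {hit count, group count}.
    static int[] counts(List<String> groupValuesOfMatches) {
        Set<String> distinctGroups = new HashSet<>();
        int hitCount = 0;
        for (String groupValue : groupValuesOfMatches) {
            hitCount++;                     // one per matching document
            distinctGroups.add(groupValue); // one per distinct group
        }
        return new int[] {hitCount, distinctGroups.size()};
    }
}
```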

Compare the parent child solutions
Conclusion
‣ Result grouping
‣ + Distributed support & Parent child relation as hit.
‣ - Parent data duplication
‣ - Impact on query time
‣ Joining
‣ + Fast & no data duplication
‣ - Index time join not optimal for updates
‣ - Query time join is limited.

Compound documents.
‣ + Fast and works out of the box with all features.
‣ - Not flexible when it comes to updates.
‣ - Document granularity is set in stone.
Please read full article from Grouping and Joining in Lucene/Solr
