New in Solr 4.8: Document Expiration - Lucidworks



The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:
  • Periodically deleting documents from the index based on an expiration field
  • Computing expiration field values for documents from a “time to live” (TTL)

Auto-Delete Expired Documents

The most significant feature of this Update Processor is its ability to automatically delete documents based on the values found in an “expiration date” field that you configure. This automatic deletion isn’t part of the normal Update Processor life cycle; it’s executed by a background timer thread created by the Factory.
To use this automatic deletion feature, you must configure two options on the Factory:
  • expirationFieldName – The name of the expiration field to use
  • autoDeletePeriodSeconds – How often the factory’s timer should trigger a delete to remove the documents
For example, with the configuration below, the DocExpirationUpdateProcessorFactory will create a timer thread that wakes up every 30 seconds. When the timer triggers, it will execute a deleteByQuery command to remove any documents whose press_release_expiration_date field value is in the past:
 <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
   <int name="autoDeletePeriodSeconds">30</int>
   <str name="expirationFieldName">press_release_expiration_date</str>
 </processor>
After the deleteByQuery has been executed, a soft commit is also executed using openSearcher=true so that search results will no longer include the expired documents.
While the basic logic of “timer goes off, delete docs with expiration prior to NOW” was fairly simple and straightforward to add, a key aspect of making this work well was a related issue (SOLR-5783), which ensures that openSearcher=true doesn’t do anything unless there really are changes in the index. This means that you can configure autoDeletePeriodSeconds to very small values, and still rest easy that your search caches won’t get blown away every few seconds for no reason. The openSearcher=true soft commits will only affect things if there really are changes in the index.
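To make the periodic action concrete, here is a rough sketch of the XML update messages you could post by hand to get a similar effect. This is only an approximation for illustration: the Factory builds and issues its own deleteByQuery internally, and the field name is simply the one from the configuration above.

```xml
<!-- Request 1: delete every document whose expiration date is in the past.
     Approximation only; not the exact query the Factory generates. -->
<delete>
  <query>press_release_expiration_date:[* TO NOW]</query>
</delete>
```

```xml
<!-- Request 2: soft commit that opens a new searcher so the deletes become visible -->
<commit softCommit="true" openSearcher="true"/>
```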

Compute Expiration Date from TTL

The second feature implemented by this Factory (and the key reason it’s implemented as an UpdateProcessorFactory) is to use “TTL” (Time To Live) values associated with documents to automatically generate an expiration date value to put in the expirationFieldName when documents are indexed.
By default, the DocExpirationUpdateProcessorFactory will look for a _ttl_ request parameter on update requests, as well as a _ttl_ field in each doc that is indexed in that request. If either exists, it will be parsed as a Date Math Expression relative to NOW and used to populate the expirationFieldName. The per-document _ttl_ field values override the per-request _ttl_ parameter.
Both the request parameter and field names used for specifying TTL values can be overridden by configuring ttlParamName and ttlFieldName on the DocExpirationUpdateProcessorFactory. They can also be completely disabled by configuring them as null. It’s also possible to use the TTL computation feature to generate expiration dates on documents without using the auto-deletion feature, simply by not configuring the autoDeletePeriodSeconds option (so that the timer will never run).
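As a sketch of the default behavior, an XML update message might carry a per-document TTL like the one below. The document ids are hypothetical. The sketch assumes the Factory is configured with an expirationFieldName and the default _ttl_ names, and that the schema accepts a _ttl_ field (or that a later processor strips it). A request-level default would be passed as a _ttl_ parameter on the update URL, with the + sign URL-encoded as %2B.

```xml
<add>
  <!-- per-document TTL: the expiration field is set to NOW+2HOURS at index time -->
  <doc>
    <field name="id">pr-001</field>
    <field name="_ttl_">+2HOURS</field>
  </doc>
  <!-- no _ttl_ field: falls back to the _ttl_ request parameter, if one was sent -->
  <doc>
    <field name="id">pr-002</field>
  </doc>
</add>
```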
For example, in the configuration below, the Factory will look for a time_to_live field in each document, and use that to compute an expiration value for the press_release_expiration_date field. No request parameters will be checked for a TTL override, and no automatic deletion will occur:
 <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
   <str name="expirationFieldName">press_release_expiration_date</str>
   <null name="ttlParamName"/> <!-- ignore _ttl_ request param -->
   <str name="ttlFieldName">time_to_live</str>
   <!-- NOTE: autoDeletePeriodSeconds not specified, no automatic deletes -->
 </processor>
This sort of configuration may be handy if you only want to logically hide documents for search clients based on a per-document TTL using something like: fq=-press_release_expiration_date:[* TO NOW/DAY], but still retain the documents in the index for other search clients.
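One way to apply such a filter for a particular class of clients is to append it in a dedicated request handler. The following is a minimal sketch; the handler name /public is hypothetical and not part of the original example.

```xml
<!-- Hypothetical handler for search clients that should not see expired documents -->
<requestHandler name="/public" class="solr.SearchHandler">
  <lst name="appends">
    <!-- exclude documents whose expiration date is at or before the start of the current day -->
    <str name="fq">-press_release_expiration_date:[* TO NOW/DAY]</str>
  </lst>
</requestHandler>
```

Clients that query a handler without this appended filter would still see every document in the index.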

An In Depth Example

Let’s walk through a full example of both features by modifying the Solr 4.8 example solrconfig.xml to add the following update processor chain:
  <updateRequestProcessorChain default="true">
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">timestamp_dt</str>
    </processor>
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <int name="autoDeletePeriodSeconds">30</int>
      <str name="ttlFieldName">time_to_live_s</str>
      <str name="expirationFieldName">expire_at_dt</str>
    </processor>
    <processor class="solr.FirstFieldValueUpdateProcessorFactory">
      <str name="fieldName">expire_at_dt</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
A few things to note about this chain:
  • It contains a simple TimestampUpdateProcessorFactory so that it will be easy to see when these documents were indexed in the query results I show below, but this is not needed for DocExpirationUpdateProcessorFactory to function
  • The DocExpirationUpdateProcessorFactory instance uses an autoDeletePeriodSeconds of 30 seconds and overrides the ttlFieldName, but the _ttl_ request param is still enabled
  • FirstFieldValueUpdateProcessorFactory is configured on the expire_at_dt field; this means that if a document is added with an explicit value in the expire_at_dt field, it will be used instead of any value that might be added by the DocExpirationUpdateProcessorFactory using the _ttl_ request param
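To illustrate how documents might interact with this chain, here is a small XML update message. The ids and values are made up for illustration, and the sketch assumes the example schema provides the *_s and *_dt dynamic fields used by the chain above.

```xml
<add>
  <!-- per-document TTL: expire_at_dt is computed as NOW+30SECONDS at index time -->
  <doc>
    <field name="id">a</field>
    <field name="time_to_live_s">+30SECONDS</field>
  </doc>
  <!-- explicit expiration date: kept as the first (and only retained) expire_at_dt value -->
  <doc>
    <field name="id">b</field>
    <field name="expire_at_dt">2030-01-01T00:00:00Z</field>
  </doc>
  <!-- no TTL field and no explicit date: expires only if a _ttl_ request param was sent -->
  <doc>
    <field name="id">c</field>
  </doc>
</add>
```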
Read full article from New in Solr 4.8: Document Expiration – Lucidworks
