Solr 4.8 Features Overview




Complex Phrase Queries

The complexphrase query parser can produce phrase queries with embedded wildcards and boolean queries.
It works via multiple passes, parsing a query and then re-parsing any phrase queries for additional markup. At query execution time, span queries are generated to implement the complex phrase logic.
The simplest example is a phrase query containing a prefix query:
q={!complexphrase}"apple ip*"
This will match text containing either “apple ipod” or “apple ipad”.
One can specify inOrder=false as a localParam to also match “ipod apple” and “ipad apple”.
q={!complexphrase inOrder=false}"apple ip*"
One can also specify a different default field to search with the df localParam:
q={!complexphrase df=name}"john* smith"
This will match both “john smith” and “johnathan smith” in the name field. Of course one could always specify the field directly in the query as well:
q={!complexphrase}name:"john* smith"
Phrase slop controls how close together the clauses must be. For example, the following would also match a name of “johnathan q smith”:
q={!complexphrase}name:"john* smith"~1
And of course we can throw in parens, OR clauses, and other complex logic as well:
q={!complexphrase}name:"(aaa OR (bbb* OR ccc)) ddd -eee (fff~1 OR ggg)" AND text:"nnn? (ooo OR ppp) -qqq www"~3
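When sending these queries from a client, the local params and quotes need to be URL-encoded. The following is a minimal sketch of how one might build such a request URL in Python. The `complexphrase_url` helper and the `/solr/collection1/select` endpoint are assumptions made for illustration, not something defined by Solr itself.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def complexphrase_url(base_url, phrase_query, in_order=True, default_field=None):
    """Build a URL whose q parameter uses the complexphrase query parser.

    Hypothetical helper: it only assembles the local params shown in the
    examples above (inOrder and df) and URL-encodes the result.
    """
    local_params = ["complexphrase"]
    if not in_order:
        local_params.append("inOrder=false")
    if default_field:
        local_params.append(f"df={default_field}")
    q = "{!" + " ".join(local_params) + "}" + phrase_query
    return base_url + "?" + urlencode({"q": q})


# Assumed endpoint: the example server's default core and select handler.
url = complexphrase_url(
    "http://localhost:8983/solr/collection1/select",
    '"apple ip*"',
    in_order=False,
)
print(url)
```

Decoding the `q` parameter of the resulting URL gives back the same string as the inOrder=false example above.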
Indexing Child Documents in JSON

Named Config Sets

This is more in the “configuration” category of features. SolrCloud has always allowed multiple collections to share configuration, and now that capability has been brought to Solr’s non-cloud mode.
Since collections can be created or destroyed, we obviously don’t want shared configuration for these collections to be under the collection itself. The default location for config sets is in the “configsets” directory under the solr home (the example solr server currently doesn’t have this directory by default).
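As a rough sketch of how a named config set might be used, a core can be created through the CoreAdmin API with a `configSet` parameter that names a directory under “configsets”. The core name `products` and the config set name `shared_conf` below are hypothetical, and the host and port are assumed to be those of the example server.

```python
from urllib.parse import urlencode


def create_core_url(solr_base, core_name, config_set):
    """Build a CoreAdmin CREATE URL that points the new core at a named config set."""
    params = {"action": "CREATE", "name": core_name, "configSet": config_set}
    return f"{solr_base}/admin/cores?{urlencode(params)}"


# Assumes a directory <solr home>/configsets/shared_conf/conf/ containing
# solrconfig.xml and schema.xml has been created beforehand.
print(create_core_url("http://localhost:8983/solr", "products", "shared_conf"))
```

Several cores created this way with the same `configSet` value would share one copy of the configuration.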

Stopwords and Synonyms REST API

Stopwords and Synonyms may now be managed via a REST API!
The new analysis filter types are ManagedStopFilterFactory and ManagedSynonymFilterFactory.
The example schema.xml now contains a field type that uses these new analysis filters:
<!-- A text type for English text where stopwords and synonyms are managed using the REST API -->
<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english" />
    <filter class="solr.ManagedSynonymFilterFactory" managed="english" />
  </analyzer>
</fieldType>
To test this out, let’s also change the dynamic field *_en to use managed_en:
<dynamicField name="*_en"  type="managed_en"    indexed="true"  stored="true" multiValued="true"/>

Synonyms

After starting the example server, we can retrieve the current English synonyms:
curl "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
[...]
    "managedMap":{
      "gb":["gib",
        "gigabyte"],
      "happy":["glad",
        "joyful"],
      "tv":["television"]}}}

Let’s add a new synonym:
curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english" -H 'Content-type:application/json' --data-binary '{"mb":["MiB","megabyte"]}'

Before these changes are visible to the actual search or indexing code in Solr, we need to reload the Solr core:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

And now we can do a query on a field that matches the dynamicField we set up and can see the results of the new synonym:
[...]
  "debug":{
    "rawquerystring":"foo_en:mb",
    "querystring":"foo_en:mb",
    "parsedquery":"(foo_en:megabyte foo_en:mib)/no_coord",
    "parsedquery_toString":"foo_en:megabyte foo_en:mib",

To delete the synonym we just added:
curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english/mb"

Stopwords

To retrieve the list of stopwords:
curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"
To add a new stopword:
curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english" -H 'Content-type:application/json' --data-binary '["foo"]'
To delete the stopword we just added:
curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo"
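The same REST calls can be scripted. Below is a minimal Python sketch that builds (but does not send) requests against the managed stopword and synonym endpoints. The `managed_request` helper is hypothetical, and the sketch assumes that a single term is removed by issuing a DELETE on a child path named after the term.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Same base URL as the curl examples; host, port and core name are the example defaults.
BASE = "http://localhost:8983/solr/collection1/schema/analysis"


def managed_request(kind, name, method="GET", payload=None, term=None):
    """Build a request for a managed resource.

    kind is "stopwords" or "synonyms", name is the managed set (e.g. "english").
    payload is JSON-encoded for PUT; term addresses a single entry for DELETE
    (assumption: the entry is exposed as a child path of the managed set).
    """
    url = f"{BASE}/{kind}/{name}"
    if term is not None:
        url += "/" + quote(term, safe="")
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


list_stopwords = managed_request("stopwords", "english")
add_stopword = managed_request("stopwords", "english", "PUT", payload=["foo"])
delete_stopword = managed_request("stopwords", "english", "DELETE", term="foo")
add_synonym = managed_request(
    "synonyms", "english", "PUT", payload={"mb": ["MiB", "megabyte"]}
)

# To actually send one of these: urllib.request.urlopen(add_stopword)
```

As with the curl commands, the core has to be reloaded before changes made this way affect indexing or searching.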

Other changes

There have been numerous SolrCloud changes, including:
  • A new list collections and cluster status API, which clients can use to read collection and shard information instead of reading data directly from ZooKeeper.
  • Some long-running SolrCloud commands (like shard splitting) may now be run in “async” mode to avoid client timeouts.
  • A new ADDREPLICA command in the Collections API.
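To illustrate the “async” mode, here is a small Python sketch that builds Collections API URLs for an asynchronous shard split and for polling its status afterwards. It assumes the `async` and `requestid` parameters of the Collections API; the request id `1000`, the collection and shard names, and the `collections_url` helper are made up for the example.

```python
from urllib.parse import urlencode

# Assumed host and port of a SolrCloud node.
COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"


def collections_url(action, params):
    """Build a Collections API URL for the given action and parameters."""
    return COLLECTIONS_API + "?" + urlencode({"action": action, **params})


# Start a shard split without waiting for it to finish.
split = collections_url(
    "SPLITSHARD",
    {"collection": "collection1", "shard": "shard1", "async": "1000"},
)

# Later, ask for the status of that request by its id.
status = collections_url("REQUESTSTATUS", {"requestid": "1000"})

print(split)
print(status)
```

The client picks the request id itself, so it can return immediately and check back later instead of holding a connection open.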
Other changes include:
  • Solr 4.8 now requires Java 7!
  • RegexReplaceProcessorFactory now supports pattern capture group substitution in the replacement string.
  • A DocExpirationUpdateProcessorFactory that can mark documents based on a TTL (time-to-live) and periodically delete expired documents.
Read full article from Solr 4.8 Features Overview
