Testing Solr schema, analyzers and tokenization - Pathbreak Developer Notebook
Using tests to tune the accuracy of search results is critical: accuracy depends to a great extent on the analyzers, tokenizers and filters configured in the Solr schema.

Testing and refining their behaviour on a standalone Solr server is unproductive and time consuming, involving repeated cycles of deleting documents, stopping the server, changing the schema, restarting the server, and reindexing documents.
It would be far more productive if analyzer tweaks could be tested quickly on small fragments of text to see how they will be tokenized and searched, before modifying the Solr schema.

Unit testing Solr tokenizers, token filters and char filters
The snippet below builds a whitespace tokenizer → lowercase filter → Snowball stemmer chain directly from Solr's factory classes and prints the resulting tokens:
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.analysis.TokenFilterFactory;
import org.apache.solr.analysis.TokenizerFactory;
import org.apache.solr.analysis.WhitespaceTokenizerFactory;

public static void main(String[] args) {
    try {
        StringReader inputText = new StringReader("RUNNING runnable");

        // Split the input on whitespace.
        Map<String, String> tkargs = new HashMap<String, String>();
        tkargs.put("luceneMatchVersion", "LUCENE_33");
        TokenizerFactory tkf = new WhitespaceTokenizerFactory();
        tkf.init(tkargs);
        Tokenizer tkz = tkf.create(inputText);

        // Lowercase each token.
        LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
        lcf.init(tkargs);
        TokenStream lcts = lcf.create(tkz);

        // Stem each token with the Snowball English stemmer.
        TokenFilterFactory fcf = new SnowballPorterFilterFactory();
        Map<String, String> params = new HashMap<String, String>();
        params.put("language", "English");
        fcf.init(params);
        TokenStream ts = fcf.create(lcts);

        // Walk the final token stream and print each term.
        CharTermAttribute termAttrib = (CharTermAttribute) ts.getAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(termAttrib.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
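Conceptually, the first two stages of that chain (tokenize on whitespace, then lowercase) can be approximated in plain Java, with no Solr or Lucene dependencies. The sketch below is only an illustration of what the chain does to the input, not the Solr implementation; the Snowball stemming stage is omitted since Solr delegates it to Lucene's stemmer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class AnalysisChainSketch {

    // Plain-Java approximation of the whitespace-tokenize + lowercase steps.
    // (Stemming is omitted; a real Solr chain would stem "running" further.)
    static List<String> analyze(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("RUNNING runnable")); // [running, runnable]
    }
}
```

Writing tiny helpers like this next to the factory-based test makes it easy to pin down which stage of the chain produced an unexpected token.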
Functional testing of Solr schema.xml
For functional tests, it is more useful to test the actual Solr search model itself, rather than individual tokenizer chains.
The snippet below shows how schema.xml can be loaded and an analysis run on a piece of input text against a dummy field, to examine the resulting index-time and query-time tokens:

import java.io.FileReader;
import java.io.StringReader;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.IndexSchema;
import org.xml.sax.InputSource;

public static void main(String[] args) {
    try {
        // Load solrconfig.xml and schema.xml straight from the conf directory.
        InputSource solrCfgIs = new InputSource(new FileReader("solr/conf/solrconfig.xml"));
        SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);
        InputSource solrSchemaIs = new InputSource(new FileReader("solr/conf/schema.xml"));
        IndexSchema solrSchema = new IndexSchema(solrConfig, null, solrSchemaIs);

        // List every field type defined in the schema along with its analyzer.
        Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
        for (Entry<String, FieldType> entry : fieldTypes.entrySet()) {
            FieldType fldType = entry.getValue();
            Analyzer analyzer = fldType.getAnalyzer();
            System.out.println(entry.getKey() + ":" + analyzer.toString());
        }

        String inputText = "Proof of the pudding lies in its eating";
        FieldType fieldTypeText = fieldTypes.get("text_en");

        // Index-time analysis: the tokens as they would be written to the index.
        System.out.println("Indexing analysis:");
        Analyzer analyzer = fieldTypeText.getAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("dummyfield", new StringReader(inputText));
        CharTermAttribute termAttr = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(termAttr.toString());
        }

        // Query-time analysis: how a query on this field type would be tokenized.
        System.out.println("\n\nQuerying analysis:");
        Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
        TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield", new StringReader(inputText));
        CharTermAttribute termAttr2 = (CharTermAttribute) qrytokenStream.getAttribute(CharTermAttribute.class);
        while (qrytokenStream.incrementToken()) {
            System.out.println(termAttr2.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
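The payoff of printing both analyses is that a query term can only match a document if its query-time tokens also appear among the index-time tokens for the same text. That containment check can be expressed as a tiny helper; the token lists below are illustrative placeholders, not actual output of any particular schema:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class TokenOverlapCheck {

    // A query can only hit a document if every analyzed query token
    // also appears among the index-time tokens for the same text.
    static boolean queryMatchesIndex(List<String> indexTokens, List<String> queryTokens) {
        return new HashSet<String>(indexTokens).containsAll(queryTokens);
    }

    public static void main(String[] args) {
        // Illustrative: suppose index-time analysis stemmed "eating" to "eat".
        List<String> indexTokens = Arrays.asList("proof", "pudding", "lie", "eat");

        // A query analyzer that also stems produces "eat" and matches...
        System.out.println(queryMatchesIndex(indexTokens, Arrays.asList("eat")));    // true

        // ...while a query analyzer that skips stemming produces "eating" and misses.
        System.out.println(queryMatchesIndex(indexTokens, Arrays.asList("eating"))); // false
    }
}
```

Asserting this containment between the two token streams in a test catches index/query analyzer mismatches before they reach a live server.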
