lucene4.7 正则查询(RegexpQuery)(八)



lucene4.7 正则查询(RegexpQuery)(八)

从某种方式上来说,正则查询(RegexpQuery)跟通配符查询(WildcardQuery)的功能很相似,因为他们都可以完成一样的工作,但是不同的是正则查询支持更灵活定制细化查询,这一点与通配符的泛化是不一样的,而且正则查询天生支持使用强大的正则表达式的来准确匹配一个或几个term,需要注意的是,使用正则查询的字段最好是不分词的,因为分词的字段可能会导致边界问题,从而使查询失败,得不到任何结果,这一点和WildcardQuery效果是一样的。 

RegexpQuery query=new RegexpQuery(new Term(field, ".*"+searchStr+".*"));
TopDocs s=search.search(query,null100);
System.out.println(s.totalHits);
for(ScoreDoc ss:s.scoreDocs){
    Document docs=search.doc(ss.doc);
    System.out.println("id=>"+docs.get("id")+"   name==>"+docs.get("bookName")+"   author==>"+docs.get("author"));
}

分词的字段都是作为一个单独的term来处理的,来lucene的内部匹配方式,恰恰又是以term作为最小检索单位的,故能检索到结果,这一点需要我们格外注意,在实现我们的业务时,要根据自己的场景来设计出最优的分词策略。

1,如果是在不分词的字段里做模糊检索,优先使用正则查询的方式会比其他的模糊方式性能要快。2,在查询的时候,应该注意分词字段的边界问题。3,在使用OR或AND拼接条件查询时或一些特别复杂的匹配时,也应优先使用正则查询。4,大数据检索时,性能尤为重要,注意应避免使用前置模糊的方式,无论是正则查询还是通配符查询。 
Please read full article fromlucene4.7 正则查询(RegexpQuery)(八)

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts