Lucene and Hyphens | Entelligentsia Blog



Lucene and Hyphens | Entelligentsia Blog
I have used Apache Lucene for providing an intelligent intuitive search for my products many times over.  Well that was till yesterday! No, I did not ditch Lucene.  But what threw a monkey in the wrench was the ‘intuitive’ nature of the product.  Let me explain. Say I have a string ‘abc-def’ to be indexed.  The say Lucene works (well actually the StandardAnalyzer that delegates the tokenizing through StandardTokenizer works) is by splitting ‘abc-def’ into ‘abc’, ‘def’ to index the two words separately.

tandardAnalyzer under its hoods delegates most of the heavy lifting toStandardTokenizer - which to my shock was a generated java file.  The source was inflex; written specifically for a generator unknown to me yesterday – JFlex.  As we speak we are best of buddies.  Alright then out came the power chord and I got down toRTFM of JFlex.  An hour and a few test cases a later I was reasonably confident to perform an invasive surgery on the StandardTokenizerImpl.jflex.  Or so I thought.

http://pastebin.com/vumxg01w , This is my complete Jflex file. corresponding java file is an auto generated one. It is not very helpful for reading. But still if it is of any use to you, her is the linkhttp://pastebin.com/QApDqeNU.
Read full article from Lucene and Hyphens | Entelligentsia Blog

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts