Salmon Run: Tokenizing Text with ICU4j's RuleBasedBreakIterator




Friday, May 09, 2008

Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). Expect to pay around USD 120ks. Custom options can set you back another few 10,000 dollars. For details, go to Jaguar Sales or contact xj-6@jaguar.com.

...split it up into sentences, and then into word tokens. I ended up using the RuleBasedBreakIterator from the ICU4j project. Before that, however, I tried and discarded various other alternatives, which I briefly describe below.

I first considered splitting the sentence up by whitespace. However, I would not have been able to capture the ... as a single token.

I then considered using a custom set of punctuation characters and whitespace and splitting by that. This is even worse, since words such as 10,000, U.S. and XJ-6 are now treated as multiple tokens.

I next considered using the word instance of the java.text.BreakIterator.
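The discarded alternatives can be sketched roughly as follows, using only the JDK. This is a minimal illustration and not the post's own code: the class name TokenizerSketch and its method names are hypothetical, and the punctuation set (\p{Punct}) is an assumption standing in for the "custom set" mentioned in the text. The ICU4j RuleBasedBreakIterator itself is not shown, since it needs the external ICU4j library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class TokenizerSketch {

    // Alternative 1: split on runs of whitespace.
    // Punctuation stays glued to the neighbouring word.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Alternative 2: split on whitespace plus ASCII punctuation.
    // \p{Punct} is an assumed stand-in for a custom punctuation set.
    static List<String> splitOnPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\p{Punct}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip the empty leading piece, if any
            }
        }
        return tokens;
    }

    // Alternative 3: the word instance of java.text.BreakIterator.
    // Every span between two boundaries is a candidate token;
    // whitespace-only spans are dropped here.
    static List<String> wordBreakTokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String t = text.substring(start, end);
            if (!t.trim().isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Custom options can set you back another few 10,000 dollars.";
        System.out.println(splitOnWhitespace(sample));
        System.out.println(splitOnPunctuation(sample));
        System.out.println(wordBreakTokens(sample));
    }
}
```

With whitespace splitting, a trailing period stays attached to the last word; with punctuation splitting, 10,000 comes back as 10 and 000. How the JDK word iterator treats tokens such as XJ-6 or the e-mail address depends on its built-in rules, which is the kind of behavior that customizable rules are meant to address.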

Read full article from Salmon Run: Tokenizing Text with ICU4j's RuleBasedBreakIterator

