Salmon Run: Tokenizing Text with ICU4j's RuleBasedBreakIterator
Friday, May 09, 2008

Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). Expect to pay around USD 120ks. Custom options can set you back another few 10,000 dollars. For details, go to Jaguar Sales or contact xj-6@jaguar.com.

...split it up into sentences, and then into word tokens. I ended up using the RuleBasedBreakIterator from the ICU4j project. Before that, however, I tried and discarded various other alternatives, which I briefly describe below.

I first considered splitting the sentence up by whitespace. However, I would not have been able to capture the ... as a single token. I then considered splitting on a custom set of punctuation characters plus whitespace. This is even worse, since words such as 10,000, U.S. and XJ-6 are now treated as multiple tokens. I next considered using the word instance of the java.text.BreakIterator.
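
To make the comparison concrete, here is a minimal sketch of two of the alternatives mentioned above: splitting on whitespace, and walking the boundaries reported by a break iterator. It is not the code from the original post. The class and method names (TokenizerSketch, splitOnWhitespace, segments, sentences, words) are made up for illustration, and it uses the JDK's java.text.BreakIterator instead of ICU4j's RuleBasedBreakIterator so that it runs without extra dependencies. Where exactly the boundaries fall for tokens such as U.S., XJ-6 or 10,000 depends on the iterator's rules and the JDK version, so no expected output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: names are hypothetical, not from the original post.
public class TokenizerSketch {

    // Naive baseline: split on runs of whitespace.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Return every segment between consecutive boundaries reported by the iterator.
    static List<String> segments(BreakIterator iterator, String text) {
        List<String> result = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    // Sentence segments; each one keeps its trailing whitespace.
    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.US), text);
    }

    // Word segments with whitespace-only segments dropped; punctuation is kept.
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        for (String segment : segments(BreakIterator.getWordInstance(Locale.US), text)) {
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). "
                + "Expect to pay around USD 120ks. "
                + "Custom options can set you back another few 10,000 dollars. "
                + "For details, go to Jaguar Sales or contact xj-6@jaguar.com.";

        System.out.println("Whitespace split: " + splitOnWhitespace(text));
        for (String sentence : sentences(text)) {
            System.out.println("Sentence: " + sentence.trim());
            System.out.println("  Words: " + words(sentence));
        }
    }
}
```

Running this against the sample text shows where the default rules disagree with what you want. As far as I know, ICU4j's com.ibm.icu.text.BreakIterator offers the same getSentenceInstance and getWordInstance factory methods, and its RuleBasedBreakIterator can additionally be built from a custom rules string, which is what makes it possible to adjust the boundaries for cases like these.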