A good POS tagger in about 200 lines of Python « Computational Linguistics
Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.

You should use two tags of history, and features derived from the Brown word clusters distributed here. If you only need the tagger to work on carefully edited text, you should use case-sensitive features, but if you want a more robust tagger you should avoid them because they’ll make you over-fit to the conventions of your training domain. Instead, features that ask “how frequently is this word title-cased, in a large sample from the web?” work well. Then you can lower-case your comparatively tiny training corpus.

For efficiency, you should figure out which frequent words in your training data have unambiguous tags, …
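To make two of these recommendations concrete, here is a minimal sketch (not the article's own code) of a dictionary of frequent, unambiguous words and a feature extractor that uses two tags of history over lower-cased words. The function names, the `min_freq` and `min_ratio` thresholds, the `-START-`/`-END-` padding symbols, and the feature-name strings are all assumptions made for illustration. Brown cluster and title-case frequency features are left out because they depend on external data.

```python
from collections import Counter, defaultdict


def build_tag_dict(tagged_sentences, min_freq=20, min_ratio=0.97):
    """Map frequent words that are (almost) always tagged one way to that tag.

    min_freq and min_ratio are illustrative thresholds, not values from the text.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            # Lower-case the training words, as the text recommends.
            counts[word.lower()][tag] += 1

    tag_dict = {}
    for word, tag_freqs in counts.items():
        tag, freq = tag_freqs.most_common(1)[0]
        total = sum(tag_freqs.values())
        if total >= min_freq and freq / total >= min_ratio:
            tag_dict[word] = tag
    return tag_dict


def extract_features(words, i, prev_tag, prev2_tag):
    """Return a set of feature strings for words[i].

    prev_tag and prev2_tag are the two previously assigned tags
    (the "two tags of history").
    """
    word = words[i].lower()
    prev_word = words[i - 1].lower() if i > 0 else "-START-"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "-END-"
    return {
        "bias",
        "w=" + word,
        "suffix3=" + word[-3:],
        "t-1=" + prev_tag,
        "t-2=" + prev2_tag,
        "t-2,t-1=" + prev2_tag + "," + prev_tag,
        "t-1,w=" + prev_tag + "," + word,
        "w-1=" + prev_word,
        "w+1=" + next_word,
    }
```

At tagging time, a word found in the dictionary can be assigned its stored tag directly, so the statistical model is only consulted for the remaining words. The resulting feature set could be fed to any linear classifier.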