Saturday, November 14, 2009

Some time back, a colleague pointed me to The Easy Way to Extract Useful Text from Arbitrary HTML. If you haven't read it already, I suggest that you do; the simplicity of the approach will probably blow you away. The approach has two steps:

1. For each line of input HTML, compute the text density and discard the lines whose density falls below some predefined threshold. This has the effect of removing heavily marked-up text.
2. On the text that remains, train and apply a neural network to remove boilerplate text such as disclaimers.
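The first step can be sketched in a few lines of Python. This is only a rough illustration, not the code from the original article: it assumes that "text density" means the fraction of a line's characters left after stripping tags, and the function names, the regex-based tag stripping, and the 0.5 threshold are all made up for the example. The neural network step is not shown.

```python
import re
from typing import List

# Naive tag matcher (an assumption for this sketch; a real HTML parser is more robust).
TAG_RE = re.compile(r"<[^>]*>")


def text_density(line: str) -> float:
    """Fraction of the line's characters that remain after removing tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def filter_by_density(html: str, threshold: float = 0.5) -> List[str]:
    """Keep the tag-free text of every line whose density meets the threshold."""
    kept = []
    for line in html.splitlines():
        if text_density(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept


if __name__ == "__main__":
    sample = (
        '<div class="nav"><a href="/">Home</a></div>\n'
        "<p>This is a long paragraph of real article text that should survive.</p>"
    )
    for kept_line in filter_by_density(sample):
        print(kept_line)
```

On the sample above, the navigation line is mostly markup and gets dropped, while the paragraph line is mostly text and is kept. The threshold would need tuning against real pages.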
Read full article from Salmon Run: Extracting Useful Text from HTML