All About Programming: boilerpipe, or how to extract information from web pages with minimal fuss

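obtaining the page title
// fetch the page and parse it into a boilerpipe TextDocument (here `url` is the page address as a String)
final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
System.out.println("Page title: " + doc.getTitle());
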
obtaining images
URL url = new URL("http://www.spiegel.de/wissenschaft/natur/0,1518,789176,00.html");
// choose from a set of useful BoilerpipeExtractors...
final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
final ImageExtractor ie = ImageExtractor.INSTANCE;
List<Image> imgUrls = ie.process(url, extractor);
// automatically sorts them by decreasing area, i.e. most probable true positives come first
Collections.sort(imgUrls);

for (Image img : imgUrls) {
    System.out.println("* " + img);
}
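
The Collections.sort call above works because the extracted images define their own ordering. Here is a minimal, library-free sketch of that idea, assuming the ordering is "larger area first". The Img class and the sortedByArea helper are hypothetical stand-ins made up for illustration; they are not boilerpipe's own Image class, whose tie-breaking rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ImageAreaSortSketch {

    // Hypothetical stand-in for an extracted image (not boilerpipe's Image class).
    static final class Img implements Comparable<Img> {
        final String src;
        final int width;
        final int height;

        Img(String src, int width, int height) {
            this.src = src;
            this.width = width;
            this.height = height;
        }

        int area() {
            return width * height;
        }

        @Override
        public int compareTo(Img other) {
            // Larger area sorts first; ties fall back to the source string (an assumption).
            int byArea = Integer.compare(other.area(), area());
            return byArea != 0 ? byArea : src.compareTo(other.src);
        }

        @Override
        public String toString() {
            return src + " (" + width + "x" + height + ")";
        }
    }

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Img> sortedByArea(List<Img> images) {
        List<Img> copy = new ArrayList<>(images);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Img> images = Arrays.asList(
                new Img("icon.png", 16, 16),
                new Img("photo.jpg", 640, 480),
                new Img("banner.gif", 468, 60));
        for (Img img : sortedByArea(images)) {
            System.out.println("* " + img);
        }
    }
}
```

Sorting by area is a heuristic: large images are more likely to belong to the article itself, while small ones tend to be icons or tracking pixels.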

Read full article from boilerpipe, or how to extract information from web pages with minimal fuss
References:
https://code.google.com/p/boilerpipe/source/browse/trunk/boilerpipe-core/src/demo/de/l3s/boilerpipe/demo/ImageExtractorDemo.java?r=159
