boilerpipe, or how to extract information from web pages with minimal fuss



// url is a String holding the address of the page to fetch
final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
System.out.println("Page title: " + doc.getTitle());
Obtaining images
URL url = new URL("http://www.spiegel.de/wissenschaft/natur/0,1518,789176,00.html");
// choose from a set of useful BoilerpipeExtractors...
final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
final ImageExtractor ie = ImageExtractor.INSTANCE;
List<Image> imgUrls = ie.process(url, extractor);
// automatically sorts them by decreasing area, i.e. most probable true positives come first
Collections.sort(imgUrls);

for (Image img : imgUrls) {
    System.out.println("* " + img);
}
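
The comment in the snippet above says the images are sorted by decreasing area, so the largest (and most likely relevant) images come first. The following is a minimal, self-contained sketch of how such an ordering can be expressed with Comparable. The Img class, its fields, and the sortedByArea helper are hypothetical stand-ins for illustration and are not boilerpipe's own Image implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImageAreaSortSketch {

    // Hypothetical stand-in for an extracted image; NOT boilerpipe's Image class.
    static final class Img implements Comparable<Img> {
        final String src;
        final int width;
        final int height;

        Img(String src, int width, int height) {
            this.src = src;
            this.width = width;
            this.height = height;
        }

        long area() {
            // Use long to avoid int overflow for very large dimensions.
            return (long) width * height;
        }

        @Override
        public int compareTo(Img other) {
            // Reversed operands: a larger area sorts earlier (decreasing order).
            return Long.compare(other.area(), this.area());
        }

        @Override
        public String toString() {
            return src + " (" + width + "x" + height + ")";
        }
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<Img> sortedByArea(List<Img> images) {
        List<Img> copy = new ArrayList<>(images);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Img> images = new ArrayList<>();
        images.add(new Img("icon.png", 16, 16));
        images.add(new Img("photo.jpg", 640, 480));
        images.add(new Img("thumb.jpg", 100, 100));

        for (Img img : sortedByArea(images)) {
            System.out.println("* " + img);
        }
    }
}
```

With an ordering like this, Collections.sort(list) needs no explicit comparator, which matches the way the demo calls it on the list returned by ImageExtractor.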

References:
https://code.google.com/p/boilerpipe/source/browse/trunk/boilerpipe-core/src/demo/de/l3s/boilerpipe/demo/ImageExtractorDemo.java?r=159
