Simple Web Crawler with crawler4j | Maduranga's Blogs
The typical approach of a web crawler is this: a few seed URLs are added first, then the crawler repeatedly visits the links in the list, adds the new links it finds back to the list, and so on.
Crawler4j is a Java library that greatly simplifies the process of creating a web crawler.
In your crawler class you have to override two basic methods (a minimal sketch follows the list):
- shouldVisit - called before a given URL is fetched, to decide whether it should be visited or not.
- visit - called when the content of a given URL has been downloaded successfully. From this method you can easily access the URL and the content of the page.
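Here is a minimal sketch of such a crawler class, assuming crawler4j 4.x (where shouldVisit also receives the referring page; in earlier 3.x releases it only took the WebURL). The example.com domain filter and the file-extension filter are placeholders you would adapt to your own crawl:

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip URLs that point at static resources rather than pages.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Placeholder policy: stay inside example.com and skip binary resources.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called only after the page has been downloaded successfully.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```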
From the controller class you can control the number of crawler threads, the maximum number of pages to fetch, the maximum depth of crawling, and proxy settings if needed. This is also where you add the seeds of the crawl. Seeds are the initial list of URLs; as pages are visited, all the links found on them are added to the list, so the list grows as the crawl proceeds.
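A minimal controller might look like the following sketch. The storage folder, seed URL, limits, and thread count are placeholder values, and the commented-out lines show where the proxy settings would go:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // folder for intermediate crawl data
        config.setMaxPagesToFetch(1000);                // maximum number of pages to visit
        config.setMaxDepthOfCrawling(3);                // maximum depth of crawling
        // config.setProxyHost("proxy.example.com");    // proxy settings, if needed
        // config.setProxyPort(8080);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seeds: the initial list of URLs the crawl starts from.
        controller.addSeed("https://www.example.com/");

        int numberOfCrawlers = 7; // number of concurrent crawler threads
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```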
Now you can start crawling. If you want to access any details of the crawled pages, this is easily done in the visit method: the Page object passed to it contains all the details of the web page, including its HTML. If you want to extract any details from a page, this is the place to do it.
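For example, a visit override along these lines (a sketch; the HtmlParseData cast, from edu.uci.ics.crawler4j.parser, only applies when the fetched page is HTML) can pull out the raw HTML, the plain text, and the outgoing links:

```java
@Override
public void visit(Page page) {
    // The Page object carries everything crawler4j downloaded for this URL.
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData data = (HtmlParseData) page.getParseData();
        String html = data.getHtml(); // raw HTML of the page
        String text = data.getText(); // plain text with the tags stripped
        int links = data.getOutgoingUrls().size();
        System.out.printf("%s: %d chars of html, %d chars of text, %d outgoing links%n",
                page.getWebURL().getURL(), html.length(), text.length(), links);
    }
}
```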
Read full article from Simple Web Crawler with crawler4j | Maduranga's Blogs