algorithm - what is going on inside of Nutch 2? - Stack Overflow
Short Answer

In short, the Nutch developers have built a web crawler designed to crawl the web very efficiently across many machines (though it can also be run on a single computer). You can start crawling the web without needing to know how they implemented it; the page you reference describes that implementation.

Technology behind it

They make use of Hadoop, an open-source Java project designed along the same lines as MapReduce, the framework Google uses to process and index the web at scale. I've attended a few lectures on MapReduce/Hadoop, and unfortunately I don't know if anyone at this time can explain it in a way that is both complete and easy to understand (the two goals tend to pull in opposite directions). Take a look at the Wikipedia page for MapReduce. The basic idea is that you submit a job to the master node; the master breaks the work into pieces and sends (maps) them to the various worker nodes (other machines or threads), which each perform their assigned sub-task and hand their results back to be combined (reduced) into the final answer.
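To make the map/reduce split concrete, here is the stock Hadoop word-count example, a minimal sketch rather than Nutch's actual crawl code (Nutch expresses each crawl phase, such as fetching a partitioned fetch list, as jobs of this same shape). The mapper emits a (word, 1) pair for every token it sees; the framework groups pairs by key across all workers; the reducer sums each group.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each worker, one call per input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // emit (word, 1)
      }
    }
  }

  // Reduce phase: receives all values emitted for one key and combines them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // emit (word, total count)
    }
  }

  // The "job" you submit to the master node; Hadoop handles the distribution.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each worker
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point of the exercise is that the code above never mentions machines, threads, or network transfers: you describe only the per-record map step and the per-key reduce step, and Hadoop decides how to spread the work across however many nodes you give it. That is what lets the same Nutch code run on one laptop or a whole cluster.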