algorithm - what is going on inside of Nutch 2? - Stack Overflow




Short Answer

In short, they have developed a web crawler designed to crawl the web very efficiently from a multi-machine environment (though it can also be run on a single computer). You can start crawling the web without needing to know how they implemented it. The page you reference describes how it is implemented.

Technology behind it

They make use of Hadoop, an open-source Java project that implements the MapReduce programming model. MapReduce is the model Google introduced for processing web-scale data. I've attended a few lectures on MapReduce/Hadoop, and unfortunately, I don't know if anyone at this time can explain it in a way that is both complete and easy to understand (those two goals are kind of opposites). Take a look at the Wikipedia page for MapReduce. The basic idea is to send a job to the Master Node; the Master breaks the work up into pieces and sends them (maps them) to the various Worker Nodes (other computers or threads), which perform their assigned sub-task,

Read full article from algorithm - what is going on inside of Nutch 2? - Stack Overflow
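
The master/worker flow described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Nutch or Hadoop is implemented: it uses threads on one machine as the "worker nodes", and the function names (`split_work`, `map_task`, `reduce_results`, `run_job`) and the example task (counting URLs per host) are made up for this sketch. The final combine step is the "reduce" half of the standard MapReduce model, which the truncated excerpt does not reach.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def split_work(items, pieces):
    """Master step: break the job into roughly equal pieces."""
    return [items[i::pieces] for i in range(pieces)]


def map_task(urls):
    """Worker step: count URLs per host for one piece of the job."""
    return Counter(urlparse(url).netloc for url in urls)


def reduce_results(partials):
    """Reduce step: merge the per-worker counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(urls, workers=3):
    """Hypothetical master: split, hand pieces to workers, combine."""
    pieces = split_work(urls, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, pieces))
    return reduce_results(partials)


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/x",
    ]
    print(run_job(sample))
```

In a real cluster the pieces would be shipped to other machines rather than to threads, and the framework would also handle scheduling, retries, and moving data between the map and reduce phases; none of that is modelled here.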

