Nodejs,不一样的爬虫实践--花满楼



Nodejs,不一样的爬虫实践--花满楼

做前端以来,对我成长帮助最大的恐怕也就是各位大侠们的博客了,,慢慢的也在我心里种下了颗种子:我也要写博客!哪怕我文笔差,技术菜,难以望其项背,我也要追随大神们的脚步,写写博客,处处留香。摸爬滚打,终是成长;学习分享,与君共勉!小前端初学Nodejs,搭了个简单的博客,捉襟见肘,望大侠路过指导!好了,此处有广告之嫌,进入正题。

关于Nodejs的爬虫程序,百度一大把,是的,我也是百度到的,然后到github上看了看cheerio模块;乍一看,这不就是Jquery嘛,没想到Jquery都能牛到后端操作DOM了啊,恩。。。可以先这么认为和理解吧,因为cheerio确实可以像jquery操作DOM一样操作从远程爬取过来的页面;顿时邪念四起:我要把我喜欢的博客全抓到我的博客站点上去、我要不翻墙而是把谷歌抓过来访问。。。然后就有了我博客上的"爬呀爬"版块,都是爬取的国内顶尖的前端团队呕心沥血写的博客,,保留了作者信息和版权,当然,如果他们看到了,骂我一顿,叫我删掉,我肯定会屁颠屁颠删掉的;

还有就是怎么通过我的站点访问谷歌了,比如:http://www.famanoder.com?url=http://google.com的请求,我res.send(googleBody);恩 ,这样是可以的,可是问题在于:相对路径的资源返回404了,因为相对路径域名取的是我的站点,而我的站点没有对应的目录和资源,自然404了;这时cheerio可就很有用了,在res.send之前可以对body进行些DOM操作啊,貌似有点复杂啊,我正则还不行。。。为了让爬取过来的谷歌可以点击,链接没有404,还需要做一步,就是通过cheerio将所有相对路径补上http://www.famanoder.com?url=http://google.com,绝对路径补上http://www.famanoder.com?url=来将爬取的目标站点所有资源都通过我的博客站点请求;


Read full article from Nodejs,不一样的爬虫实践--花满楼


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts