Designing a Search Engine: Design Patterns for Crawlers | Alejandro Moreno López | Pulse | LinkedIn




One of the things I've been really passionate about for the last few years is crawling technology and how search engines work. In fact, it was probably in my third year at university, while specialising in Artificial Intelligence, that I first became interested in the field.

It was there that I wrote a small engine in Python that actively crawled your hard disk and indexed all the information in a database. The idea was that by the time the user needed to search for something, all the data was already there, ready to be displayed much more quickly than the technologies of the time could manage.
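A minimal sketch of that idea, not the original engine: it walks a directory tree, stores file names and paths in SQLite, and answers searches from the index instead of the disk. The function names `build_index` and `search`, the `files` table and the choice of SQLite are all assumptions made for illustration.

```python
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Walk `root` and record every file's name and full path in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT, path TEXT PRIMARY KEY)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Re-crawling the same path overwrites the old row.
            conn.execute(
                "INSERT OR REPLACE INTO files (name, path) VALUES (?, ?)",
                (name, full_path),
            )
    conn.commit()
    return conn


def search(conn, term):
    """Return the paths of indexed files whose name contains `term`."""
    rows = conn.execute(
        "SELECT path FROM files WHERE name LIKE ? ORDER BY path",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    index = build_index(os.getcwd())  # crawl once, up front
    for path in search(index, "report"):  # later lookups only hit the index
        print(path)
```

The crawl pays the cost of touching the disk once; every later search is a database query. A real tool would also index file contents and keep the index fresh as files change, which this sketch leaves out.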

Time passed, I passed the exam for that class, MacOS did something similar which is simply awesome (try searching for a file on your computer nowadays. Ha! Beat that), and then my interests moved to the internet world… but I always kept an eye on crawling techniques and never lost my original passion for them. That's when I started CruiseHunter, a group of algorithms that crawl the web, indexing the best offers and prices for… yes, cruises.

In the beginning CruiseHunter was coded in Ruby, a language I found quite nice for dealing with XML/HTML files and all the problems you can run into when crawling information from a site. Some time later, the project is still alive, more than ever I'd say, but it is now undergoing a huge rewrite using Symfony, Drupal and proper software design principles.

My time at Capgemini has basically changed my way of seeing things, and I can now say that the software I write is much more maintainable... I'd say it is even beautiful.


Read full article from Designing a Search Engine: Design Patterns for Crawlers | Alejandro Moreno López | Pulse | LinkedIn

