What is the data structure for search engine? - Quora




The term "search engine" is broad and covers at least these components:
1. Crawler: To collect new web pages from the net
2. Index: To ensure super-quick retrieval of the required web pages
3. Query Processor: To provide an easy interface for users to query

Each of these can be treated as an individual college-level project, though freely available tools and frameworks let you build each of them with very little effort.

Crawlers typically use a queue to hold the web pages they have yet to visit, and usually a Bloom filter to mark the pages that have already been read. A hash set is an alternative: it is exact, whereas a Bloom filter can occasionally report an unseen page as seen, but it needs more memory. However, to avoid situations like spider traps, some kind of priority queue is used instead of a plain queue. There is a lot of research on how to adjust the prioritization rules so that the pages your system prefers are crawled first.
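The queue-plus-Bloom-filter idea can be sketched roughly as follows. This is only a toy illustration: the `BloomFilter` class, the `crawl` function, the `TOY_WEB` link graph and `toy_fetch_links` are all made-up names for this example, and the in-memory graph stands in for real HTTP fetching and link extraction.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def crawl(seed_url, fetch_links, max_pages=100):
    """Breadth-first crawl: a FIFO queue is the frontier, a Bloom filter marks seen URLs."""
    frontier = deque([seed_url])
    seen = BloomFilter()
    seen.add(seed_url)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip pages already queued or read
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for the real web.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["a"],
}


def toy_fetch_links(url):
    """Assumed helper: returns the outgoing links of a page."""
    return TOY_WEB.get(url, [])


print(crawl("a", toy_fetch_links))
```

Swapping the `deque` for a heap (for example Python's `heapq`) keyed on some page score would be one way to get the priority-queue behaviour mentioned above; the scoring rule itself is left open here.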

The query processor can be as simple as a string splitter or a regex library, or as complex as a complete Natural Language Understanding system.
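At the simple end of that range, a query processor might look something like this. It is a minimal sketch, assuming that lowercasing and splitting on non-alphanumeric characters is good enough; `parse_query` and `TOKEN_PATTERN` are names invented for the example.

```python
import re

# Assumption: a "term" is any run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def parse_query(query):
    """Turn a raw user query into a list of lowercase search terms."""
    return TOKEN_PATTERN.findall(query.lower())


print(parse_query("Data-Structure for Search Engines?"))
```

A real system would likely add steps such as stop-word removal or stemming, which are left out here.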

The index is the core. The most common style of indexing (and the data structure used) is called an inverted index. It is a hashmap-like data structure that maps each word to the documents or web pages that contain it.
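As a rough illustration, an inverted index can be built with an ordinary dictionary from words to sets of document ids. The function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are assumptions made for this sketch; production indexes also store positions, frequencies and compressed posting lists, which are omitted here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents keyed by document id.
docs = {
    1: "Crawlers collect web pages",
    2: "An inverted index maps words to pages",
    3: "Bloom filters mark visited pages",
}

index = build_inverted_index(docs)
print(search(index, "inverted pages"))
```

Lookup is a dictionary access per query term followed by a set intersection, which is why retrieval stays fast even when the collection is large.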

A detailed explanation is given in my two-year-old answer:
Information Retrieval: What is inverted index?

And if you want to read more,
How can one build a search engine for some specific search?

Read full article from What is the data structure for search engine? - Quora

