All About Programming: What is the data structure for search engine?

What is the data structure for search engine? - Quora

The term "search engine" is broad and covers at least these components:
1. Crawler: To collect new webpages from the net
2. Index: To ensure super-quick retrieval of the required webpages
3. Query Processor: To provide an easy interface for the user to query

Each of these can be seen as an individual college-level project, though there are freely available tools and frameworks that let you build them with very little effort.

Crawlers typically use queues to hold the webpages they have yet to visit, and usually a Bloom filter to mark the pages that have already been read. An alternative to the Bloom filter is a hash set. However, to avoid situations like spider traps, some kind of priority queue is used instead of a plain queue. There is a lot of research on how to modify the prioritization rules so that the pages your system prefers are crawled first.
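To make the queue-plus-visited-set idea concrete, here is a minimal sketch in Python. It assumes a hypothetical get_links function that returns the outgoing links of a page, and it uses a toy in-memory link graph instead of the real web. It uses a plain hash set rather than a Bloom filter and a plain FIFO queue rather than a priority queue, so it illustrates only the simplest variant described above.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl starting at seed.

    get_links is a hypothetical callable: URL -> iterable of outgoing URLs.
    """
    frontier = deque([seed])  # pages still to visit
    seen = {seed}             # hash set of pages already queued
    order = []                # pages in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # skip pages we have already queued
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web (illustrative data only).
toy_web = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
visited = crawl("a", lambda url: toy_web.get(url, []))
```

The max_pages cap is a crude guard against crawling forever; a real crawler would more likely score URLs and pull them from a priority queue.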

The query processor can be as simple as a string splitter or a regex library, or as complex as a complete Natural Language Understanding system.
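At the simple end of that range, a query processor might be little more than the following sketch. The function name parse_query and the choice to lowercase and keep only letters and digits are assumptions made for illustration.

```python
import re

def parse_query(query):
    """Lowercase a query and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())

tokens = parse_query("What is an Inverted-Index?")
```

Here tokens holds the individual words of the query, ready to be looked up in the index.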

The index is the core. The most common style of indexing (and the data structure used) is called an inverted index. It is a hashmap-like data structure that maps each word to the documents or web pages that contain it.
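A minimal sketch of an inverted index, assuming documents are given as a dictionary from a document ID to its text. The names build_index and search, and the rule that a multi-word query matches only documents containing every word, are illustrative choices rather than anything prescribed by the answer.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect posting sets
    return result

# Illustrative documents only.
docs = {
    1: "Search engines use an inverted index",
    2: "A crawler collects web pages",
    3: "The index maps words to pages",
}
index = build_index(docs)
hits = search(index, "index pages")
```

Looking up a word is a single hash lookup, which is why retrieval stays fast even when the document collection is large.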

A detailed explanation is given in my two-year-old answer:
Information Retrieval: What is inverted index?

And if you want to read more,
How can one build a search engine for some specific search?

Read full article from What is the data structure for search engine? - Quora
