What is the data structure for search engine? - Quora
The term "search engine" is broad and covers at least these components:
1. Crawler: To collect new webpages from the net
2. Index: To ensure super-quick retrieval of required webpages
3. Query Processor: To provide an easy interface for the user to query
Each of these can be seen as an individual college-level project, though there are tools and freely available frameworks to build these with just a click.
Crawlers typically use queues to hold the webpages they have yet to visit, and usually a Bloom filter to mark the pages that have already been read. An alternative is a hash set, which is exact but needs more memory, whereas a Bloom filter is compact but can occasionally report an unseen page as seen. However, to avoid situations like spider traps, some kind of priority queue is used. There is a lot of research on how to modify the prioritization rules so that the pages your system prefers are crawled.
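As a rough illustration of the paragraph above, here is a minimal sketch of a crawl frontier in Python. It uses a plain set for the "already seen" check (the hash set alternative; a Bloom filter would replace it when memory matters) and a heap as the priority queue. The class name Frontier and the url_depth rule are assumptions made up for this example: ranking shallow URLs first is just one simple way to push the endlessly deep links of a spider trap to the back of the queue.

```python
import heapq
from itertools import count


class Frontier:
    """URLs still to visit (priority queue) plus URLs already seen (set)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal priorities keep insertion order

    def add(self, url, priority=0):
        """Queue url unless it was seen before. Lower priority value is served first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def pop(self):
        """Return the queued URL with the lowest priority value."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def url_depth(url):
    """Hypothetical priority rule: count path segments, so deep URLs wait longer."""
    path = url.split("://", 1)[-1]
    return path.rstrip("/").count("/")


frontier = Frontier()
for link in [
    "http://example.com/a/b/c/d",
    "http://example.com/",
    "http://example.com/a",
    "http://example.com/",  # duplicate, ignored
]:
    frontier.add(link, priority=url_depth(link))

first = frontier.pop()  # the shallowest URL comes out first
```

A real crawler would combine several signals (page importance, freshness, politeness per host) into the priority value; depth alone is only the simplest guard.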
The query processor can be as simple as a string splitter or a regex library, or as complex as a complete Natural Language Understanding system.
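At the simple end of that range, a query processor can be a few lines of Python. The sketch below is only an illustration; the function name parse_query and the choice to keep just lowercase letters and digits are assumptions for this example, not a standard.

```python
import re


def parse_query(query):
    """Lowercase the query and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


tokens = parse_query("What is an Inverted-Index?")
```

Here tokens holds the individual words with punctuation and case removed, ready to be looked up one by one in the index.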
The index is the core. The most common style of indexing (and the data structure used) is called an inverted index. It is a hashmap-like data structure that directs you from a word to the documents or web pages that contain it.
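To make the idea concrete, here is a minimal sketch of an inverted index in Python: a dictionary from each word to the set of document ids containing it. The function names (tokenize, build_index, search) and the three sample documents are made up for illustration. Production indexes also store positions and frequencies and compress their posting lists, none of which is shown here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Sample documents, invented for this example
docs = {
    1: "Crawlers collect web pages",
    2: "An inverted index maps words to pages",
    3: "Bloom filters mark visited pages",
}
index = build_index(docs)
hits = search(index, "inverted pages")
```

Looking up a word is a single dictionary access, which is why retrieval stays fast no matter how many documents were indexed; a multi-word query is just an intersection of the sets for each word.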
A detailed explanation is given in my 2-year-old answer:
Information Retrieval: What is inverted index?
And if you want to read more,
How can one build a search engine for some specific search?