今際の国の呵呵君: [System Design]TypeAhead



今際の国の呵呵君: [System Design]TypeAhead

Scenario: Design a typeahead service which support following functionalities:

  • Be able to collect the data and keep track of the sentences users has typed recently
  • For every word user typed, return the top k hot sentences start with this prefix
Services:

Data Collection Service: collect the data form users and do preprocessing to get frequency of each service

Query Service: Store hot words for every prefix and return the top k hot words for given query request. Let's assume there are 1B active user and each of them do 10 searches daily and average sentence length is 10(every time user type a char we may need to search for the new prefix), QPS = 1B * 10 * 10 / 86400 = 1M, peak QPS = 2 * 1M = 2M. It is a pretty ready heavy service.

Storage:

Where do we get the data?

We can get the data by logging, each time user typed a full sentence and send a request, we will log the data in the disk, like <user_id, sentence, timestamp>. The log information can be stored in distributed file system like GFS, since usually the new generated data every day is really large. Let's use the estimated number above, assume each log entry will be 20 bytes the data generated every day will be 1B * 10 * 20 Bytes =  200G. And if we want to the data for past two weeks, it will be 200G * 14 = 2.8 T.
After we have the log data, we can perform the map reduce to get the <sentence, frequency> pair we  want, since it is a simple key-value pair, we can store it in distributed nosql database like BigTable
It will still be pretty expensive to store this large amount of data, actually we don't want the service to be that accurate, we just want to know the general trend. This is why we need to bring up probabilistic logging, every time we can generate a random number between [0, 1000) and if the number equals to 0, we log this data, so generally we are logging with 1 / 1000 probability. And the data generate every day can be reduced from 200G to 200M.

Read full article from 今際の国の呵呵君: [System Design]TypeAhead


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts