Buttercola: Twitter: Top k frequent words in a string



Buttercola: Twitter: Top k frequent words in a string

http://www.1point3acres.com/bbs/thread-143017-1-1.html

给一组string "a a b b c c beta beta beta alpha"
n = 2;
要求返回一个List<String> 含有的是出现次数最多的n个, 也就是说如果n=1 list中只有出现次数最多的那个也就是beta, 如果是n=2, list当中出现beta, a b c, 因为a b c都出现两次,算是第二出线多的。. from: 1point3acres.com/bbs 
所以就是n是几,就返回几个数吧,按出现频率来。

开始的思路是, hashmap, + array存出现次数,再arrays.sort(),在获取hashmap当中的key, 印度小哥说不好, 你再想想,我马上说用一个int max 来记录出现次数最多的,找到最多的然后再找第二多的。. 1point3acres.com/bbs
小哥说,不错,这个方法处理数据不大的时候是可以得,接着问,能不能用one pass的形式?
想了一会,疑问句:"PriorityQueue?"  小哥说,嗯,说说吧,就是每个出现单词出现的次数都扔进priorityQueue里面,它自动排序嘛-google 1point3acres
小哥说,那你打代码吧
这里还有一个情况是每一个word之间都是有空格的,两个pointer, start end一个数空格 一个不数空格来获取每一个word。

打完之后,发现小哥说了句:"that looks good for me" 心里搜一个口气。 
最后小哥又让我写了几个Unit test。. Waral 鍗氬鏈夋洿澶氭枃绔�,
我就随便考虑了三种请况。  1. s = null || ""  2. s = "a b b" n =1   3. s = "a b b a b c" n =2. Waral 鍗氬鏈夋洿澶氭枃绔�,
然后用assert(result.get(i从0到N)).equals(目标单词)
小哥说不错。
求人品,要求不多,让我去onsite见识一下场面就可以了。
(准备的豆子问题没有考,也没有自我介绍,上来就码,弄题意弄了将近十分钟,一是信号不算天好,二是印度口音懂的人都懂)

Understand the problem:
这个题目比较麻烦的地方是要求输出具有相同词频的所有string. 比如 n = 2
是要求输出第一大和第二大的所有词。对于这个例子,输出就是
beta (first largest)
a b c (second largest)

所以如果只用 HashMap + PriorityQueue的话不太容易找出来所有前 K 大的数,因为有重复 (e.g. a a b b c c)。并且重复的数不是一起读到PriorityQueue里的。

一个方法是先对所有的输入做词频统计,然后建立倒排索引(inverted index), 同时用 min Heap 来统计 top K 的词频。因为倒排索引可以把所有同样词频的数放到一起,而且在PriorityQueue里面只存词频,因为倒排索引里面已经记录了相应词频的String 是哪些了。


Read full article from Buttercola: Twitter: Top k frequent words in a string


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts