由散列表到BitMap的概念与应用(三):面试中的海量数据处理 | Aoho's Blog



由散列表到BitMap的概念与应用(三):面试中的海量数据处理 | Aoho's Blog

在面试软件开发工程师时,经常会遇到海量数据排序和去重的面试题,特别是大数据岗位。

例1:给定a、b两个文件,各存放50亿个url,每个url各占64字节,内存限制是4G,找出a、b文件共同的url?

  • 首先我们最常想到的方法是读取文件a,建立哈希表,然后再读取文件b,遍历文件b中每个url,对于每个遍历,我们都执行查找hash表的操作,若hash表中搜索到了,则说明两文件共有,存入一个集合。
  • 但上述方法有一个明显问题,加载一个文件的数据需要50亿*64bytes = 320G远远大于4G内存,何况我们还需要分配哈希表数据结构所使用的空间,所以不可能一次性把文件中所有数据构建一个整体的hash表。


Read full article from 由散列表到BitMap的概念与应用(三):面试中的海量数据处理 | Aoho's Blog


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts