Comparing Large Text Files with Hadoop | Bo Wang's Soliloquise




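Recently at work I keep running into problems like this: given two SequenceFiles, find the intersection or the difference of their keys; or, given a map file that contains keys, find every entry in a large file that corresponds to one of those keys. Problems like these have elegant solutions that take advantage of properties Hadoop already provides.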

Before getting to that, a few facts need to be understood:

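1. SequenceFile is Hadoop's built-in key/value file format, typically stored on HDFS.

2. A SequenceFile written by a job is usually split into many parts, one per reducer.

3. Within each part the records are sorted by key; the parts are independent of one another, so the keys are not ordered globally.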
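To make the three facts concrete, here is a minimal Python sketch that imitates how a job with several reducers lays out its output. It is only a simulation, not Hadoop code: `partition` and `write_parts` are hypothetical names, and `zlib.crc32` stands in for the key hash that Hadoop's default `HashPartitioner` would compute.

```python
import zlib


def partition(key: str, num_parts: int) -> int:
    # Assumption: crc32 is a stand-in for the key's hash code.
    # The same key always maps to the same part number.
    return zlib.crc32(key.encode("utf-8")) % num_parts


def write_parts(records, num_parts):
    """Distribute (key, value) records into num_parts lists, each sorted by key."""
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[partition(key, num_parts)].append((key, value))
    # Each part is sorted on its own; nothing orders the parts relative to each other.
    return [sorted(part, key=lambda kv: kv[0]) for part in parts]


records = [("pear", 1), ("apple", 2), ("kiwi", 3), ("fig", 4), ("plum", 5)]
parts = write_parts(records, num_parts=3)
for index, part in enumerate(parts):
    print(f"part-{index:05d}", [key for key, _ in part])
```

Reading the parts one after another generally does not yield the keys in sorted order, which is exactly the "locally ordered, globally unordered" property described in fact 3.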

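The intuitive approach is to sort both files and then compare them with a merge. But, as noted above, a SequenceFile is ordered locally and unordered globally. Fully sorting both large files and then comparing them line by line would be inefficient, and it would also defeat the purpose of using Hadoop. So can we do the same thing by exploiting the local ordering? Yes!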
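This excerpt stops before the actual solution, so the following is only one possible sketch of how local ordering could be used, not necessarily the author's method. It assumes that both files were written with the same partitioner and the same number of reducers, so that part i of one file can only share keys with part i of the other, and that keys are unique within each file. Under those assumptions each pair of parts can be compared independently with an ordinary two-pointer merge. The names `merge_compare` and `compare_by_part` are hypothetical, and parts are modeled as plain sorted lists of keys.

```python
def merge_compare(left_keys, right_keys):
    """Two-pointer merge over two sorted lists of unique keys.

    Returns (common, only_left, only_right).
    """
    i = j = 0
    common, only_left, only_right = [], [], []
    while i < len(left_keys) and j < len(right_keys):
        a, b = left_keys[i], right_keys[j]
        if a == b:
            common.append(a)
            i += 1
            j += 1
        elif a < b:
            only_left.append(a)
            i += 1
        else:
            only_right.append(b)
            j += 1
    # Whatever remains on either side has no counterpart on the other side.
    only_left.extend(left_keys[i:])
    only_right.extend(right_keys[j:])
    return common, only_left, only_right


def compare_by_part(left_parts, right_parts):
    """Compare part i of one file with part i of the other and collect the results.

    Assumption: both files use the same partitioner and the same part count.
    """
    if len(left_parts) != len(right_parts):
        raise ValueError("both files must have the same number of parts")
    common, only_left, only_right = [], [], []
    for left_keys, right_keys in zip(left_parts, right_parts):
        c, l, r = merge_compare(left_keys, right_keys)
        common.extend(c)
        only_left.extend(l)
        only_right.extend(r)
    return common, only_left, only_right


left_parts = [["a", "c"], ["b", "d"]]
right_parts = [["c", "e"], ["b"]]
print(compare_by_part(left_parts, right_parts))
```

Because each pair of parts is handled on its own, the pairs could in principle be processed in parallel, and no global sort of either file is needed.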



