All About Programming: DATA ANALYSIS: 10 popular Linux commands for Hadoop

DATA ANALYSIS: 10 popular Linux commands for Hadoop

1. sort
A good practice when running Hadoop is to always test map/reduce programs on the local machine before submitting the time-consuming jobs to the cluster. The sort command simulates the shuffle-and-sort step of the map/reduce process. For example, I can run the pipeline below to check whether the Python code has any bugs.
./mapper.py | sort | ./reducer.py
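To make the local dry run concrete, here is a minimal sketch. The mapper and reducer shell functions are hypothetical word-count stand-ins for mapper.py and reducer.py, and the two input lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for mapper.py and reducer.py (word count).

# Mapper: emit one "word<TAB>1" pair for every word on stdin.
mapper() {
  awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
}

# Reducer: relies on sorted input, so equal keys arrive on adjacent lines.
reducer() {
  awk -F '\t' '
    NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
    { key = $1; sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# Local dry run: sort plays the role of the shuffle-and-sort phase.
printf 'to be or not\nto be\n' | mapper | LC_ALL=C sort | reducer
```

If the reducer only works when the sort step is present, that is expected: on the cluster the framework also delivers the keys to each reducer in sorted order.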

2. tail
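Interestingly, the FS shell in Hadoop, at least in the version I use, supports the tail command but not the head command, so I can only grab the bottom of the data stored in Hadoop. The -tail option prints the last kilobyte of the file rather than a given number of lines, so I pipe it through the local tail command to keep the last five lines.
hadoop fs -tail data/web.log.9 | tail -n 5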

3. sed
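Since the FS shell doesn't provide a head command, an alternative is the sed command, which also has more flexible options. The expression below deletes everything except line 1 and the five lines after it (the 1,+5 address form is a GNU sed extension).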
hadoop fs -cat data/web.log.9 | sed '1,+5!d'
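The sed expression can be tried locally before pointing it at Hadoop. In this sketch the sample function is a hypothetical stand-in for hadoop fs -cat, and the second command is a portable alternative that I am suggesting, not part of the original recipe.

```shell
# Hypothetical stand-in for `hadoop fs -cat`: twenty numbered lines.
sample() { seq 1 20; }

# GNU sed only: addr,+N selects line 1 plus the 5 lines after it,
# and !d deletes every line outside that range.
sample | sed '1,+5!d'

# Portable alternative: print lines until line 6, then quit,
# so sed stops reading instead of scanning the whole stream.
sample | sed '6q'
```

Both commands print lines 1 through 6. The quitting form is worth considering for large files, since sed exits as soon as it has printed what it needs.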

4. stat
The stat command tells me when the file was last modified.
hadoop fs -stat data/web.log.9

5. awk
The commands that the FS shell supports usually have very few options. For example, the du command under the FS shell, in the version I use, does not support the -sh option to aggregate the disk usage of the sub-directories. In this case, I turn to the awk command to sum the sizes myself.
hadoop fs -du data | awk '{sum+=$1} END {print sum}'
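The awk summation can be checked on made-up input first. In this sketch the du_sample function and its three sizes are hypothetical; it assumes, as the command above does, that the size in bytes is the first field of each line.

```shell
# Hypothetical sample mimicking `hadoop fs -du data`: size in bytes, then path.
du_sample() {
  printf '%s\n' \
    '1048576  data/web.log.1' \
    '2097152  data/web.log.2' \
    '524288   data/web.log.9'
}

# Sum the first column, then report the total in bytes and in mebibytes.
du_sample | awk '{ sum += $1 } END { printf "%d bytes (%.1f MB)\n", sum, sum / 1048576 }'
```

A line whose first field is not a number, such as a header, adds 0 to the sum in awk, so it does not distort the total.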

6. wc
One of the first things to learn about a file stored in Hadoop is its total number of lines.
hadoop fs -cat data/web.log.9 | wc -l

7. cut
The cut command is convenient for selecting specific columns of a file. For example, I can count the lines in each unique group defined by character positions 5 through 14. Because uniq -c only merges adjacent identical lines, the values are sorted first.
hadoop fs -cat data/web.log.9 | cut -c 5-14 | sort | uniq -c
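One detail of this pipeline is easy to verify locally: uniq -c counts runs of adjacent identical lines, so unsorted input yields one count per run rather than one per group. A minimal sketch, assuming a hypothetical log layout in which characters 5 to 14 hold a date:

```shell
# Hypothetical log lines: characters 5-14 hold a date such as 2013-01-05.
log_sample() {
  printf '%s\n' \
    'GET 2013-01-05 /index.html' \
    'GET 2013-01-06 /about.html' \
    'GET 2013-01-05 /contact.html'
}

# Without sort: the two 2013-01-05 lines are not adjacent, so three rows come out.
log_sample | cut -c 5-14 | uniq -c

# With sort: identical values become adjacent and are counted once per group.
log_sample | cut -c 5-14 | sort | uniq -c
```

On data that is already ordered by the selected column the sort step changes nothing, so it is safe to keep it in either case.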

8. getmerge
The great thing about the getmerge command is that it fetches all the result files of a map/reduce job to the local file system as a single file.
hadoop fs -getmerge result result_merged.txt

9. grep
I can start a map-only job that uses nothing but the grep command from the Bash shell to find the lines containing the keywords I am interested in. Setting the number of reduce tasks to 0 is what makes it map-only.

hadoop jar $STREAMING -D mapred.reduce.tasks=0 -input data -output result -mapper "bash -c 'grep -e Texas'"
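The mapper string can be tested locally in the same way as the Python scripts. In this sketch the split_sample function and its lines are hypothetical stand-ins for one input split. The second command shows that grep exits with status 1 when nothing matches, which is worth knowing because a streaming task may be treated as failed when its mapper exits with a non-zero status, depending on the configuration.

```shell
# Hypothetical input lines standing in for one input split.
split_sample() {
  printf '%s\n' 'Austin, Texas' 'Boston, Massachusetts' 'Dallas, Texas'
}

# The mapper command exactly as it is passed to -mapper.
split_sample | bash -c 'grep -e Texas'

# A split with no matching line: grep (and therefore bash -c) exits with status 1.
printf 'Boston, Massachusetts\n' | bash -c 'grep -e Texas' \
  || echo "no match, exit status $?"
```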

10. at and crontab
The at and crontab commands are my favorites for scheduling a job on Hadoop. For example, I use the interactive at session below to clean up the map/reduce results overnight, at 02:12. Pressing Ctrl-D ends the session and queues the job.

at 0212
at> hadoop fs -rmr result
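For a cleanup that should repeat every night, crontab is the better fit of the two. A minimal sketch, assuming a daily run at 00:00 (that schedule is my assumption for illustration, not taken from the at example); it only prints the would-be crontab for review and does not install anything.

```shell
# Crontab entry fields: minute hour day-of-month month day-of-week command.
# "0 0 * * *" means every day at 00:00 (an assumed schedule).
entry='0 0 * * * hadoop fs -rmr result'

# Print the current crontab (if any) followed by the new entry.
# To install it, pipe this same output into `crontab -`.
{ crontab -l 2>/dev/null || true; echo "$entry"; }
```

Unlike at, which runs a job once, a crontab entry keeps firing on its schedule until it is removed with crontab -e.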
