Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution.
Data splits are a fundamental concept in the Hadoop Map Reduce framework: a split defines both the size of an individual Map task and its potential execution server. The Record Reader is responsible for actually reading records from the input file and submitting them (as key/value pairs) to the mapper. There are quite a few publications on how to implement a custom Record Reader (see, for example, [1]), but the information on splits is very sketchy. Here we will explain what a split is and how to implement custom splits for specific purposes.
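To make the size aspect concrete: in Hadoop's default `FileInputFormat`, the split size is computed as `max(minSize, min(maxSize, blockSize))`, so with default settings each split covers one HDFS block. The sketch below mirrors that formula in standalone Java (no Hadoop dependencies); the class name is ours, not Hadoop's.

```java
// Standalone sketch of Hadoop FileInputFormat's split-size formula:
// splitSize = max(minSize, min(maxSize, blockSize)).
// minSize/maxSize correspond to the split.minsize/split.maxsize settings;
// blockSize is the file's HDFS block size.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
        // With default min/max, the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Lowering maxSize forces more, smaller splits (more map tasks).
        System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));
    }
}
```

Raising `minSize` above the block size is the usual way to make each map task process several blocks; lowering `maxSize` does the opposite.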