Java – Default Record Reader in Hadoop, global or local byte offset


We know that mappers (and reducers) in Hadoop can only handle key-value pairs as inputs and outputs. A RecordReader is something that converts raw input from a file into key-value pairs. You can write your own “RecordReader”.

The default InputFormat provided by Hadoop is TextInputFormat, whose RecordReader (LineRecordReader) reads a text file line by line. The key it emits for each record is the byte offset of the line (as a LongWritable), and the value is the contents of the line up to the terminating \n character (as a Text object).

We also know that the platform instantiates one mapper for each input split.

Suppose there is a huge file F stored on HDFS, with its splits stored on several different nodes. File F is line-delimited and is being processed by a job that uses the default RecordReader. My question is: is the byte offset of each line (used as the key for that line) calculated locally, relative to the split, or globally, relative to the entire file?

To put it simply, suppose I have a file that consists of two splits with 4 lines each. For simplicity, let each line be exactly 1 byte, so that the first four lines have byte offsets 0, 1, 2, 3:

0 - Line 1
1 - Line 2
2 - Line 3
3 - Line 4

So in the mapper that handles this split, line i is given the key i-1 by the default RecordReader. The second split may be on another node:

? - Line 5
? - Line 6
? - Line 7
? - Line 8

The question is whether the byte offsets for these lines are 4, 5, 6, 7 or start again from 0, 1, 2, 3.

Solution

This is the “global” offset.

You can see it in the code, where the reader's position is initialized from the file split's start offset. For a very large file, that is the byte offset in the file at which the split begins. The position is then incremented from there as lines are read, and it is passed along with each line to your mapper code.
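To make the "global" behavior concrete, here is a minimal sketch that models the idea in plain Java, without any Hadoop dependency. It is not the Hadoop source: the class `SplitOffsetSketch` and the methods `readSplit` and `lineLength` are hypothetical names made up for illustration. The sketch assumes \n-terminated lines and, as I understand the default line reader, that a split which does not start at byte 0 skips its first (possibly partial) line, while the previous split reads one line past its own end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a line reader working on one split of a file.
class SplitOffsetSketch {

    // Returns "offset:line" records for the split [start, start + length).
    static List<String> readSplit(byte[] file, long start, long length) {
        long end = start + length;
        long pos = start; // position starts at the split's offset in the whole file
        if (start != 0) {
            // Assumption: a non-first split skips its first line, because the
            // previous split is responsible for reading it.
            pos += lineLength(file, pos);
        }
        List<String> records = new ArrayList<>();
        while (pos <= end && pos < file.length) {
            int n = lineLength(file, pos);
            boolean hasNewline = file[(int) pos + n - 1] == '\n';
            int textLen = hasNewline ? n - 1 : n;
            String line = new String(file, (int) pos, textLen, StandardCharsets.UTF_8);
            records.add(pos + ":" + line); // key is the position in the whole file
            pos += n;                      // advance by the bytes consumed
        }
        return records;
    }

    // Number of bytes in the line starting at 'from', including its '\n' if present.
    static int lineLength(byte[] file, long from) {
        int i = (int) from;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        int endExclusive = (i < file.length) ? i + 1 : i;
        return endExclusive - (int) from;
    }

    public static void main(String[] args) {
        byte[] file = "aa\nbbb\nc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, 0, 5)); // first split, ends mid-line
        System.out.println(readSplit(file, 5, 9)); // second split, starts mid-line
    }
}
```

With the 14-byte sample file, the first split yields the keys 0 and 3, and the second split yields 7 and 9. The keys of the second split continue from the file-wide position and do not restart at 0.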
