mapreduce. TextInputFormat hadoop
I am a hadoop beginner. I came across this custom RecordReader program , read 3 rows at a time and output 3 rows of input the number of times the input was provided to the mapper.
I can understand why you would use RecordReader, but when the input format class is essentially an extension mapreduce. When I was in the TextInputFormat class, I couldn’t understand how each InputSplit could contain 3 lines.
As I understand it, the TextInputFormat class emits 1 InputSplit per line (each\n).
So how does RecordReader read 3 rows from each InputSplit? Please someone explain how this is possible.
Thanks in advance!
Solution
You need to know the implementation of TextInputFormat
to find out.
Let’s dive into the code. I’ll talk about the new mapreduce API, but the “old” mapred API is very similar.
As you said, from the user’s perspective, TextInputFormat
splits split into records based on some newline characters. Let’s < a href=" https://github.com/apache/hadoop-common/blob/1f2a21ff9b812ec9667ce7561f9c02a45dfe5674/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java#L42" rel="noreferrer noopener nofollow">check the implementation .
You can see that the class is almost empty. The key function is createRecord
, defined by InputFormat
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context
) {
return new LineRecordReader();
}
The general convention is to use InputFormat to get a RecordReader. If you look inside Mapper
and MapContextImpl
, you’ll see that the mapper only uses RecordReader to get the next key and value. He doesn’t know anything.
Mapper:
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
MapContextImpl:
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return reader.nextKeyValue();
}
Now re-read the link you provided carefully. You will see:
NLinesInputFormat
extendsTextInputFormat
and overrides onlycreateRecordReader
. Basically, you can useLineReader
to provide your ownRecordReader
. You want to extend theTextInputFormat
instead of another class higher in the hierarchy because it already handles everything done at this level that you might need (compressed, non-splittable formatting, etcNLinesRecordReader
does the real work. Ininitialize
, it performs the operations needed to get theInputStream
from the correct offset of the providedInputSplit
. It also creates aLineReader
, the same as the one used byTextInputFormat
- In
the nextKeyValue
method, you will seethat LineReader.readLine()
is called three times to get three rows (plus some logic to properly handle corner cases such as large records, line endings, split ends).
.).
Hope that helps. The key is to understand the overall design of the API and how each part interacts with each other.