Java – mapreduce. TextInputFormat hadoop

I am a Hadoop beginner. I came across this custom RecordReader program, which reads 3 lines at a time and outputs the number of times a 3-line input was given to the mapper.

I can understand why a RecordReader is used, but I cannot see how each InputSplit can contain 3 lines when the input format class is essentially an extension of the mapreduce.TextInputFormat class.
As I understand it, the TextInputFormat class emits 1 InputSplit per line (one for each \n).

So how does the RecordReader read 3 lines from each InputSplit? Could someone please explain how this is possible?
Thanks in advance!

Solution

You need to know the implementation of TextInputFormat to find out.

Let’s dive into the code. I’ll talk about the new mapreduce API, but the “old” mapred API is very similar.

As you said, from the user’s perspective, TextInputFormat separates a split into records at newline characters. Let’s check the implementation (https://github.com/apache/hadoop-common/blob/1f2a21ff9b812ec9667ce7561f9c02a45dfe5674/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java#L42).

You can see that the class is almost empty. The key function is createRecordReader, which is defined by InputFormat:

@Override
public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, 
        TaskAttemptContext context
) {
   return new LineRecordReader();
}

The general convention is to use the InputFormat to obtain a RecordReader. If you look inside Mapper and MapContextImpl, you’ll see that the mapper only uses the RecordReader to get the next key and value. It knows nothing else about how the input is read.

Mapper:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

MapContextImpl:

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  return reader.nextKeyValue();
}
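
To make the contract concrete, here is a minimal sketch that mimics the loop above without any Hadoop dependency. The names SimpleRecordReader, OneLineReader and MapperLoopSketch are hypothetical stand-ins invented for illustration, not Hadoop classes, and the key is a character offset here (the real LineRecordReader, as far as I know, uses the byte offset in the file).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class MapperLoopSketch {
    // Hypothetical stand-in for the three RecordReader methods the mapper loop relies on.
    interface SimpleRecordReader<K, V> {
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
    }

    // One record per line, keyed by the character offset where the line starts.
    static class OneLineReader implements SimpleRecordReader<Long, String> {
        private final BufferedReader in;
        private long nextOffset = 0;
        private Long key;
        private String value;

        OneLineReader(String text) {
            this.in = new BufferedReader(new StringReader(text));
        }

        @Override
        public boolean nextKeyValue() {
            String line;
            try {
                line = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) {
                return false; // input exhausted
            }
            key = nextOffset;
            value = line;
            nextOffset += line.length() + 1; // assumes a single '\n' terminator
            return true;
        }

        @Override
        public Long getCurrentKey() { return key; }

        @Override
        public String getCurrentValue() { return value; }
    }

    // Same shape as Mapper.run: pull records until the reader reports no more.
    static <K, V> List<String> run(SimpleRecordReader<K, V> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.nextKeyValue()) {
            // Stands in for map(key, value, context).
            seen.add(reader.getCurrentKey() + ":" + reader.getCurrentValue());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(new OneLineReader("a\nbb\nccc\n")));
    }
}
```

The loop in run never inspects how a record was assembled, so swapping in a reader that returns several lines per record requires no change on the mapper side.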

Now re-read the link you provided carefully. You will see:

  • NLinesInputFormat extends TextInputFormat and overrides only createRecordReader. Basically, instead of the default LineRecordReader, you provide your own RecordReader. You want to extend TextInputFormat rather than another class higher in the hierarchy because it already handles everything done at this level that you might need (compression, non-splittable formats, etc.).

  • NLinesRecordReader does the real work. In initialize, it does what is needed to get an InputStream positioned at the correct offset of the provided InputSplit. It also creates a LineReader, the same class that TextInputFormat’s LineRecordReader uses.
  • In the nextKeyValue method, you will see that LineReader.readLine() is called three times to get three lines (plus some logic to properly handle corner cases such as large records, line endings, and split ends).
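
The grouping step described in the last bullet can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the NLinesRecordReader from the link: NLinesSketch, GroupingReader and readAll are hypothetical names, BufferedReader.readLine() stands in for LineReader.readLine(), and split boundaries, keys and oversized records are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NLinesSketch {
    // Hypothetical reader that packs up to n input lines into one record value.
    static class GroupingReader {
        private final BufferedReader in;
        private final int n;
        private String value;

        GroupingReader(String text, int n) {
            this.in = new BufferedReader(new StringReader(text));
            this.n = n;
        }

        // Calls readLine() up to n times; a shorter final group is still a record.
        boolean nextKeyValue() {
            StringBuilder sb = new StringBuilder();
            int read = 0;
            try {
                while (read < n) {
                    String line = in.readLine();
                    if (line == null) {
                        break; // input ended in the middle of a group
                    }
                    if (read > 0) {
                        sb.append('\n');
                    }
                    sb.append(line);
                    read++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (read == 0) {
                return false; // nothing left to read
            }
            value = sb.toString();
            return true;
        }

        String getCurrentValue() { return value; }
    }

    // Drains the reader the way the mapper loop would, collecting each record value.
    static List<String> readAll(String text, int n) {
        GroupingReader reader = new GroupingReader(text, n);
        List<String> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            records.add(reader.getCurrentValue());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = readAll("l1\nl2\nl3\nl4\nl5\nl6\nl7\n", 3);
        System.out.println(records.size() + " records");
    }
}
```

With seven input lines and n = 3, the sketch produces three records: two full groups and a final record holding the single leftover line. The number of lines per record is decided entirely inside the reader, independently of how the input was divided into splits.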

Hope that helps. The key is to understand the overall design of the API and how its parts interact.
