Java – Does record splitting need to generate a unique key for each record in Hadoop?

I am new to the Hadoop world and have been working through examples to understand how record splitting works in MapReduce jobs. I noticed that TextInputFormat splits a file into records, using the byte offset of each line as the key and the line itself as the value. Since offsets restart at zero for every file, two records from different input files can carry the same key.

Does this affect the mapper in any way? I assume key uniqueness is irrelevant when the mapper never uses the key (as in WordCount), but if the mapper does need the key, doesn't it have to be unique? Can anyone elaborate?

Thanks in advance.

Solution

Each map task is assigned an input split (typically one HDFS block of a single file), and the InputFormat's RecordReader turns that split into key-value pairs. The map() method is invoked once per record, and each call is independent of the others, so duplicate keys across files (or even within one file) have no effect on the mapper. The mapper also emits its own output keys, which need not be related to the input keys at all.
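For illustration, here is a minimal mapper sketch modeled on the standard WordCount example (the class name TokenizerMapper is just the conventional one). Note that map() never reads the byte-offset key, which is exactly why its uniqueness does not matter:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// WordCount-style mapper. The input key is the byte offset produced by
// TextInputFormat; map() never reads it, so duplicate offsets across
// input files are harmless.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // The mapper emits its own keys; the input offset is discarded.
            context.write(word, ONE);
        }
    }
}
```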

During the shuffle phase, the framework groups the mapper's output by key (it is not literally a HashMap; the pairs are partitioned, sorted, and merged), so each key arrives at the reducer together with all of the values emitted for it:

<key, list of values>

This grouped output becomes the input to the reducer, and all values for a given key are handled by the same reduce() call. A mapper may emit many pairs with the same key; in fact, most MapReduce solutions rely on exactly this behavior.
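A matching reducer sketch (again following the conventional WordCount example, with the usual name IntSumReducer): by the time reduce() runs, the framework has already collected every value emitted for that key, from all mappers.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all counts emitted for one word. Every (word, 1) pair produced
// by any mapper ends up in the same reduce() call for that word.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);
    }
}
```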
