Can reducers include instance variables in Hadoop?
I’ve seen examples online of a Reducer having instance variables:
public static class MyReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    private TreeMap<Integer, Long> counts = new TreeMap<Integer, Long>();

    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        /* populate the TreeMap */
    }
}
If one instance of the MyReducer object is used to reduce multiple keys, then we should clear counts somewhere. Where should we do this? Or maybe one instance of MyReducer is used per key: when the key changes, a new instance of MyReducer is created. Is that right? So the real question is: within a single reducer task, how many Reducer objects are created? One, or one per key?
Solution
There is one instance of Reducer per task, not per key. The reduce function is then called once for each key, so if a reducer task receives 4 keys after the shuffle, its reduce function will be called 4 times.
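To make that lifecycle concrete, here is a minimal plain-Java sketch with no Hadoop dependencies: the driver loop in main stands in for the framework, and MyReducer here is a hypothetical stand-in for the class above, not Hadoop's actual API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ReducerLifecycle {
    // Stand-in for the framework-managed Reducer: one instance per task.
    static class MyReducer {
        // Instance state survives across reduce() calls within the task.
        private TreeMap<Integer, Long> counts = new TreeMap<>();

        // Called once per key, like Hadoop's reduce(key, values, context).
        void reduce(int key, List<String> values) {
            counts.put(key, (long) values.size());
        }

        TreeMap<Integer, Long> getCounts() {
            return counts;
        }
    }

    public static void main(String[] args) {
        MyReducer reducer = new MyReducer(); // one object for the whole task
        // The framework would call reduce once per key after the shuffle:
        reducer.reduce(1, Arrays.asList("a", "b"));
        reducer.reduce(2, Arrays.asList("c"));
        reducer.reduce(3, Arrays.asList("d", "e", "f"));
        // One entry per key accumulated in the single instance:
        System.out.println(reducer.getCounts()); // prints {1=2, 2=1, 3=3}
    }
}
```

Because the same object handles every key in the task, the TreeMap accumulates one entry per key rather than being rebuilt.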
As for that particular code example, it doesn’t need to clear the variable, because (I’m guessing) it stores a value count for each key. Since reduce is called once per key, each key’s count is stored in the TreeMap under that key, so entries for different keys never collide.
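If you do need per-task state, Hadoop's org.apache.hadoop.mapreduce.Reducer also provides setup(Context) and cleanup(Context) hooks, called once per task before the first and after the last reduce call; cleanup is the natural place to emit state accumulated across keys. A plain-Java sketch of that call order (again a simulation, not Hadoop's real classes; the calls in main stand in for the framework):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ReducerHooks {
    static class MyReducer {
        private TreeMap<Integer, Long> counts;
        // Records the order of framework callbacks for illustration.
        final List<String> callLog = new ArrayList<>();

        // Hadoop calls setup once per task, before any reduce call.
        void setup() {
            counts = new TreeMap<>();
            callLog.add("setup");
        }

        void reduce(int key, List<String> values) {
            counts.put(key, (long) values.size());
            callLog.add("reduce(" + key + ")");
        }

        // Hadoop calls cleanup once per task, after the last reduce call;
        // this is where state accumulated across keys would be emitted.
        void cleanup() {
            callLog.add("cleanup emits " + counts);
        }
    }

    public static void main(String[] args) {
        MyReducer r = new MyReducer(); // one instance for the whole task
        r.setup();
        r.reduce(1, Arrays.asList("a", "b"));
        r.reduce(2, Arrays.asList("c"));
        r.cleanup();
        System.out.println(r.callLog);
        // prints [setup, reduce(1), reduce(2), cleanup emits {1=2, 2=1}]
    }
}
```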