Can reducers include instance variables in Hadoop?
I’ve seen examples online of a Reducer having instance variables:
public static class MyReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    private TreeMap<Integer, Long> counts = new TreeMap<Integer, Long>();

    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        /* populate the TreeMap */
    }
}
If one instance of the MyReducer object is used to reduce multiple keys, then we should clear counts somewhere. Where should we do this? Or maybe one instance of MyReducer is used per key: when the key changes, a new instance of MyReducer is created. Is that right? So the real question is: within a single reducer task, how many Reducer objects are created? One, or one per key?
Solution
There is one instance of Reducer per task, not per key. The reduce function is then called once for each key, so if a reducer task receives 4 keys after the shuffle, its reduce function will be called 4 times.
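To make that lifecycle concrete, here is a minimal plain-Java sketch with no Hadoop dependencies: the driver loop in main stands in for the framework, and MyReducer here is a hypothetical stand-in for the class above, not Hadoop's actual API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ReducerLifecycle {
    // Stand-in for the framework-managed Reducer: one instance per task.
    static class MyReducer {
        // Instance state survives across reduce() calls within the task.
        private TreeMap<Integer, Long> counts = new TreeMap<>();

        // Called once per key, like Hadoop's reduce(key, values, context).
        void reduce(int key, List<String> values) {
            counts.put(key, (long) values.size());
        }

        TreeMap<Integer, Long> getCounts() {
            return counts;
        }
    }

    public static void main(String[] args) {
        MyReducer reducer = new MyReducer(); // one object for the whole task
        // The framework would call reduce once per key after the shuffle:
        reducer.reduce(1, Arrays.asList("a", "b"));
        reducer.reduce(2, Arrays.asList("c"));
        reducer.reduce(3, Arrays.asList("d", "e", "f"));
        // One entry per key accumulated in the single instance:
        System.out.println(reducer.getCounts()); // prints {1=2, 2=1, 3=3}
    }
}
```

Because the same object handles every key in the task, the TreeMap accumulates one entry per key rather than being rebuilt.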
As for that particular code example, it doesn’t need to clear the variable, because (I’m guessing) it stores a value count for each key. Since reduce is called once per key, each key’s count is stored in the TreeMap under that key, so entries for different keys never collide.
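If you do need per-task state, Hadoop's org.apache.hadoop.mapreduce.Reducer also provides setup(Context) and cleanup(Context) hooks, called once per task before the first and after the last reduce call; cleanup is the natural place to emit state accumulated across keys. A plain-Java sketch of that call order (again a simulation, not Hadoop's real classes; the calls in main stand in for the framework):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ReducerHooks {
    static class MyReducer {
        private TreeMap<Integer, Long> counts;
        // Records the order of framework callbacks for illustration.
        final List<String> callLog = new ArrayList<>();

        // Hadoop calls setup once per task, before any reduce call.
        void setup() {
            counts = new TreeMap<>();
            callLog.add("setup");
        }

        void reduce(int key, List<String> values) {
            counts.put(key, (long) values.size());
            callLog.add("reduce(" + key + ")");
        }

        // Hadoop calls cleanup once per task, after the last reduce call;
        // this is where state accumulated across keys would be emitted.
        void cleanup() {
            callLog.add("cleanup emits " + counts);
        }
    }

    public static void main(String[] args) {
        MyReducer r = new MyReducer(); // one instance for the whole task
        r.setup();
        r.reduce(1, Arrays.asList("a", "b"));
        r.reduce(2, Arrays.asList("c"));
        r.cleanup();
        System.out.println(r.callLog);
        // prints [setup, reduce(1), reduce(2), cleanup emits {1=2, 2=1}]
    }
}
```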