HOW DO I GENERATE MULTIPLE FILENAMES IN THE HADOOP RUNTIME?
I have some data in CSV format, for example:
K1, K2, data1, data2, data3
My mapper passes the key K1K2 to the reducer, and the values are data1, data2, data3.
I want to save this data in multiple files, each named after the key the reducer receives (for example, K1K2). If I use the MultipleOutputs class, I have to specify the filename before the mapper starts, but I can only determine the key after the mapper has read the data. How should I proceed?
PS I’m a newbie.
Solution
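You can generate the filename at runtime and pass it to MultipleOutputs in the Reducer. The write(key, value, baseOutputPath) overload takes the output path per record, so no named output has to be declared up front: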
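// Needs: java.io.IOException, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
private MultipleOutputs<Text, Text> out;

@Override
public void setup(Context context) {
    out = new MultipleOutputs<>(context);
    ...
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text t : values) {
        // generateFileName is your own function; it returns the base output path for this record
        out.write(key, t, generateFileName(<parameter list... >));
    }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Closing MultipleOutputs flushes every file opened through it
    out.close();
}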
For more details, read the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
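Since generateFileName is left up to you, here is one possible sketch of it. The class name OutputNames, the single String parameter, and the set of allowed characters are my own assumptions rather than anything Hadoop requires. It also assumes that MultipleOutputs refuses the base path "part", so that name is prefixed instead of being used as-is.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the rules below are illustrative only.
final class OutputNames {
    // Assumption: keep letters, digits, '.', '_' and '-'; replace anything else.
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9._-]");

    private OutputNames() {}

    /** Turns a reducer key such as "K1K2" into a base output path. */
    static String generateFileName(String key) {
        String cleaned = UNSAFE.matcher(key.trim()).replaceAll("_");
        // Guard against an empty name and against "part"
        // (assumed to be rejected by MultipleOutputs).
        if (cleaned.isEmpty() || cleaned.equals("part")) {
            return "key_" + cleaned;
        }
        return cleaned;
    }
}
```

In the reducer this would be called as out.write(key, t, OutputNames.generateFileName(key.toString())), and the records for key K1K2 should then end up in a file named along the lines of K1K2-r-00000 in the job's output directory.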