Java – How do I generate multiple filenames at runtime in Hadoop?

I have some data in csv format.

For example:

K1, K2, data1, data2, data3

Here my mapper passes K1K2 as the key to the reducer, and the values are data1, data2, data3.

I want to save this data in multiple files, each named after its key, K1K2 (that is, the key received by the reducer). If I use the MultipleOutputs class, I have to specify the filename before the mapper starts. But I can only determine the key after the mapper has read the data. How should I proceed?

PS I’m a newbie.

Solution

You can generate the filenames at runtime in the Reducer and pass them to MultipleOutputs, like this:

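// Assumes the reducer's output key and value types are both Text
private MultipleOutputs<Text, Text> out;

@Override
protected void setup(Context context) {
  out = new MultipleOutputs<>(context);
  ...
}

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
  for (Text t : values) {
    // generateFileName is your own function; its result is used as the base output path
    out.write(key, t, generateFileName(<parameter list...>));
  }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // Close MultipleOutputs so that every file it opened is flushed
  out.close();
}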
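The answer leaves generateFileName up to you. As a rough sketch only, one possible version could derive the name directly from the reducer key. The class name OutputFileNames, the single String parameter, the "unknown" fallback, and the choice of which characters to replace are all assumptions made for illustration, not part of the Hadoop API:

```java
import java.util.regex.Pattern;

// Hypothetical helper: turns a reducer key into a base output path.
final class OutputFileNames {
    // Anything other than letters, digits, '.', '_' and '-' is treated as unsafe (an assumption).
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9._-]");

    private OutputFileNames() {}

    // Trims the key and replaces each unsafe character with '_'.
    // Falls back to "unknown" when nothing is left after trimming.
    static String generateFileName(String key) {
        String cleaned = UNSAFE.matcher(key.trim()).replaceAll("_");
        return cleaned.isEmpty() ? "unknown" : cleaned;
    }
}
```

In the reducer, this would be called as out.write(key, t, OutputFileNames.generateFileName(key.toString())). MultipleOutputs treats the returned string as a base path and appends its own suffix (such as -r-00000), so the final filenames start with the key rather than matching it exactly.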

For more details, read the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
