Java – Hadoop: Use the output file of the job as the input file for the second job (FileNotFound)


I’m trying to run a MapReduce program that uses the output of the first job as the input for the second job. Here is my current code:

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(BookAnalyzer.class);
    job.setJobName("N-Gram Extraction");

    FileSystem fs = FileSystem.get(conf);
    FileStatus[] status_list = fs.listStatus(new Path(args[0]));
    if (status_list != null) {
        for (FileStatus status : status_list) {
            FileInputFormat.addInputPath(job, status.getPath());
        }
    }
    Path nGramOutput = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, nGramOutput);

    job.setMapperClass(BookNGramMapper.class);
    job.setReducerClass(BookNGramReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    if (job.waitForCompletion(true)) {

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2);
        job2.setJarByClass(BookAnalyzer.class);
        job2.setJobName("Term-frequency");

        FileSystem fs2 = FileSystem.get(conf2);
        FileStatus[] status_list2 = fs2.listStatus(nGramOutput);
        if (status_list2 != null) {
            for (FileStatus status : status_list2) {
                FileInputFormat.addInputPath(job2, status.getPath());
            }
        }
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));

        job2.setMapperClass(TermFreqMapper.class);
        job2.setReducerClass(TermFreqReducer.class);

        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }

I get an error message stating that the input path (nGramOutput) does not exist, but if my first job executes correctly (which it does), the output in args[1] should already have been created.

So,

args[0] = initial file

args[1] = output file from first job, and input file for second job

args[2] = output file from second job
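
A quick sanity check I can add right after the first job finishes (a minimal sketch reusing the fs and nGramOutput variables from the code above) to see whether the directory is really there:

    // Verify that job 1 actually wrote its output directory to HDFS
    if (fs.exists(nGramOutput)) {
        System.out.println("Job 1 output exists: " + nGramOutput);
    } else {
        System.out.println("Job 1 output missing: " + nGramOutput);
    }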

Any suggestion would be great!

Thanks!

Solution

Here’s one way of chaining jobs.

Try this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BookAnalyzer {

        // Intermediate directory written by job 1 and read by job 2;
        // any unused HDFS path will do, this value is just a placeholder
        private static final String OUTPUT_PATH = "intermediate_output";

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJarByClass(BookAnalyzer.class);
            job.setJobName("N-Gram Extraction");

            Path nGramOutput = new Path(OUTPUT_PATH);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, nGramOutput);

            job.setMapperClass(BookNGramMapper.class);
            job.setReducerClass(BookNGramReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Block until job 1 finishes, so its output exists before job 2 reads it
            job.waitForCompletion(true);

            Configuration conf2 = new Configuration();
            Job job2 = Job.getInstance(conf2);
            job2.setJarByClass(BookAnalyzer.class);
            job2.setJobName("Term-frequency");

            // Job 1's output directory is job 2's input
            FileInputFormat.addInputPath(job2, nGramOutput);
            FileOutputFormat.setOutputPath(job2, new Path(args[1]));

            job2.setMapperClass(TermFreqMapper.class);
            job2.setReducerClass(TermFreqReducer.class);

            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);

            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }
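
A practical note that goes beyond the original answer: when you rerun the chain, job 1 will abort because the intermediate directory from the previous run already exists, so it helps to delete it before submitting job 1 (this sketch assumes an extra import of org.apache.hadoop.fs.FileSystem):

    // Clear a stale intermediate directory from an earlier run; without this,
    // job 1 fails with "Output directory ... already exists"
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(nGramOutput)) {
        fs.delete(nGramOutput, true); // true = delete recursively
    }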

The paths used are:

args[0] : Input path
nGramOutput : Intermediate output from job 1, which acts as input to job 2
args[1] : Final output path

So the command to run the job will be

hadoop jar myjar.jar args[0] args[1]
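
For example, with hypothetical HDFS paths:

    hadoop jar myjar.jar /user/hadoop/books /user/hadoop/term-frequency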

You no longer have to pass three parameters (args[0], args[1], args[2]); only the initial input path and the final output path are needed, since the intermediate path is fixed inside the program.
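
For longer pipelines, an alternative worth mentioning (not part of the original answer, but standard Hadoop API) is org.apache.hadoop.mapreduce.lib.jobcontrol, which models inter-job dependencies explicitly. A minimal sketch, assuming job and job2 are configured as above:

    import java.util.Collections;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    // Wrap each Job; job2 declares cJob1 as a dependency, so JobControl
    // submits job2 only after job1 has completed successfully
    ControlledJob cJob1 = new ControlledJob(job, null);
    ControlledJob cJob2 = new ControlledJob(job2, Collections.singletonList(cJob1));

    JobControl control = new JobControl("n-gram-chain");
    control.addJob(cJob1);
    control.addJob(cJob2);

    // JobControl is a Runnable that polls job states; run it on its own thread
    Thread t = new Thread(control);
    t.setDaemon(true);
    t.start();
    while (!control.allFinished()) {
        Thread.sleep(500);
    }
    control.stop();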
