Hadoop: Use the output file of the job as the input file for the second job (FileNotFound)
I’m trying to run a MapReduce program that uses the output file of the first job as the input file for the second job. Here is my current code:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(BookAnalyzer.class);
job.setJobName("N-Gram Extraction");

FileSystem fs = FileSystem.get(conf);
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        FileInputFormat.addInputPath(job, status.getPath());
    }
}

Path nGramOutput = new Path(args[1]);
FileOutputFormat.setOutputPath(job, nGramOutput);

job.setMapperClass(BookNGramMapper.class);
job.setReducerClass(BookNGramReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

if (job.waitForCompletion(true)) {
    Configuration conf2 = new Configuration();
    Job job2 = Job.getInstance(conf2);
    job2.setJarByClass(BookAnalyzer.class);
    job2.setJobName("Term-frequency");

    FileSystem fs2 = FileSystem.get(conf2);
    FileStatus[] status_list2 = fs2.listStatus(nGramOutput);
    if (status_list2 != null) {
        for (FileStatus status : status_list2) {
            FileInputFormat.addInputPath(job2, status.getPath());
        }
    }

    FileOutputFormat.setOutputPath(job2, new Path(args[2]));

    job2.setMapperClass(TermFreqMapper.class);
    job2.setReducerClass(TermFreqReducer.class);
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(IntWritable.class);

    System.exit(job2.waitForCompletion(true) ? 0 : 1);
}
I get an error message stating that the input path (nGramOutput) does not exist, but if my first job completes successfully (which it does), the output in args[1] should already have been created.
So,
args[0] = initial file
args[1] = output file from first job, and input file for second job
args[2] = output file from second job
Any suggestion would be great!
Thanks!
Solution
Here’s one way of chaining jobs. Try this:
public class BookAnalyzer {

    // Intermediate path for job1's output; "intermediate_output" is just an
    // example -- any HDFS location your user can write to works.
    private static final String OUTPUT_PATH = "intermediate_output";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(BookAnalyzer.class);
        job.setJobName("N-Gram Extraction");

        Path nGramOutput = new Path(OUTPUT_PATH);

        // addInputPath() takes a Path, not a String
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, nGramOutput);

        job.setMapperClass(BookNGramMapper.class);
        job.setReducerClass(BookNGramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // run job1 to completion before job2 tries to read its output
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2);
        job2.setJarByClass(BookAnalyzer.class);
        job2.setJobName("Term-frequency");

        // job1's output directory is job2's input
        FileInputFormat.addInputPath(job2, nGramOutput);
        // note: configure job2 here, not job
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));

        job2.setMapperClass(TermFreqMapper.class);
        job2.setReducerClass(TermFreqReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
The paths used are:
args[0] : input path
nGramOutput : intermediate output from job1, which acts as input to job2
args[1] : final output path
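A note on why passing nGramOutput directly works: FileInputFormat accepts a directory as an input path and reads every file inside it, and its default filter skips "hidden" names beginning with _ or . (such as the _SUCCESS marker job1 writes), so there is no need to listStatus the files yourself. Here is a plain-Java sketch of that filter rule (the class and directory names are made up for illustration, not Hadoop API):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HiddenFileFilterDemo {

    // The same rule FileInputFormat's default filter applies:
    // files whose names start with "_" or "." are skipped.
    static boolean isVisible(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) throws IOException {
        // simulate a job output directory: one data file plus the _SUCCESS marker
        File dir = new File(System.getProperty("java.io.tmpdir"), "ngram-demo");
        dir.mkdirs();
        new File(dir, "part-r-00000").createNewFile();
        new File(dir, "_SUCCESS").createNewFile();

        List<String> inputs = new ArrayList<>();
        for (File f : dir.listFiles()) {
            if (isVisible(f.getName())) {
                inputs.add(f.getName());
            }
        }
        System.out.println(inputs); // only part-r-00000 is treated as input
    }
}
```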
So the command to run the job becomes
hadoop jar myjar.jar args[0] args[1]
You no longer have to pass three parameters (args[0], args[1], args[2]), since the intermediate path is fixed inside the driver.
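One related gotcha when rerunning chained jobs: FileOutputFormat fails with a FileAlreadyExistsException if an output directory is left over from a previous run, so drivers often delete OUTPUT_PATH up front with fs.delete(new Path(OUTPUT_PATH), true). Below is a local-filesystem sketch of that cleanup using plain java.nio (the class name and paths are illustrative, not Hadoop API):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanOutputDir {

    // Recursively delete a directory if it exists -- the local-filesystem
    // analogue of fs.delete(outputPath, true) in Hadoop's FileSystem API.
    static void deleteIfExists(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // delete children before their parent directories
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // simulate a leftover output directory from a previous run
        Path out = Paths.get(System.getProperty("java.io.tmpdir"), "ngram-out");
        Files.createDirectories(out);
        Files.write(out.resolve("part-r-00000"), new byte[0]);

        deleteIfExists(out);
        System.out.println(Files.exists(out)); // prints false
    }
}
```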