Java – MapReduce problem

I’m trying to implement a MapReduce program that runs a word count on two files and then compares the counts to see which words are the most common.

I noticed that after running the word count on file 1, the result goes to the directory “/data/output1/”, which contains three entries:
– “_SUCCESS”
– “_logs”
– “part-r-00000”
“part-r-00000” is the file that contains the word count result for file 1. If that filename is generated at run time and I don’t know it in advance, how do I get my program to read that particular file?

Also, for the (key, value) pairs, I added an identifier to the “value” so that I can tell which file each word and its count came from.

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    // Tag the value with "_f2" so later stages can tell it came from file 2.
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}

At a later stage, how do I “remove” the identifier so that I can get the “value”?

Solution

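Just point your next MR job at /data/output1/. You don’t need to know the name of the part file: when the input path is a directory, the job reads the files inside it. _SUCCESS is an empty marker file, and FileInputFormat-based input formats skip any path whose name starts with an underscore or a dot, so neither _SUCCESS nor _logs has any effect on your program. The _SUCCESS marker is written simply so that you know the MR job that wrote to the directory completed successfully.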
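
As for getting the original value back from a tagged one at a later stage: one possible approach is to split on the last underscore. The sketch below is plain Java with no Hadoop dependency. The class TaggedValue and its methods tag, value, and source are hypothetical names made up for this illustration, and it assumes the tag itself (for example "f1" or "f2") never contains an underscore.

```java
/**
 * Hypothetical helper (not part of Hadoop): appends a source tag such as
 * "f2" to a value and splits it off again later.
 */
final class TaggedValue {
    // Assumption: the same separator the mapper used in value + "_f2".
    private static final char SEPARATOR = '_';

    private TaggedValue() {}

    /** Builds the tagged form, e.g. tag("42", "f2") gives "42_f2". */
    static String tag(String value, String source) {
        return value + SEPARATOR + source;
    }

    /** Returns everything before the last separator, i.e. the original value. */
    static String value(String tagged) {
        int i = tagged.lastIndexOf(SEPARATOR);
        return i < 0 ? tagged : tagged.substring(0, i);
    }

    /** Returns everything after the last separator, or "" if there is no tag. */
    static String source(String tagged) {
        int i = tagged.lastIndexOf(SEPARATOR);
        return i < 0 ? "" : tagged.substring(i + 1);
    }

    public static void main(String[] args) {
        String tagged = tag("42", "f2");
        // Prints: 42 from f2
        System.out.println(value(tagged) + " from " + source(tagged));
    }
}
```

In a reducer you would call these on each incoming value's toString() result, then parse the value part (for instance with Integer.parseInt) and use the source part to decide which file's count it is. Splitting on the last underscore rather than the first keeps values that themselves contain underscores intact.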
