Java – Hadoop Java word count adjustment doesn’t work – try to summarize it all


I’m trying to tweak the word count example here: http://wiki.apache.org/hadoop/WordCount. Instead of counting the number of occurrences of each word, it should return the total number of words in the input file.

I tried changing the mapper class to write “Sum: ” for every word instead of writing the word from the current iteration.

That is, in the map method, replace

 word.set(tokenizer.nextToken());

with

 word.set("Sum: ");

The rest of the file remains unchanged.

This way, I think the output of all mappers will arrive at the same reducer, which will sum up the occurrences of “Sum: ”, which in turn equals the number of words in the file.

Meaning:

 word  1
 other 1
 other 1

Produce:

 word  1
 other 2

What I expect instead is:

 Sum:  1
 Sum:  1
 Sum:  1

Produce:

 Sum: 3
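The step between those two stages is the standard reduce: all values emitted under the same key are summed. As a plain-Java sketch of that shuffle-and-sum behavior (no Hadoop dependency; the class name `SumReduceDemo` is made up for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SumReduceDemo {
    // Simulates the shuffle + reduce step: group (key, 1) pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Every mapper emitted ("Sum: ", 1), so the reducer sees one key with three values.
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("Sum: ", 1),
                Map.entry("Sum: ", 1),
                Map.entry("Sum: ", 1));
        System.out.println(reduce(pairs).get("Sum: ")); // 3
    }
}
```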

Instead, when I try to run the code, the map operation runs for a very long time and eventually throws an exception:

 RuntimeException: java.io.IOException: Spill failed

No matter how small the input file is.

Looking forward to your help.
Thanks

Solution

You have an infinite loop. The call

 tokenizer.nextToken()

is what actually advances the StringTokenizer past a word in the line. Without it, your map operation never makes progress.

So you need something like this:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text sumText = new Text("Sum: ");

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokenizer.nextToken(); // advance to the next word
            context.write(sumText, one);
        }
    }
}

There is a better solution without a loop, though: you can use the StringTokenizer method countTokens():

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        context.write(new Text("Sum: "), new IntWritable(tokenizer.countTokens()));
    }
}
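To see that the two approaches agree, here is a small standalone sketch using only java.util.StringTokenizer (the class name `TokenCountDemo` and its helper methods are invented for this example):

```java
import java.util.StringTokenizer;

public class TokenCountDemo {
    // Counts tokens by iterating; nextToken() must be called or the loop never ends.
    static int countByLoop(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        int count = 0;
        while (tokenizer.hasMoreTokens()) {
            tokenizer.nextToken(); // advance past the current word
            count++;
        }
        return count;
    }

    // Counts tokens directly, without a loop.
    static int countDirect(String line) {
        return new StringTokenizer(line).countTokens();
    }

    public static void main(String[] args) {
        String line = "the quick brown fox";
        System.out.println(countByLoop(line)); // 4
        System.out.println(countDirect(line)); // 4
    }
}
```

Note that countTokens() counts the tokens remaining at the moment it is called, so it should be used before any nextToken() calls on the same tokenizer.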
