Java – Hadoop’s word percentage program

I’m working on a slightly improved version of the famous WordCount program: instead of raw counts, it should output each word’s percentage of the total number of words in a book. For example:

...
war 0.00002332423%
peace 0.0034234324%
...

Basically, I need to count the total number of words, count the occurrences of each word, and divide each word’s count by the total. So there should be at least two jobs:

Job 1

  • Reads the input directory and produces two output directories: output1 and output2
  • Mapper: writes (word, 1) pairs to output1 and ("total_count", 1) pairs to output2
  • Reducer: sums the values for each key, producing (word, n) pairs in output1 and a single ("total_count", N) pair in output2

Job 2

  • Takes output1 and output2 as input directories and writes the results to output3
  • Mapper: an identity mapper that just passes through each pair it receives
  • Reducer: divides each word’s count by total_count and writes the resulting (word, n/N) pairs to output3
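As a sketch of how such a two-job pipeline is typically chained in a driver (the class name and paths are placeholders of mine, and the mapper/reducer wiring is elided): because job2 is only submitted after job1.waitForCompletion returns, output2 is guaranteed to be complete before Job 2 starts reading.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordPercentageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: per-word counts plus the total count.
        Job job1 = Job.getInstance(conf, "word counts");
        job1.setJarByClass(WordPercentageDriver.class);
        // ... setMapperClass / setReducerClass / output types go here ...
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        // Producing the second directory (output2) from the same job is
        // exactly the difficulty raised in question 1 below.
        FileOutputFormat.setOutputPath(job1, new Path("output1"));
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: submitted only after Job 1 has finished, so both the
        // word counts and the total already exist on disk.
        Job job2 = Job.getInstance(conf, "word percentages");
        job2.setJarByClass(WordPercentageDriver.class);
        // ... setMapperClass / setReducerClass / output types go here ...
        FileInputFormat.addInputPath(job2, new Path("output1"));
        FileInputFormat.addInputPath(job2, new Path("output2"));
        FileOutputFormat.setOutputPath(job2, new Path("output3"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}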

My questions:

  1. I want to avoid scanning the raw input twice, which is why I tried to compute both the per-word counts and the total word count in Job 1. But I don’t see how to keep the two results from getting mixed up in a single output. I tried MultipleOutputs, but then the mapper’s output does not reach the reducer (see the MultipleOutputs sketch after this list).

  2. Job 2 requires multiple inputs, and it needs to read output2 first, because the results in output1 are useless without the total. I feel this is the wrong way to use MapReduce (we shouldn’t need any kind of synchronization), but I don’t see the right way.

  3. The mapper in Job 2 does nothing useful and wastes processor time.
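On question 1: MultipleOutputs.write() bypasses the shuffle and goes straight to files, which is why using it in the mapper left nothing for the reducer. Moving it to the reduce side lets a single job emit both result sets. Here is a minimal sketch under that assumption; the named outputs ("counts", "total"), the base paths, and the class name are my own choices, and the driver must register the named outputs with MultipleOutputs.addNamedOutput.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitReducer
        extends Reducer<Text, IntWritable, Text, LongWritable> {

    private MultipleOutputs<Text, LongWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Route the special total-count key and the ordinary word counts
        // to different named outputs, written under different base paths
        // inside the job's output directory.
        if (key.toString().equals("total_count")) {
            mos.write("total", key, new LongWritable(sum), "output2/part");
        } else {
            mos.write("counts", key, new LongWritable(sum), "output1/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}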

Solution

Thoughts on using a single job:

total_count can be obtained from the map stage of the first job: it is already tracked there as the built-in counter MAP_OUTPUT_RECORDS, the total number of (key, value) pairs emitted by all mappers. So if the mapper always emits 1 as the value, this counter is exactly what you want: the total number of words (including duplicates) in the document.
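For concreteness, a standard WordCount-style mapper along those lines (the class name is an illustration of mine): every context.write call increments MAP_OUTPUT_RECORDS by one, so after the map phase the counter equals the total number of words.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // One output record per word, value always 1, so
            // MAP_OUTPUT_RECORDS ends up equal to the word total.
            context.write(word, ONE);
        }
    }
}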

Now, I don’t know whether this counter can be read from within the reducer. If it can, you can then emit a (word, wordCount / MAP_OUTPUT_RECORDS) pair for each word. I think the lookup would be:

New API:

context.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();

Legacy API:

reporter.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();
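Wired into a reducer, it would look roughly like the sketch below. This is only a sketch under the assumption stated above, namely that the aggregated map-side counter is actually visible from the reduce context, which I can’t confirm for every Hadoop version; the class name and the percentage formatting are mine.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PercentageReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Occurrences of this particular word.
        long wordCount = 0;
        for (IntWritable c : counts) {
            wordCount += c.get();
        }
        // Assumption: the job-wide MAP_OUTPUT_RECORDS counter (the total
        // number of words) can be read here via the string-based lookup.
        long totalWords = context
                .getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS")
                .getValue();
        // Emit the word's share of all words, as a percentage.
        context.write(word, new DoubleWritable(100.0 * wordCount / totalWords));
    }
}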
