Hadoop’s word percentage program
I’m working on a slightly improved version of the famous WordCount program, one that outputs each word’s percentage of all the words in a book. For example:
...
war 0.00002332423%
peace 0.0034234324%
...
Basically, I need to count all the words, count the number of occurrences of each word, and divide each word’s count by the total. So there should be at least two jobs:
Job 1
- Takes the input directory and generates two output directories: output1 and output2
- Mapper: writes (word, 1) pairs to output1 and ("total_count", 1) pairs to output2
- Reducer: summing pairs with the same key in output1 gives (word, n); summing the pairs in output2 gives the total count as ("total_count", N)
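Hadoop classes aside, the logic Job 1 is meant to implement can be sketched in plain Java (the class and method names here are illustrative, not from the question's code):

```java
import java.util.HashMap;
import java.util.Map;

public class Job1Sketch {
    // Simulates Job 1's combined effect: per-word counts (what ends up
    // in output1) and the total word count (what ends up in output2).
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // reducer: sum the (word, 1) pairs
            }
        }
        return counts;
    }

    static int totalCount(Map<String, Integer> counts) {
        // Reducer on output2: summing all the ("total_count", 1) pairs
        // is equivalent to summing the per-word counts.
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCounts("war and peace and war");
        System.out.println(totalCount(counts)); // 5
    }
}
```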
Job 2
- Takes output1 and output2 as input folders and writes the results to output3
- Mapper: does nothing, just passes each pair through unchanged
- Reducer: divides each word’s count by total_count and writes the result to output3
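Job 2's reducer step amounts to a single division per word. A minimal in-memory sketch (plain Java, illustrative names):

```java
import java.util.HashMap;
import java.util.Map;

public class Job2Sketch {
    // Simulates Job 2's reducer: divide each word's count (from output1)
    // by the total_count (from output2), expressed as a percentage.
    static Map<String, Double> percentages(Map<String, Integer> wordCounts, int totalCount) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / totalCount);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("war", 2, "peace", 1);
        Map<String, Double> p = percentages(counts, 3);
        System.out.printf("war %.5f%%%n", p.get("war"));
    }
}
```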
My questions:
- I want to avoid reading the raw input twice, which is why I try to count both the per-word occurrences and the total number of words in Job 1. But I don’t see how to keep the two results from being mixed together in one output. I tried MultipleOutputs, but then the mapper’s output does not reach the reducer.
- Job 2 requires multiple inputs, and it needs to read output2 first, because the results in output1 are useless without the total. This feels like the wrong way to use MapReduce (we shouldn’t need any kind of synchronization), but I don’t see the right way.
- The mapper in Job 2 does nothing useful and wastes processor time.
Solution
Thoughts on using a single job:
total_count can be obtained from the map stage of the first job. In fact, it is already counted as MAP_OUTPUT_RECORDS, which is the total number of mapped (key, value) pairs. So if you always emit 1 as the value, this counter is exactly what you want: the total number of words (including duplicates) in the document.
Now, I don’t know whether you can access this counter in the reducer. If you can, you can then output a (word, wordCount/MAP_OUTPUT_RECORDS) pair for each word. I think you can do this with:
New API:
context.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();
Legacy API:
reporter.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();
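Whether or not the counter is readable from inside the reducer, the arithmetic it enables can be sketched in plain Java (no Hadoop; names are illustrative): since each word is emitted as a (word, 1) pair, the number of emitted records equals the total word count, so no second counting job is needed.

```java
import java.util.HashMap;
import java.util.Map;

public class SingleJobSketch {
    // Single-pass analogue of the MAP_OUTPUT_RECORDS idea: each word is
    // "emitted" once as a (word, 1) pair, so the number of emitted records
    // equals the total word count (including duplicates).
    static Map<String, Double> percentages(String text) {
        Map<String, Integer> counts = new HashMap<>();
        long mapOutputRecords = 0;                // what Hadoop's counter tracks
        for (String word : text.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);  // one (word, 1) record per word
            mapOutputRecords++;
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / mapOutputRecords);
        }
        return result;
    }

    public static void main(String[] args) {
        percentages("war and peace and war")
            .forEach((w, p) -> System.out.printf("%s %.5f%%%n", w, p));
    }
}
```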