Java – Hadoop: What should be mapped and what should be reduced?

This is my first time using MapReduce. I want to write a program that processes large log files. For example, if I’m working with a log file of {Student, College, GPA} records and I want to sort all students by college, what is the “map” part and what is the “reduce” part? I have read many tutorials and examples, but I still have some difficulty with the concept.

Thanks!

Solution

Technically, Hadoop MapReduce treats everything as key-value pairs; you only need to define what the key is and what the value is. The signatures of map and reduce are

map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)

Sorting on the K2 keys happens in the intermediate shuffle phase between map and reduce.
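
In Hadoop’s Java API these signatures appear directly as the generic type parameters of the Mapper and Reducer base classes. A minimal sketch of the correspondence (the concrete Text types and class names are just illustrative assumptions for line-oriented input):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>  corresponds to  map:    (K1, V1) -> list(K2, V2)
// Reducer<K2, V2, K3, V3> corresponds to  reduce: (K2, list(V2)) -> list(K3, V3)
public class TypeSignatures {
    // With the default TextInputFormat, K1 is the byte offset of a line
    // and V1 is the line itself.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> { }
    public static class MyReducer extends Reducer<Text, Text, Text, Text> { }
}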

If your input records look like this

(Student, (College, GPA))

then your mapper only needs to re-emit each record with College as the key:

map: (s, (c, g)) -> [(c, (s, g))]

With College as the new key, Hadoop sorts the records by college for you, and your reducer is just the plain old “identity reducer”.
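
As a concrete sketch, assuming the log lines are comma-separated student,college,gpa records (the class and field names here are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SortByCollege {

    // map: (offset, "student,college,gpa") -> (college, "student,gpa")
    public static class CollegeMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Text college = new Text();
        private final Text studentAndGpa = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3) {
                return; // skip malformed lines
            }
            college.set(fields[1].trim());
            studentAndGpa.set(fields[0].trim() + "," + fields[2].trim());
            context.write(college, studentAndGpa);
        }
    }

    // Identity reducer: the shuffle has already grouped and sorted the
    // records by college, so just pass every pair through unchanged.
    public static class IdentityReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
}

Note that Hadoop’s base Reducer class already behaves as an identity reducer (its default reduce writes each value through with its key), so you could also use it directly instead of spelling one out.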

If you’re doing this sort in practice (i.e., it’s not homework), look at Hive or Pig, which make sorting on specific columns very simple. That said, writing a plain Hadoop MapReduce job for a task like this is always educational and will give you a better understanding of mappers and reducers.
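
For completeness, a sketch of the driver that wires the pieces together (the job name and the use of command-line arguments for the paths are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByCollegeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort students by college");
        job.setJarByClass(SortByCollegeDriver.class);
        job.setMapperClass(SortByCollege.CollegeMapper.class);
        job.setReducerClass(SortByCollege.IdentityReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One caveat: with more than one reducer the output is only sorted within each reducer’s partition; for a single globally sorted result you would run one reducer or use Hadoop’s TotalOrderPartitioner.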
