Java – Hadoop recursive mapping

Hadoop recursive mapping… here is a solution to the problem.

Hadoop recursive mapping

I have a requirement that my mapper may produce a new key/value in some cases for another mapper to handle. Is there a sensible way to do this? I considered writing my own custom input format (queue?). ) to achieve this. Any ideas? Thanks!

EDIT: I should clarify

Method one

Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)

Reduction Step 1:
(out1) -> ok
(out2) -> ok
((fooA, barA), (fooB, barB)) -> create Map Step 2
((fooC, barC)) -> also send this to Map Step 2

Map Step 2:
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4

Reduction Step 2:
(out3) -> ok
((fooD, barD)) -> create Map Step 3
(out4) -> ok

Map Step 3:
(fooD, barD) -> out5

Reduction Step 3:
(out5) -> ok

-- no more map steps. finished --

So it’s completely recursive. Some keys/values emit output for reduction, and some generate new keys/values for mapping. I don’t really know how many Map or Reduction steps I might encounter in a given run.

Method two

Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
(fooD, barD) -> out5

Reduction Step 1:
(out1) -> ok
(out2) -> ok
(out3) -> ok
(out4) -> ok
(out5) -> ok

This method causes the mapper to provide its own list of inputs. I’m not sure which way is simpler to implement in the end.

Solution

The “Method 1” approach of recursion with Hadoop forces you to run the full dataset with Map and Reduce for each “recursive depth”. This means that you have to determine how deep this can go and that you will suffer a huge performance impact.

Can you safely say that recursion depth is finite?

If so, then I would definitely choose “Method 2” and actually build the mapper in a way that performs the required recursion in one mapper call.
It’s simpler and can save a lot of performance.

Related Problems and Solutions