Java – Expected consumption of open file descriptors in Hadoop 0.21.0

Expected consumption of open file descriptors in Hadoop 0.21.0… here is a solution to the problem.

Expected consumption of open file descriptors in Hadoop 0.21.0

Given Hadoop 0.21.0, what assumptions does the framework make about the number of open file descriptors required by each individual map and reduce operation? Specifically, which sub-operations cause Hadoop to open new file descriptors or spill to disk during job execution?

(I'm deliberately ignoring MultipleOutputs here, because it very clearly skews the guarantees offered by the system.)

My reasoning is simple: I want to make sure that every job I write for Hadoop is guaranteed a bounded number of required file descriptors per mapper or reducer. Hadoop is happy to abstract this away from the programmer, which is generally a good thing, right up until the other shoe drops during server management.

I originally asked this question on Server Fault, coming at it from the cluster-management side of things. Since I'm also responsible for the programming, the question is equally relevant here.

Solution

Here's a post that provides some insight into the problem:

This happens because more small files are created when you use the MultipleOutputs class.
Say you have 50 mappers; then, assuming you don't have skewed data, Test1 will always generate exactly 50 files, but Test2 will generate somewhere between 50 and 1000 files (50 mappers x 20 total possible partitions), and this causes a performance hit in I/O. In my benchmark, 199 output files were generated for Test1 and 4569 output files were generated for Test2.

This means that, in the normal case, the number of open file descriptors tracks the number of mappers one-to-one. MultipleOutputs obviously skews this number, multiplying it by the number of possible partitions. The reduce phase then proceeds as usual, generating one file (and therefore one file descriptor) per reduce operation.
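
For illustration only, here is a minimal sketch (mine, not from the original post) of a mapper using the 0.21 mapreduce MultipleOutputs API; the class name and the tab-delimited partition field are hypothetical. Every distinct base output path a mapper writes to opens another file, which is how the map-phase descriptor count climbs toward mappers x partitions:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper: routes each record to a per-partition output file.
public class PartitionedMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the first tab-delimited field names one of ~20 possible partitions.
        String partition = value.toString().split("\t", 2)[0];

        // Each distinct baseOutputPath seen by this mapper opens its own file,
        // so a single mapper can end up holding ~20 extra descriptors at once.
        mos.write(new Text(partition), value, partition + "/part");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // the extra descriptors stay open until cleanup
    }
}

With 50 such mappers running against data containing roughly 20 partitions, that is where the 50-to-1000 file spread quoted above comes from.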

The problem then becomes: most of these files are held open by each mapper during the spill operation, because the output is happily marshalled per split. Hence the shortage of available file descriptors.

Therefore, the currently assumed ceiling on open file descriptors should be (a quick arithmetic check follows the list below):

Map phase: number of mappers * total partitions possible

Reduce phase: number of reduce operations * total partitions possible
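
Plugging in the numbers quoted above (50 mappers, about 20 possible partitions) plus a reducer count I have picked arbitrarily, a back-of-the-envelope check of those ceilings looks like this:

// Rough arithmetic only; the 10-reducer figure is my own assumption.
public class FdCeilingEstimate {
    public static void main(String[] args) {
        int mappers = 50;             // concurrent map tasks
        int reducers = 10;            // hypothetical reducer count
        int partitionsPossible = 20;  // distinct MultipleOutputs partitions

        int mapPhaseCeiling = mappers * partitionsPossible;     // 50 * 20 = 1000
        int reducePhaseCeiling = reducers * partitionsPossible; // 10 * 20 = 200

        System.out.println("Map phase ceiling:    " + mapPhaseCeiling);
        System.out.println("Reduce phase ceiling: " + reducePhaseCeiling);
    }
}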

And that, as they say, is that.
