Java – Use Hadoop to specify memory limits

Use Hadoop to specify memory limits… here is a solution to the problem.

Use Hadoop to specify memory limits

I’m trying to run a high-memory job on a Hadoop cluster (0.20.203), so I modified mapred-site.xml to enforce some memory limits:

  <!-- Maximum virtual memory (MB) a single map task may request -->
  <property>
    <name>mapred.cluster.max.map.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- Maximum virtual memory (MB) a single reduce task may request -->
  <property>
    <name>mapred.cluster.max.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- Virtual memory (MB) of one map slot -->
  <property>
    <name>mapred.cluster.map.memory.mb</name>
    <value>2048</value>
  </property>
  <!-- Virtual memory (MB) of one reduce slot -->
  <property>
    <name>mapred.cluster.reduce.memory.mb</name>
    <value>2048</value>
  </property>

In my job, I specify how much memory I need. Unfortunately, even though I run my process with -Xmx2g (as a console application, the job runs just fine with that much heap), I need to request much more memory for my mapper (as a sub-question: why is that?) or it gets killed.

val conf = new Configuration()
// JVM options for child tasks (applies to both mappers and reducers)
conf.set("mapred.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC")
// Virtual memory (MB) this job requests per map task and per reduce task
conf.set("mapred.job.map.memory.mb", "4096")
conf.set("mapred.job.reduce.memory.mb", "1024")

The reducer hardly needs any memory because I’m using an identity reducer:

  import org.apache.hadoop.mapreduce.Reducer
  // Needed so a java.lang.Iterable can be traversed in a Scala for-comprehension
  import scala.collection.JavaConversions._

  class IdentityReducer[K, V] extends Reducer[K, V, K, V] {
    override def reduce(key: K,
        values: java.lang.Iterable[V],
        context: Reducer[K, V, K, V]#Context) {
      // Pass every value through unchanged
      for (v <- values) {
        context.write(key, v)
      }
    }
  }

However, the reducer still uses a lot of memory. Is it possible to give the reducer different JVM parameters than the mapper? (See the sketch after the log output below.) Hadoop kills the reducer and claims it is using 3960 MB of RAM, so the task ultimately fails. How is this possible?

TaskTree [pid=10282,tipID=attempt_201111041418_0005_r_000000_0] is running beyond memory-limits.
Current usage : 4152717312bytes.
Limit : 1073741824bytes.
Killing task.
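
Regarding the sub-question of giving the reducer different JVM parameters than the mapper: later Hadoop releases split mapred.child.java.opts into mapred.map.child.java.opts and mapred.reduce.child.java.opts, though it is worth checking whether your 0.20.203 build supports them. A minimal sketch, assuming those properties are available and using illustrative values:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Assumes per-task-type options are supported by this Hadoop version;
// versions that only know mapred.child.java.opts will ignore these keys.
conf.set("mapred.map.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC")
conf.set("mapred.reduce.child.java.opts", "-Xms256m -Xmx512m -XX:+UseSerialGC")

If your version does not recognize them, the combined mapred.child.java.opts remains the only knob for child JVM options.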

Update: Even when I use cat as the mapper and uniq as the reducer with -Xms512M -Xmx1g -XX:+UseSerialGC, my task takes up over 2 GB of virtual memory! That seems extravagant: more than twice the maximum heap size and four times the initial heap.

TaskTree [pid=3101,tipID=attempt_201111041418_0112_m_000000_0] is running beyond memory-limits.
Current usage : 2186784768bytes.
Limit : 2147483648bytes.
Killing task.

Update: The original JIRA that changed the configuration format for memory usage notes that Java users are mostly interested in physical memory, to prevent thrashing. I think that is exactly what I want: I don’t want the node to even start the mapper if there is not enough physical memory available. However, these options all appear to be implemented as virtual-memory constraints, which are difficult to manage.

Solution

Check your ulimit. From Cloudera’s documentation for version 0.20.2 (similar issues may apply to later versions):

… if you set mapred.child.ulimit, it is important that it be more than two times the heap size value set in mapred.child.java.opts. For example, if you set a 1G heap, set mapred.child.ulimit to 2.5GB. Child processes are now guaranteed to fork at least once, and the fork momentarily requires twice the overhead in virtual memory.

It’s also possible that setting mapred.child.java.opts programmatically is “too late”; you may want to verify that it actually takes effect, and if not, put it in your mapred-site.xml.
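
As a rough illustration of that advice in mapred-site.xml form, following the “more than two times the heap” rule from the quote above (mapred.child.ulimit is specified in kilobytes, so 2.5 GB is about 2621440 KB; the values below are examples, not recommendations):

  <!-- Child JVM options: a 1 GB heap in this example -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xms256m -Xmx1g -XX:+UseSerialGC</value>
  </property>
  <!-- Child ulimit in KB, kept above 2x the heap (about 2.5 GB here) -->
  <property>
    <name>mapred.child.ulimit</name>
    <value>2621440</value>
  </property>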
