Python – What’s wrong with my boto elastic mapreduce jar job stream parameters?

I’m using the boto library to create workflows in Amazon’s Elastic MapReduce Web Service (EMR). The following code should create a step:

from boto.emr.step import JarStep

# run_id is defined earlier in the workflow script
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])

When I run the job flow, it always fails with this error:

java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext

This is the line in the EMR log that calls the java code:

2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION

What’s wrong with the parameters? Java class definitions can be found here:

https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

Solution

I found a solution to the problem:

  1. You need to specify Hadoop version 0.20 in the job flow parameters.
  2. You need to run the jar step using mahout-core-0.5-SNAPSHOT-job.jar instead of mahout-core-0.5-SNAPSHOT.jar (a corrected step is sketched after this list).
  3. If you have extra streaming steps in your workflow, you need to fix a bug in boto:
    1. Open boto/emr/step.py.
    2. Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'".
    3. Save and reinstall boto.
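
Combining fixes 1 and 2, the jar step from the question might look like the sketch below. The S3 paths and run_id are carried over from the question unchanged; the only assumption is that the self-contained mahout-core-0.5-SNAPSHOT-job.jar has been uploaded next to the original jar.

from boto.emr.step import JarStep

# Sketch only: the same step as in the question, but pointing at the
# -job.jar, the fat jar that bundles Mahout's dependencies.
# Assumes the job jar was uploaded to the same S3 prefix as the original jar.
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT-job.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])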

This is how the run_jobflow function is called to launch the flow with the Mahout step:

jobid = emr_conn.run_jobflow(name=name,
                             log_uri='s3n://' + main_bucket_name + '/emr-logging/',
                             enable_debugging=1,
                             hadoop_version='0.20',
                             steps=[step1, step2])
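
If you would rather not edit the installed boto copy for fix 3, one alternative is to override the streaming jar path at runtime. This is only a sketch: it assumes that StreamingStep.jar() is the method containing the hard-coded path that the line-138 change targets in your boto version.

from boto.emr.step import StreamingStep

# Point streaming steps at the Hadoop 0.20 streaming jar without modifying
# boto/emr/step.py on disk (assumption: jar() is the method that returns
# the hard-coded streaming jar path).
StreamingStep.jar = lambda self: '/home/hadoop/contrib/streaming/hadoop-streaming.jar'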
