Python – Problems with Hadoop after following the quick-start guide


I’m getting an error which I think has something to do with the way I set up the directory:

After running:

hadoop-0.20.205.0/bin/hadoop jar hadoop-0.20.205.0/contrib/streaming/hadoop-streaming-*.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input CS4501-input -output py_wc_out
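(The question doesn’t show mapper.py and reducer.py; they are presumably the usual Hadoop Streaming word-count pair. A minimal sketch of such a pair, for context only:)

#!/usr/bin/env python
# mapper.py - read lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t1' % word)

#!/usr/bin/env python
# reducer.py - sum the counts for each word; Hadoop sorts the mapper output
# by key before the reduce step, so all lines for a word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

(Both scripts must be executable, e.g. chmod +x mapper.py reducer.py, since -mapper and -reducer invoke them directly on the task nodes.)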

I get:
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-ubuntu/hadoop-unjar6120166906857088018/] [] /tmp/streamjob1341652915014758694.jar tmpDir=null

12/04/08 01:34:01 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-ubuntu/mapred/staging/ubuntu/.staging/job_201204080100_0004
12/04/08 01:34:01 ERROR streaming.StreamJob: Error Launching job, Output path already exists: Output directory hdfs://localhost:9000/user/ubuntu/py_wc_out already exists
Streaming Job Failed!

I think it has something to do with my specifying hdfs in core-site.xml, but that is exactly what the quick-start guide says to do. I don’t understand why I need to put hdfs in front of the localhost address and port number.
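(For reference, the pseudo-distributed quick start has core-site.xml declare the default filesystem roughly as below; the exact file isn’t shown in the question, but the hdfs://localhost:9000 value matches the error log:)

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>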

Solution

The problem is that you are rerunning the same job without clearing the output directory first. Delete the output directory and then rerun; you have to do this between runs, because Hadoop fails rather than silently overwriting an existing output directory. This also answers the hdfs question: core-site.xml sets the default filesystem to hdfs://localhost:9000, so the relative path py_wc_out resolves to hdfs://localhost:9000/user/ubuntu/py_wc_out, which is exactly the directory named in the error.

hadoop fs -rmr /user/ubuntu/py_wc_out
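(On newer Hadoop releases -rmr is deprecated; the equivalent command is hadoop fs -rm -r /user/ubuntu/py_wc_out.)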

Personally, my favorite way to solve this “problem” is to append a timestamp to the output directory name on the fly. That way the output path is always unique and you never have to delete the results of a previous run.

hadoop-0.20.205.0/bin/hadoop jar ... -output py_wc_out-`date +%s`
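The backticks make the shell expand date +%s to the current Unix time in seconds before Hadoop sees the argument, so every run gets its own output directory; seconds-level resolution is enough as long as you don’t launch two jobs within the same second.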
