Python Hadoop MapReduce job with mrjob: subprocess.CalledProcessError

I’m trying to run a basic example from the mrjob website on my own data. I’ve successfully run a Hadoop MapReduce job using Hadoop Streaming, and I’ve also run the script successfully without Hadoop, but now I’m trying to run it on Hadoop via mrjob by executing the following command.

./mapred.py -r hadoop --hadoop-bin /usr/bin/hadoop -o hdfs:///user/cloudera/wc_result_mrjob hdfs:///user/cloudera/books

The mapred.py source code is as follows:

#! /usr/bin/env python

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
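
For reference, the same script can also be run without Hadoop at all, since mrjob falls back to its local inline runner when no -r option is given (the input file name below is just an example):

./mapred.py books.txt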

Unfortunately, I get the following error:

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mapred.cloudera.20140824.195414.420162
writing wrapper script to /tmp/mapred.cloudera.20140824.195414.420162/setup-wrapper.sh
STDERR: mkdir: `hdfs:///user/cloudera/tmp/mrjob/mapred.cloudera.20140824.195414.420162/files/': No such file or directory
Traceback (most recent call last):
  File "./mapred.py", line 18, in <module>
    MRWordFrequencyCount.run()
  File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/lib/python2.6/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 238, in _run
    self._upload_local_files_to_hdfs()
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 265, in _upload_local_files_to_hdfs
    self._mkdir_on_hdfs(self._upload_mgr.prefix)
  File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 273, in _mkdir_on_hdfs
    self.invoke_hadoop(['fs', '-mkdir', path])
  File "/usr/lib/python2.6/site-packages/mrjob/fs/hadoop.py", line 109, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'fs', '-mkdir', 'hdfs:///user/cloudera/tmp/mrjob/mapred.cloudera.20140824.195414.420162/files/']' returned non-zero exit status 1

It seems to me that mrjob can’t create some directories in HDFS, but unfortunately I don’t know how to fix this.
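
One way to confirm this is to re-run the exact command from the traceback by hand, outside of mrjob (the path is copied from the error message above):

/usr/bin/hadoop fs -mkdir hdfs:///user/cloudera/tmp/mrjob/mapred.cloudera.20140824.195414.420162/files/

If that fails with the same "No such file or directory" message, the problem lies in how hadoop fs -mkdir behaves on this cluster rather than in the Python code itself.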

My Hadoop distribution is the Cloudera CDH 5.1 QuickStart VM.

Thank you in advance for any suggestions and comments.

Edit:

I tried running the same code on the Cloudera CDH 4.7 QuickStart VM and it works fine. So my revised question is: is the mrjob framework supported on Cloudera CDH 5.1, and if so, how do I get it working?

Solution

I got the same error, and the workaround I used was to change:

self.invoke_hadoop(['fs', '-mkdir', path])

to

self.invoke_hadoop(['fs', '-mkdir', '-p', path])

The modified file is:
/usr/lib/python2.6/site-packages/mrjob/hadoop.py
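
For context, CDH 5 is based on Hadoop 2, where hadoop fs -mkdir no longer creates missing parent directories unless -p is passed, which is why mrjob's temporary-directory creation fails there while it worked on CDH 4. After the change, the method in hadoop.py looks roughly like this (reconstructed from the traceback above, so treat it as a sketch rather than the exact upstream code):

def _mkdir_on_hdfs(self, path):
    # Hadoop 2.x (CDH 5) requires -p to create missing parent
    # directories; without it the mkdir call fails when the
    # parents do not exist yet.
    self.invoke_hadoop(['fs', '-mkdir', '-p', path])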

My mrjob jobs have been running for a few months now without any issues, so I think this workaround is fine.
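
If you want to double-check the behaviour on a particular cluster before or after patching, the failing call can be reproduced directly with subprocess, which is also where the original CalledProcessError comes from (the test path below is just an example):

import subprocess

test_dir = 'hdfs:///user/cloudera/tmp/mrjob-mkdir-test/files/'

# Without -p this fails on Hadoop 2 when the parent directories do not
# exist yet; with -p the nested directories are created as needed.
subprocess.check_call(['/usr/bin/hadoop', 'fs', '-mkdir', '-p', test_dir])

# Remove the test directory again.
subprocess.check_call(['/usr/bin/hadoop', 'fs', '-rm', '-r',
                       'hdfs:///user/cloudera/tmp/mrjob-mkdir-test'])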

I would like to know a better alternative myself.
