MRJob error when running on a hadoop cluster
I’m trying to run a Python job on a Hadoop cluster with mrjob, and my wrapper script is as follows:
#!/bin/bash
. /etc/profile
module load use.own
module load python/python2.7
module load python/mrjob
python path_to_python-script/mr_word_freq_count.py path_to_input_file/input.txt -r hadoop > path_to_output_file/output.txt # note: the output file already exists before I submit the job
So once I submit this script to the cluster using qsub myscript.sh, I get two files: an output file and an error file.
The error file reads as follows:
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Traceback (most recent call last):
File "homefolder/privatemodules/python/examples/mr_word_freq_count.py", line 37, in <module>
MRWordFreqCount.run()
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/job.py", line 500, in run
mr_job.execute()
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/job.py", line 518, in execute
super(MRJob, self).execute()
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/launch.py", line 146, in execute
self.run_job()
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/launch.py", line 206, in run_job
with self.make_runner() as runner:
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/job.py", line 541, in make_runner
return super(MRJob, self).make_runner()
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/launch.py", line 164, in make_runner
return HadoopJobRunner(**self.hadoop_job_runner_kwargs())
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 179, in __init__
super(HadoopJobRunner, self).__init__(**kwargs)
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/runner.py", line 352, in __init__
self._opts = self.OPTION_STORE_CLASS(self.alias, opts, conf_paths)
File "/homefolder/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 132, in __init__
'you must set $HADOOP_HOME, or pass in hadoop_home explicitly')
Exception: you must set $HADOOP_HOME, or pass in hadoop_home explicitly
First question: how do I find $HADOOP_HOME? When I do echo $HADOOP_HOME, nothing is printed, which means it is not set. Since I have to set it, what path should I set it to? Should it be the path of the Hadoop NameNode in the cluster?
Second question: what does the “no configs found” message mean? Does it have something to do with $HADOOP_HOME not being set, or does mrjob expect some other configuration file to be passed in explicitly?
Any help is greatly appreciated.
Thanks in advance!
Solution
First, $HADOOP_HOME should be set to the path of the local Hadoop installation on your machine; almost all Hadoop applications assume that $HADOOP_HOME/bin/hadoop is the hadoop executable. So if Hadoop is installed under the system default path, you should export HADOOP_HOME=/usr; otherwise, export HADOOP_HOME=/path/to/hadoop.
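As a sketch of how you might find that path (the /usr/local/hadoop location below is purely a hypothetical example, not your cluster’s real layout), you can derive the installation root from wherever the hadoop binary lives:

```shell
#!/bin/bash
# Sketch: derive HADOOP_HOME from the location of the hadoop binary.
# On a real cluster you would use: hadoop_bin="$(which hadoop)"
# The path below stands in for a hypothetical output of `which hadoop`.
hadoop_bin="/usr/local/hadoop/bin/hadoop"

# Strip the trailing /bin/hadoop to get the installation root.
export HADOOP_HOME="$(dirname "$(dirname "$hadoop_bin")")"

echo "$HADOOP_HOME"   # prints /usr/local/hadoop
```

You could add the export line to your wrapper script, right after the module load lines, so it is set before the python command runs.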
Second, you can provide an explicit configuration file for mrjob; if you don’t, mrjob falls back on auto-configuration, which is exactly what the “no configs found” message is reporting. In most cases, setting HADOOP_HOME and relying on auto-configuration is enough; for advanced options, see http://pythonhosted.org/mrjob/guides/configs-basics.html
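If you do want an explicit config, here is a minimal sketch that writes one out; the hadoop_home path inside it is a hypothetical example, and on a real cluster you would save the file as ~/.mrjob.conf so mrjob picks it up automatically:

```shell
#!/bin/bash
# Sketch: create a minimal mrjob config file.
# Use ~/.mrjob.conf on a real cluster; the path here keeps the example local.
conf_file="mrjob.conf.example"

cat > "$conf_file" <<'EOF'
runners:
  hadoop:
    hadoop_home: /usr/local/hadoop
EOF

cat "$conf_file"
```

With that in place, -r hadoop no longer needs $HADOOP_HOME exported in the wrapper, since the runner reads hadoop_home from the config.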