Spark runs locally but not in YARN… here is a solution to the problem.
Spark runs locally but not in YARN
I work fine in local mode. When I run in YARN mode, I get the following error:
I’m getting this error :
File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/jvy234/globalHawk.py", line 84, in <lambda>
TypeError: 'bool' object is not callable
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1319)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Line 84 in my script is:
dataSplit = dataFile.map(lambda line: line.split(deli))
Run locally:
spark-submit --master local globalHawk.py -i 20140817_011500_offer_init.dat -s kh_offers_schema4.txt4 -o txt.txt -d "|"
Run the Yarn client:
spark-submit --master yarn-client globalHawk.py -i 20140817_011500_offer_init.dat -s kh_offers_schema4.txt4 -o txt.txt -d "|"
Solution
This issue should be caused by using different versions of Python in the driver and
YARN worker, and can be solved by using the same version of default Python in the driver and YARN worker.
You can also specify the version of python to use in YARN in the following way:
PYSPARK_PYTHON=python2.6 bin/spark-submit xxx
(No YARN cluster available, not tested).