Python – Spark runs locally but not in YARN

My script works fine in local mode, but when I run it in YARN mode I get the following error:

 File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/hdfs15/yarn/nm/usercache/jvy234/filecache/11/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/home/jvy234/globalHawk.py", line 84, in <lambda>
TypeError: 'bool' object is not callable

org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1319)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

Line 84 in my script is:

dataSplit = dataFile.map(lambda line: line.split(deli))
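
For context, a minimal sketch of how the surrounding script might look is shown below. The file name and delimiter are taken from the submit commands further down; the SparkContext setup and variable handling are assumptions, since the question only shows line 84.

from pyspark import SparkContext

sc = SparkContext(appName="globalHawk")

# Values as passed on the command line in the question; the real script
# presumably parses them from its arguments.
deli = "|"
dataFile = sc.textFile("20140817_011500_offer_init.dat")

# Line 84 from the question: plain string splitting, so the TypeError raised
# on the workers points at the worker-side Python environment rather than at
# this code itself.
dataSplit = dataFile.map(lambda line: line.split(deli))
print(dataSplit.take(1))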

Run in local mode:

spark-submit --master local globalHawk.py -i 20140817_011500_offer_init.dat -s kh_offers_schema4.txt4 -o txt.txt -d "|"

Run in YARN client mode:

spark-submit --master yarn-client globalHawk.py -i 20140817_011500_offer_init.dat -s kh_offers_schema4.txt4 -o txt.txt -d "|"

Solution

This issue is most likely caused by the driver and the YARN workers using different versions of Python. It can be solved by making sure the driver and the YARN workers use the same default Python version.

You can also specify which Python version YARN should use, like this:

PYSPARK_PYTHON=python2.6 bin/spark-submit xxx

(I had no YARN cluster available, so this is untested.)
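
To confirm whether the driver and the executors really run different Python versions, one quick check is to print the version seen on each side. This is a minimal diagnostic sketch, not part of the original answer:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")

# Version used by the driver process.
print("driver:    " + sys.version.split()[0])

# Version used by the Python workers on the executors.
def worker_version(_):
    import sys
    yield sys.version.split()[0]

executor_versions = (sc.parallelize(range(4), 4)
                       .mapPartitions(worker_version)
                       .distinct()
                       .collect())
print("executors: " + str(executor_versions))

If the two versions differ, setting PYSPARK_PYTHON as shown above (or aligning the default Python on all nodes) should resolve the error.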
