Python – Spark 1.3.0, Python, Avro files, driver classpath set in spark-defaults.conf but not visible to slaves

Here is a solution to the following problem: reading Avro files with Spark 1.3.0 and Python when the driver classpath is set in spark-defaults.conf but is not visible to the slaves.

I’m using Spark 1.3.0 with Python. I have an app that reads Avro files with the following call:

conf = None

rddAvro = sc.newAPIHadoopFile(
    fileAvro,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)

In my conf/spark-defaults.conf, I have the following line:

spark.driver.extraClassPath /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar

I set up a cluster of three machines (one master and two slaves):

  • If I run spark-submit --master local on the master, it works
  • If I run spark-submit --master local on either of the slaves, it works
  • If I run sbin/start-all.sh and then spark-submit --master spark://cluster-data-master:7077, it fails with the following error:

    java.lang.ClassNotFoundException:
    org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter
    

I can reproduce this bug in local mode by commenting out the driver line in the .conf file. I also tried spark-submit with the appropriate --driver-class-path, but that didn’t work either!
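
For context on why this fails only on the cluster: spark.driver.extraClassPath only prepends the jar to the driver’s classpath, while the executors on the slaves read spark.executor.extraClassPath instead. A spark-defaults.conf that covers both sides, assuming the jar sits at the same path on every machine, could look like this:

spark.driver.extraClassPath   /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar
spark.executor.extraClassPath /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar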

Update: the solution that worked

The following setup worked for me:

  • I pass spark-submit --driver-class-path path/to/appropriate.jar when calling my scripts
  • I have nothing related to the jar file in the spark-defaults.conf file
  • I forward the jar path to the executors with SparkConf().set("spark.executor.extraClassPath", "path/to/appropriate.jar") in the main Python file, as sketched below
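
Putting those three points together, a minimal sketch of the main Python file might look like the following; the app name, input path, and jar path are placeholders, not values from the original post:

from pyspark import SparkConf, SparkContext

fileAvro = "hdfs:///path/to/data.avro"  # placeholder input path

# --driver-class-path on the spark-submit command line covers the driver;
# this setting forwards the same jar to the executors' classpath.
conf = (SparkConf()
        .setAppName("read-avro")
        .set("spark.executor.extraClassPath", "path/to/appropriate.jar"))
sc = SparkContext(conf=conf)

rddAvro = sc.newAPIHadoopFile(
    fileAvro,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")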

I completely gave up on using a conf file to set the path. I haven’t tried the --jars parameter yet, but as fanfabbb suggests below, it might be worth a try.
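
For reference, the --jars route would look roughly like this; spark-submit is documented to add such jars to both the driver and executor classpaths (the script name here is a placeholder):

spark-submit --jars /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar my_avro_app.py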

Solution

Try running it with the --master yarn-cluster option.

Depending on the size of your data, you can allocate more memory per container by increasing the following configuration parameters:

yarn.nodemanager.resource.memory-mb

yarn.scheduler.maximum-allocation-mb
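
Both of these are YARN properties rather than Spark properties, so they normally go in yarn-site.xml on the cluster nodes. A minimal sketch, with values that are illustrative assumptions rather than recommendations:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value> <!-- total memory (MB) YARN may hand out on this node -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>49152</value> <!-- largest single container (MB) the scheduler will grant -->
</property>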

spark-submit --master yarn-client --num-executors 5 --driver-cores 8 --driver-memory 50G --executor-memory 44G code_to_run.py
