spark 1.3.0, python, avro files, driver classpath set in spark-defaults.conf, but not visible to slaves
I'm using Spark 1.3.0 with Python. I have an app that reads Avro files with the following code:
conf = None
rddAvro = sc.newAPIHadoopFile(
    fileAvro,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
In my conf/spark-defaults.conf, I have the following line:
spark.driver.extraClassPath /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar
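(For reference, spark-defaults.conf also accepts an executor-side counterpart of this setting, which I did not have at this point; it would take the same jar path:

spark.executor.extraClassPath /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar)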
I set up a cluster of three machines (one master and two slaves):
- If I run spark-submit --master local on the master, it works.
- If I run spark-submit --master local on any of the slaves, it works.
- If I run sbin/start-all.sh and then spark-submit --master spark://cluster-data-master:7077, it fails with the following error:
java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter
I can reproduce this bug in local mode by commenting out the spark.driver.extraClassPath line in the .conf file.
I tried spark-submit with the appropriate --driver-class-path, but it didn't work either!
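That is, something along these lines (the script name is a placeholder):

spark-submit --driver-class-path /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar --master spark://cluster-data-master:7077 read_avro.py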
Update: the solution
The following worked for me:
- I use spark-submit --driver-class-path path/to/appropriate.jar when calling my scripts.
- I no longer have anything related to the jar in the spark-defaults.conf file.
- I forward the jar path to the executors with SparkConf().set(...).set("spark.executor.extraClassPath", "path/to/appropriate.jar") in the main Python file (sketched below).
I completely gave up using a conf file to set the path. I haven't tried the --jars parameter yet; as fanfabbb suggests below, it might be worth a try.
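Putting the pieces together, here is a minimal sketch of what the main Python file looks like with this approach (the jar path, app name and Avro file path are placeholders):

from pyspark import SparkConf, SparkContext

# Placeholder paths: point these at the actual jar and Avro file
converter_jar = "/pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar"
fileAvro = "/pathto/data.avro"

conf = (SparkConf()
        .setAppName("read-avro")
        # make the converter classes visible on the executors, not only on the driver
        .set("spark.executor.extraClassPath", converter_jar))
sc = SparkContext(conf=conf)

rddAvro = sc.newAPIHadoopFile(
    fileAvro,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")

It is then launched with spark-submit --driver-class-path /pathto/spark-1.3.0/lib/spark-examples-1.3.0-hadoop2.4.0.jar read_avro.py, where read_avro.py is the placeholder name for the file above.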
Solution
Try running it with the --master yarn-cluster option.
Depending on the size of your data, you can allocate more memory per container by increasing the following configuration parameters:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
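For example, these can be raised in yarn-site.xml on the cluster nodes (the values below are only illustrative and depend on how much RAM each machine has):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>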
spark-submit --master yarn-client --num-executors 5 --driver-cores 8 --driver-memory 50G --executor-memory 44G code_to_run.py