Spark (Kafka) streaming memory issues
I’m testing my first Spark Streaming pipeline, which processes messages from Kafka. However, after several test runs I get the following error message:
There is insufficient memory for the Java Runtime Environment to continue.
My test data is very small, so this shouldn’t happen. After looking at the running processes, I suspect that previously submitted Spark jobs were not removed completely.

I usually submit the job like this (I’m using Spark 2.2.1):
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ~/script/to/spark_streaming.py
Then I stop it with Ctrl+C.
The last few lines of the script are as follows:
ssc.start()
ssc.awaitTermination()
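The rest of the script is not shown here, so the following is only a minimal sketch of what such a script typically looks like, with placeholder topic and broker names, using the 0.8 Kafka integration from the submit command; setting spark.streaming.stopGracefullyOnShutdown makes the StreamingContext stop cleanly when the driver JVM receives a shutdown signal:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = (SparkConf()
        .setAppName("kafka-streaming-test")
        # Stop the StreamingContext gracefully when the driver JVM shuts down.
        .set("spark.streaming.stopGracefullyOnShutdown", "true"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # 5-second batches

# Placeholder topic and broker list; replace with real values.
stream = KafkaUtils.createDirectStream(
    ssc, ["test_topic"], {"metadata.broker.list": "localhost:9092"})
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()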
Update
After I changed the way I submit the Spark Streaming job (command below), I’m still having the same issue: memory isn’t freed after killing the job. I have only started Hadoop and Spark on those 4 EC2 nodes.
/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --py-files ~/config.py --master spark://<master_IP>:7077 --deploy-mode client ~/spark_kafka.py
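One way to verify that earlier runs are still holding memory is to look at the Java processes and free memory on each node (a sketch, assuming the usual Linux and JDK tools are on the PATH):

jps -lm     # leftover SparkSubmit / CoarseGrainedExecutorBackend JVMs from earlier runs show up here
free -h     # the memory they hold stays allocated until those JVMs exit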
Solution
When you press Ctrl+C, only the submitter process is interrupted; the job itself continues to run on the cluster. Eventually your system no longer has enough free memory to start a new JVM.
Also, even if you restart the cluster, all previously submitted jobs will be restarted again.
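A sketch of how to clean up the leftover jobs, assuming a standalone master with the default web UI port: open http://<master_IP>:8080, find the applications that are still listed as running and kill them there, or stop the stray JVMs on the nodes directly:

jps            # note the PIDs of the leftover driver/executor processes
kill <pid>     # terminate them; the memory is released once the JVMs exit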