Java – Spark (Kafka) streaming memory issues

Spark (Kafka) streaming memory issues… here is a solution to the problem.

Spark (Kafka) streaming memory issues

I’m testing my first Spark Streaming pipeline, which processes messages from Kafka. However, after several test runs, I get the following error message:

There is insufficient memory for the Java Runtime Environment to continue.

My test data is very small, so this shouldn’t happen. After reviewing the running processes, I suspect that previously submitted Spark jobs were never completely removed.

I usually submit jobs like this (I’m using Spark 2.2.1):
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ~/script/to/spark_streaming.py

Then I stop it with Ctrl+C.

The last few lines of the script are as follows:

ssc.start()
ssc.awaitTermination()
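
For context, those two lines normally sit at the end of a script shaped roughly like the one below. This is a minimal sketch, not the asker’s actual code: the broker address, topic name, and per-batch logic are placeholder assumptions, wired to the spark-streaming-kafka-0-8 package named in the submit command.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka_streaming_test")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Direct stream via the 0-8 Kafka integration; broker and topic are placeholders
stream = KafkaUtils.createDirectStream(
    ssc, ["test_topic"], {"metadata.broker.list": "localhost:9092"})

stream.count().pprint()  # placeholder per-batch logic: count messages

ssc.start()
ssc.awaitTermination()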

Update

After I changed the way I submit the Spark Streaming job (command below), I still have the same issue: memory isn’t freed after killing the job. Only Hadoop and Spark are running on the four EC2 nodes.

/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --py-files ~/config.py --master spark://<master_IP>:7077 --deploy-mode client  ~/spark_kafka.py

Solution

When you press Ctrl-C, only the submitter process is interrupted; the job itself continues to run on the cluster. Eventually the system runs out of memory, so no new JVM can be started.
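
You can verify this on each node by listing the Java processes that are still alive, for example with the JDK’s jps tool (assuming it is on the PATH):

jps -l

Leftover org.apache.spark.executor.CoarseGrainedExecutorBackend entries (the executor JVMs) mean the application outlived the Ctrl-C.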

Also, even if you restart the cluster, all previously running jobs will be restarted.

Read how to stop a running Spark application properly.
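
For a Spark Streaming job in particular, the clean pattern is to stop the StreamingContext itself rather than killing the submitter. A sketch using the same placeholder names as above; spark.streaming.stopGracefullyOnShutdown is a real configuration key, but the surrounding wiring is an assumption, not the asker’s code:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Ask Spark to drain in-flight batches when the driver JVM is shut down
conf = SparkConf().set("spark.streaming.stopGracefullyOnShutdown", "true")
sc = SparkContext(appName="kafka_streaming_test", conf=conf)
ssc = StreamingContext(sc, 5)

# ... build the Kafka stream as before ...

ssc.start()
try:
    ssc.awaitTermination()
finally:
    # Stop the streaming context AND the underlying SparkContext, draining
    # queued batches first, so no driver or executor JVMs are left behind.
    ssc.stop(stopSparkContext=True, stopGraceFully=True)

With this in place, an interrupt that reaches the driver shuts the whole application down instead of orphaning its executors.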
