Spark DataFrame is not distributed
I don’t understand why my DataFrame ends up on only one node.
I have a Spark standalone cluster of 14 machines, each with 4 physical CPUs.
I connect and create my Spark context via notebook:
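For context, creating a session against a standalone cluster from a notebook typically looks like the following sketch (the master URL, app name, and parallelism setting are placeholders, not the asker’s actual values):

```python
from pyspark.sql import SparkSession

# Hypothetical standalone-cluster session; "spark://master-host:7077" and
# the parallelism value are placeholders, not the asker's real configuration.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")
    .appName("notebook-session")
    .config("spark.default.parallelism", "8")
    .getOrCreate()
)
sc = spark.sparkContext
```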
I would like to have a parallelism of 8 partitions, but when I create a DataFrame, I get only one partition:
What am I missing?
Thanks to the answer from user8371915, I repartitioned my DataFrame (I’m reading a compressed .csv.gz file, so I understand that it is not splittable).
But when I call count() on it, I think it executes on only one executor:
Everything runs on executor 1, even though the file is 700 MB and spans 6 blocks on HDFS.
As I understand it, the computation should be spread over more than 10 cores on more than 5 nodes, but everything is computed on a single node 🙁
Solution
There are two possibilities:
- The file is smaller than `spark.sql.files.maxPartitionBytes`.
- The file is compressed with a non-splittable format such as gzip.
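The gzip case is worth illustrating: a gzip stream can only be decoded from its beginning, so there is no point inside the file where a second task could start reading. A plain-Python sketch of this property:

```python
import gzip
import zlib

data = ("some,csv,line\n" * 10000).encode()
compressed = gzip.compress(data)

# Decompressing from the start of the stream works fine...
assert gzip.decompress(compressed) == data

# ...but starting from the middle fails: there is no self-contained
# block boundary a second reader could jump to.
try:
    zlib.decompress(compressed[len(compressed) // 2 :], wbits=31)  # wbits=31 = gzip framing
except zlib.error:
    print("cannot decode a gzip stream starting from the middle")
```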
In the first case you can consider adjusting `spark.sql.files.maxPartitionBytes`, but the default value (128 MB) is already small.
In the second case, it’s best to decompress the file before loading it into Spark. If you can’t, repartition the data after loading, but note that this triggers a full shuffle and will be slow.