Python – Spark DataFrame is not distributed

A Spark DataFrame is not being distributed across the cluster; here is a solution to the problem.

Spark DataFrame is not distributed

I don’t understand why my DataFrame ends up on only one node.
I have a Spark standalone cluster of 14 machines with 4 physical CPUs.

I connect and create my Spark context from a notebook.

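A minimal sketch of what that notebook setup might look like; the master URL, application name, and configuration values are assumptions, not taken from the original screenshots:

```python
from pyspark.sql import SparkSession

# Hypothetical standalone-cluster connection; the master URL and the
# settings below are placeholders, not from the original post.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")
    .appName("distribution-test")
    .config("spark.default.parallelism", "8")     # default RDD parallelism
    .config("spark.sql.shuffle.partitions", "8")  # partitions after a shuffle
    .getOrCreate()
)
sc = spark.sparkContext
```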

I would like to have a parallelism of 8 partitions, but when I create a DataFrame, I get only one partition.
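A sketch of that check, assuming the file lives on HDFS (the path and read options are placeholders):

```python
# Read the gzipped CSV; the path is hypothetical.
df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)

# gzip is not a splittable codec, so the whole file lands in one partition.
print(df.rdd.getNumPartitions())  # prints: 1
```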

What am I missing?

Thanks to the answer from user8371915, I repartitioned my DataFrame. (I’m reading a compressed file (.csv.gz), so I understand why it is not splittable.)
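The repartitioning step could look like this; the partition count of 8 matches the parallelism the question asks for:

```python
# Shuffle the single input partition out across 8 partitions.
df = df.repartition(8)
print(df.rdd.getNumPartitions())  # prints: 8
```

Note that `repartition` only affects stages after the shuffle; reading the gzip file itself remains a single task, which is consistent with the observation below.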

But when I run a count on it, it seems to execute on only one executor.
The job runs on executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS.
As I understand it, the computation should be spread over more than 10 cores, across more than 5 nodes… but everything is computed on a single node 🙁

Solution

There are two possibilities: either the data is small enough to fit in a single partition under the default settings, or the file is stored in a non-splittable format (gzip), which forces Spark to read it into a single partition.

In the first case you could consider adjusting the partitioning parameters, but with the default values the data is simply small.
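For example, one relevant parameter is `spark.sql.files.maxPartitionBytes`, which caps how many bytes go into each input partition; lowering it splits a splittable file into more partitions (the 16 MB value below is an arbitrary illustration):

```python
# Lower the per-partition input cap from the 128 MB default to 16 MB.
# This only helps with splittable files, not with a single .gz.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))
```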

In the second case, it’s best to decompress the file before loading it into Spark. If you can’t do that, repartition right after loading, but the initial read will still be a single slow task.
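A sketch of both options, with placeholder paths:

```python
# Option 1: decompress outside Spark first (e.g. `gunzip myfile.csv.gz`
# in a shell), so the plain CSV can be split across HDFS blocks.
df = spark.read.csv("hdfs:///data/myfile.csv", header=True)

# Option 2: keep the .gz but shuffle immediately after loading.
# The initial read is still a single task, which is the slow part.
df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True).repartition(8)
```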
