Python – pyspark on the cluster, ensuring that all nodes are used

pyspark on the cluster, ensuring that all nodes are used

Deployment information: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g"

I’m loading a 100,000-line text file (stored on HDFS) into an RDD with corpus = sc.textFile("my_file_name"). When I execute corpus.count(), I get 100000. What I realized is that all of these steps were performed on the master node.

Now, my question is: when I do something like new_corpus = corpus.map(some_function), does pyspark automatically distribute the job to all available slaves (16 in my case)? Or do I have to specify something?

Notes:

  • I don’t think anything was actually distributed (or at least not across 16 nodes), because when I do new_corpus.count(), it prints [Stage some_number:> (0+2)/2] instead of [Stage some_number:> (0+16)/16].
  • I don’t think executing corpus = sc.textFile("my_file_name", 16) is my solution, because the function I want to apply works at the row level, so it should be applied 100,000 times (the goal of parallelization is to speed that up, e.g. by having each slave process 100,000/16 rows). It should not be applied 16 times on 16 subsets of the original text file.

Solution

Your observation is not entirely correct. A stage is not an “executor”. In Spark there are jobs, stages, and tasks: the driver starts a job, splits it into stages (a stage is a collection of tasks that share the same shuffle dependencies), and distributes each stage’s tasks across the worker executors. The (0+2)/2 in the progress bar counts tasks within a stage, not executors, so it only tells you that the RDD has 2 partitions. In your case, the shuffle only happens once.
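
To make the terminology concrete, here is a minimal sketch (assuming the same pyspark shell as in the question, so sc already exists, and toy data in place of your corpus): one action triggers one job, a shuffle dependency splits that job into stages, and each stage runs one task per partition.

    rdd = sc.parallelize(range(100000), 16)         # RDD with 16 partitions
    pairs = rdd.map(lambda x: (x % 10, 1))          # narrow transformation, stays in stage 1 (16 tasks)
    counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle dependency -> stage boundary
    counts.count()                                  # the action launches one job with two stages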

To check whether you really got 16 executors, you have to look at the Spark web UI (the Executors tab shows how many were actually allocated). It usually runs on port 4040 of the driver; since you are using YARN, you can also reach it through the running application’s link in the YARN ResourceManager UI.
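
You can also do a rough check from the shell itself. This is a hedged sketch, assuming the same pyspark session: defaultParallelism is a public attribute, while getExecutorMemoryStatus is reached through PySpark’s internal _jsc gateway, so treat it as a convenience check rather than a stable API.

    print(sc.defaultParallelism)                     # hint of the total parallelism available
    status = sc._jsc.sc().getExecutorMemoryStatus()  # one entry per executor (plus the driver)
    print(status.size())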

Also, when you use rdd.map(), the work is parallelized according to the number of partitions of the RDD, not the number of executors you requested; you control that with sc.textFile("my_file_name", numPartitions) (or by repartitioning afterwards). Note that asking for 16 partitions does not mean your function is applied only 16 times: map() still applies it to every one of the 100,000 rows; the partitions merely determine how those rows are grouped into tasks that can run in parallel on different executors.
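
Putting that together, here is a short sketch (assuming the same shell; "my_file_name" is the path from the question, and some_function here is a toy word counter standing in for your real row-level function):

    def some_function(line):                    # toy stand-in for the real per-row function
        return len(line.split())

    corpus = sc.textFile("my_file_name", 16)    # request at least 16 partitions at read time
    print(corpus.getNumPartitions())            # check how many partitions you actually got

    # map() still calls some_function once per row; the 16 partitions only decide
    # how the 100,000 rows are grouped into tasks that can run on different executors.
    new_corpus = corpus.map(some_function)
    print(new_corpus.count())                   # the stage should now show 16 (or more) tasks

    # An existing RDD can also be repartitioned after the fact:
    new_corpus = corpus.repartition(16).map(some_function)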

Here is the cluster-mode overview for reference:
https://spark.apache.org/docs/1.6.0/cluster-overview.html
