Python – Do I need caching if I use RDD multiple times?

Do I need caching if I use RDD multiple times?

Let’s say we have the following code.

x = sc.textFile(...)
y = x.map(...)
z = x.map(...)

Is it necessary to cache x here? Without caching, won’t Spark read the input file twice (once for y and once for z)?

Solution

Skipping the cache does not necessarily make Spark read the input twice; it depends on which actions you run.

Let’s go through the possible scenarios:

Example 1: The file is not read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD

Nothing is read in this case: textFile and map are lazy transformations, and no action has been triggered, so Spark never evaluates anything.
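You can observe the laziness yourself with a minimal runnable sketch. This is illustrative only: it runs in local mode, uses parallelize in place of the text file, and uses an accumulator purely as instrumentation to count evaluations (accumulator updates inside transformations are not guaranteed exactly-once if tasks are retried, but there are no retries in this local sketch).

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")

evals = sc.accumulator(0)   # instrumentation: counts how often the "read" runs

def read_like(n):
    evals.add(1)            # incremented once per element, per evaluation
    return n

x = sc.parallelize(range(5)).map(read_like)   # stands in for sc.textFile(...)
y = x.map(lambda n: n + 1)
z = x.map(lambda n: n - 1)

print(evals.value)   # 0 -- transformations are lazy; nothing has run yet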

Example 2: The file is read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD
print(y.count())    # action of RDD

The file is read only once: the single action on y forces one pass over the data, reading the file and applying the map.
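Extending the same sketch idea, a single action triggers exactly one evaluation (same assumptions as above: local mode, parallelize standing in for the file, an accumulator as illustrative instrumentation):

from pyspark import SparkContext

sc = SparkContext("local[*]", "one-action-demo")

evals = sc.accumulator(0)   # instrumentation: counts how often the "read" runs

def read_like(n):
    evals.add(1)
    return n

x = sc.parallelize(range(5)).map(read_like)   # stands in for sc.textFile(...)
y = x.map(lambda n: n + 1)
z = x.map(lambda n: n - 1)

print(y.count())     # the only action: forces one pass over the data
print(evals.value)   # 5 -- each of the 5 elements was read exactly once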

Example 3: The file is read twice

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD
print(y.count())    # action of RDD
print(z.count())    # action of RDD

The input file is now read twice, because two actions are run and x is not cached: each count() re-evaluates the full lineage, including the file read.
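Here is a sketch of that recomputation, under the same assumptions as the earlier sketches:

from pyspark import SparkContext

sc = SparkContext("local[*]", "two-actions-demo")

evals = sc.accumulator(0)

def read_like(n):
    evals.add(1)
    return n

x = sc.parallelize(range(5)).map(read_like)   # shared, uncached "read"
y = x.map(lambda n: n + 1)
z = x.map(lambda n: n - 1)

print(y.count())
print(z.count())
print(evals.value)   # 10 -- the shared read ran twice, once per action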

Example 4: The file is read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = y.map(...)    # transformation of RDD
print(z.count())    # action of RDD

The file is read only once: the single action on z evaluates the whole chain x -> y -> z in one pass.

Example 5: The file is read twice

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = y.map(...)    # transformation of RDD
print(y.count())    # action of RDD
print(z.count())    # action of RDD

Because actions are now run on two different RDDs and nothing is cached, z.count() recomputes y from the source, so the file is read twice.
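The same effect shows up with a chained lineage, sketched under the same assumptions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "chain-demo")

evals = sc.accumulator(0)

def expensive(n):
    evals.add(1)
    return n * 2

x = sc.parallelize(range(5))
y = x.map(expensive)          # shared, uncached intermediate
z = y.map(lambda n: n + 1)

print(y.count())
print(z.count())
print(evals.value)   # 10 -- z.count() re-ran y's map because y was not cached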

Example 6: The file is read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...).cache()    # transformation of RDD, cached
z = y.map(...)    # transformation of RDD
print(y.count())    # action of RDD
print(z.count())    # action of RDD

Even though two different actions are run, the file is read only once: the first action executes the lineage and stores y in memory, and the second action computes z from the cached y instead of going back to the source.
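And a sketch of the cached case (same assumptions as before, plus the assumption that the cached partitions are not evicted between the two actions):

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

evals = sc.accumulator(0)

def expensive(n):
    evals.add(1)
    return n * 2

x = sc.parallelize(range(5))
y = x.map(expensive).cache()  # first action materializes y in memory
z = y.map(lambda n: n + 1)

print(y.count())              # computes y once and fills the cache
print(z.count())              # served from the cached y
print(evals.value)            # 5 -- the expensive map ran only once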

Edit: Additional information

So the question is: what should be cached and what shouldn’t?
Ans: RDDs that you will use repeatedly need to be cached.
Example 7:

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD

In this case x is used again and again, so it is recommended to cache x: Spark then does not have to re-read it from the source for every action. If you are dealing with a lot of data, this saves a lot of time.
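Applied to the original pattern, it could look like the following sketch ("data.txt" and the lambda transformations are hypothetical stand-ins for the elided code):

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-x")

x = sc.textFile("data.txt").cache()    # hypothetical path; cached because x is reused
y = x.map(lambda line: line.upper())   # hypothetical transformation
z = x.map(lambda line: len(line))      # hypothetical transformation

print(y.count())   # first action: reads the file once and fills the cache
print(z.count())   # second action: served from the cached x, no re-read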

That said, you cannot simply persist every RDD in memory or on disk. While jobs are running, if Spark memory runs low, it starts evicting old cached RDDs using an LRU (least recently used) policy. Whenever an evicted RDD is needed again, Spark recomputes it from the source through the full chain of transformations.
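If eviction is a concern, persist() with an explicit storage level gives more control than cache(), which uses the default memory-only level. A minimal sketch using the StorageLevel API:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")

x = sc.parallelize(range(1000))
# MEMORY_AND_DISK spills partitions to disk instead of dropping them when
# memory runs low, so an evicted partition is reloaded, not recomputed.
x.persist(StorageLevel.MEMORY_AND_DISK)

print(x.count())   # materializes and stores x
print(x.sum())     # reuses the persisted partitions

x.unpersist()      # release the storage once x is no longer reused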
