Python – PySpark – Error using randomSplit on Dataframe

I created a DataFrame with my data to run some machine learning experiments. I tried to split it into a training set and a test set with the randomSplit() function, but it fails in a way I can't explain. My code is similar to this:

from pyspark.ml.feature import VectorAssembler

# Combine the numeric columns and the one-hot encoded vectors
# into a single feature vector.
Features = ['A', 'B', 'C', 'D', 'E', 'aVec', 'bVec', 'cVec', 'dVec']

vec = VectorAssembler(inputCols=Features, outputCol='features')
df = vec.transform(df)
df = df.select('features', 'Target')

# 80/20 train/test split.
(train, test) = df.randomSplit([0.8, 0.2])

print(df.count())
print(train.count())
print(test.count())

The single letters in Features are numeric features, and the *Vec entries are one-hot encoded vectors (created with PySpark's OneHotEncoder).
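The question does not show how the *Vec columns were built. For context, here is a minimal sketch of the usual pattern, assuming Spark 3.x (where OneHotEncoder is an estimator); the source column 'a' and the intermediate 'aIndex' column are hypothetical:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical: index a categorical column 'a', then one-hot encode it
# into 'aVec' (the name used in the question).
df = StringIndexer(inputCol='a', outputCol='aIndex').fit(df).transform(df)
df = OneHotEncoder(inputCols=['aIndex'], outputCols=['aVec']).fit(df).transform(df)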

When Spark reaches print(train.count()), it raises the following exception:

Py4JJavaError: An error occurred while calling o2274.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 5 in stage 1521.0 failed 1 times, most recent failure: Lost task 
5.0 in stage 1521.0 (TID 122477, localhost, executor driver): 
java.lang.IllegalAccessError: tried to access field 
org.apache.spark.sql.execution.BufferedRowIterator.partitionIndex from 
.class 

print(df.count()) works fine, so I suspect randomSplit somehow corrupted my data.

I did a little test: if I remove any one of the OneHotEncoder vectors, it starts working again. The problem doesn't seem to be tied to a specific column, since I can remove any of them: Features = ['aVec', 'bVec', 'cVec'] or Features = ['bVec', 'cVec', 'dVec'] both work, but Features = ['aVec', 'bVec', 'cVec', 'dVec'] does not.
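For anyone wanting to reproduce that elimination test, a sketch like the following (using the column names from the question, with df as the DataFrame before assembly) drops one vector column at a time and reports whether the split survives a count:

from pyspark.ml.feature import VectorAssembler

vec_cols = ['aVec', 'bVec', 'cVec', 'dVec']

for dropped in vec_cols:
    subset = ['A', 'B', 'C', 'D', 'E'] + [c for c in vec_cols if c != dropped]
    assembled = VectorAssembler(inputCols=subset, outputCol='features').transform(df)
    train, test = assembled.select('features', 'Target').randomSplit([0.8, 0.2])
    try:
        # count() forces execution, which is where the error surfaces
        print(dropped, 'removed ->', train.count(), test.count())
    except Exception as e:
        print(dropped, 'removed -> failed:', type(e).__name__)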

Is there a reason why I am getting this error?

Solution

I had the same problem and solved it by removing blank values from my data. One of my input columns contained several blank values; they were not NA or NULL, just an empty string: "". This leads to the same error you describe. I filtered them out with raw_data = raw_data.filter('YourColumn != ""').
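If you are not sure which column carries the blanks, a small sketch like this can locate and drop them; the raw_data name follows the answer above, and trim() catches whitespace-only values as well:

from pyspark.sql import functions as F

# Count blank or whitespace-only values in each string column.
string_cols = [f.name for f in raw_data.schema.fields
               if f.dataType.simpleString() == 'string']
for c in string_cols:
    print(c, raw_data.filter(F.trim(F.col(c)) == '').count())

# Drop rows where any string column is blank.
for c in string_cols:
    raw_data = raw_data.filter(F.trim(F.col(c)) != '')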

Hope this helps you too.
