How to conditionally remove character substrings from PySpark Dataframe StringType() columns based on the length of strings in a column?
I have a PySpark data frame with a StringType() column that mainly contains 15 characters. However, some lines have 11 characters. Example:
df =
+------------------+-----+
|              code|state|
+------------------+-----+
|' 334445532234553'|wa   |
|' 332452132234553'|mn   |
|' 45532234553'    |fl   |
|' 679645532234553'|mo   |
|' 918535532234553'|ar   |
|' 174925532234553'|wi   |
|' 45532234553'    |al   |
|' 928405532234553'|ca   |
+------------------+-----+
I need all rows to contain 11 characters, so I want to remove the last 4 characters from any line that contains 15 characters. This is the output I want:
df.show(8) =
+--------------+-----+
|          code|state|
+--------------+-----+
|' 33444553223'|wa   |
|' 33245213223'|mn   |
|' 45532234553'|fl   |
|' 67964553223'|mo   |
|' 91853553223'|ar   |
|' 17492553223'|wi   |
|' 45532234553'|al   |
|' 92840553223'|ca   |
+--------------+-----+
So far I’ve tried this, which removes the last 4 characters from every row of the "code" column:
from pyspark.sql.functions import substring, length, col, expr
df = df.withColumn("code",expr("substring(code, 1, length(code)-4)"))
So I need to make this conditional on the length of the string in each row.
Edit: With the help of @gmds, I found this solution:
df.withColumn("code",expr("substring(code, 1, 11)"))
Solution
How about this:
df.withColumn('code', df['code'].substr(1, 11))
You were right; it’s just that you were providing a variable value for the length of the substring when you actually wanted a constant.