How to conditionally remove character substrings from PySpark Dataframe StringType() columns based on the length of strings in a column?
I have a PySpark data frame with a StringType() column that mainly contains 15 characters. However, some lines have 11 characters. Example:
df =
+------------------+-----+
|              code|state|
+------------------+-----+
|' 334445532234553'|wa   |
|' 332452132234553'|mn   |
|' 45532234553'    |fl   |
|' 679645532234553'|mo   |
|' 918535532234553'|ar   |
|' 174925532234553'|wi   |
|' 45532234553'    |al   |
|' 928405532234553'|ca   |
+------------------+-----+
I need all rows to contain 11 characters, so I want to remove the last 4 characters from any line that contains 15 characters. This is the output I want:
df.show(8) =
+--------------+-----+
|          code|state|
+--------------+-----+
|' 33444553223'|wa   |
|' 33245213223'|mn   |
|' 45532234553'|fl   |
|' 67964553223'|mo   |
|' 91853553223'|ar   |
|' 17492553223'|wi   |
|' 45532234553'|al   |
|' 92840553223'|ca   |
+--------------+-----+
So far I’ve tried this, which removes the last 4 characters from every row of the "code" column:
from pyspark.sql.functions import substring, length, col, expr
df = df.withColumn("code",expr("substring(code, 1, length(code)-4)"))
So I need to make this conditional on the length of the string in each row.
Edit: With the help of @gmds, I found this solution:
df.withColumn("code",expr("substring(code, 1, 11)"))
Solution
How about this:
df.withColumn('code', df['code'].substr(1, 11))
You were right; it’s just that you were providing a variable value for the length of the substring when you actually wanted a constant.