Python – flatten a Spark DataFrame map/dictionary column into multiple columns

We have a DataFrame that looks like this:

DataFrame[event: string, properties: map<string,string>]

Notice that there are two columns: event and properties. How can we split or flatten the properties column into multiple columns based on the keys in the map?


I noticed that I could do something like this:

newDf = df.withColumn("foo", col("properties")["foo"])

which produces a DataFrame:

DataFrame[event: string, properties: map<string,string>, foo: string]

But I have to do this one by one for every key. Is there a way to do it automatically? For example, if properties contains the keys foo, bar, and baz, can we automatically produce:

DataFrame[event: string, foo: string, bar: string, baz: string]

Solution

You can use the explode() function – it flattens the map by turning each entry into its own row, with two extra columns: key and value.

>>> df.printSchema()
root
 |-- event: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

>>> df.select('event', explode('properties')).printSchema()
root
 |-- event: string (nullable = true)
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)

If you have a column whose values uniquely identify each row (or you add one), you can group by it and pivot the exploded entries back into columns. For example:

from pyspark.sql.functions import explode, first, monotonically_increasing_id

df.withColumn('id', monotonically_increasing_id()) \
    .select('id', 'event', explode('properties')) \
    .groupBy('id', 'event').pivot('key').agg(first('value'))
