flattens the Spark Dataframe column of the map/dictionary into multiple columns… here is a solution to the problem.
flattens the Spark Dataframe column of the map/dictionary into multiple columns
We have a DataFrame
that looks like this:
DataFrame[event: string, properties: map<string,string>]
Notice that there are two columns: event
and properties
. How can we split or flatten the properties
column into multiple columns based on the key values in Map
?
I
noticed that I could do something like this:
newDf = df.withColumn("foo", col("properties")["foo"])
Produces a Dataframe
DataFrame[event: string, properties: map<string,string>, foo: String]
But I have to do this one by one for all the keys. Is there a way to do them automatically? For example, if there is foo
, bar
, and baz
as keys in properties
, can we put map
:
DataFrame[event: string, foo: String, bar: String, baz: String]
Solution
You can use the explode()
function – it flattens map :by creating two extra columns for each entry – key
and value
>>> df.printSchema()
root
|-- event: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
>>> df.select('event', explode('properties')).printSchema()
root
|-- event: string (nullable = true)
|-- key: string (nullable = false)
|-- value: string (nullable = true)
If you have a column with unique values that you can group by, you can use a pivot table. For example:
df.withColumn('id', monotonically_increasing_id()) \
.select('id', 'event', explode('properties')) \
.groupBy('id', 'event').pivot('key').agg(first('value'))