Use Pandas to parse JSON columns with nested values in huge CSVs
I have a huge CSV file (3.5 GB and growing every day) with normal values and a column called "Metadata" that holds nested JSON values. My script below simply converts the JSON column into one regular column per key-value pair. I'm using Python 3 (Anaconda, on Windows).
import pandas as pd
import numpy as np
import csv
import datetime as dt
import json
from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):
    ## parsing code comes here
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')
And the column has JSON in the following format:
{
"content_id":"xxxx",
"parental":"F",
"my_custom_data":{
"GroupId":"NA",
"group":null,
"userGuid":"xxxxxxxxxxxxxx",
"deviceGuid":"xxxxxxxxxxxxx",
"connType":"WIFI",
"channelName":"VOD",
"assetId":"xxxxxxxxxxxxx",
"GroupName":"NA",
"playType":"VOD",
"appVersion":"2.1.0",
"userEnvironmentContext":"",
"vodEncode":"H.264",
"language":"English"
}
}
The desired output is to have all of the above key-value pairs as columns. The data frame also contains other, non-JSON columns, to which I need to add the columns parsed from this JSON. I tried json_normalize, but I'm not sure how to apply json_normalize to a Series object and then break the result up into multiple columns.
Solution
Use json_normalize() directly on the series, then use pandas.concat() to merge the new data frame with the existing one:
pd.concat([df, json_normalize(df['Metadata'])], axis=1)
If you no longer need the old column containing the JSON data, chain .drop('Metadata', axis=1) onto the result.
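A minimal runnable sketch of this on toy data (the column names here are invented stand-ins for the question's "Metadata" column):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "Metadata": [
        {"content_id": "a", "my_custom_data": {"language": "English"}},
        {"content_id": "b", "my_custom_data": {"language": "French"}},
    ],
})

# Flatten the dict column into its own columns, join them side by side
# (axis=1), and drop the original JSON column.
flat = pd.concat([df, pd.json_normalize(df["Metadata"])], axis=1) \
         .drop("Metadata", axis=1)

print(list(flat.columns))
# → ['id', 'content_id', 'my_custom_data.language']
```

Note that recent pandas exposes json_normalize as pd.json_normalize; the pandas.io.json import in the question is the older location.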
The columns generated for the my_custom_data nested dictionary will carry a my_custom_data. prefix. If all names in the nested dictionary are unique, you can remove that prefix with DataFrame.rename():
json_normalize(df['Metadata']).rename(
columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
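For example, on a single toy record (the data here is made up), the rename strips the 15-character prefix and leaves the nested keys bare:

```python
import pandas as pd

nested = pd.Series([
    {"content_id": "a", "my_custom_data": {"language": "English"}},
])

# len('my_custom_data.') == 15, so n[15:] drops exactly the prefix.
flat = pd.json_normalize(nested).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)

print(list(flat.columns))
# → ['content_id', 'language']
```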
If you are using some other method to convert each dictionary value to a flat structure (for example, the flatten_json package), use Series.apply() to process each value and return each resulting dictionary as a pandas.Series object:
def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)
You can then concatenate the result of the Series.apply() call (which will be a data frame) back onto your original data frame:
pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
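Putting the answer back into the chunked loop from the question, a self-contained sketch might look like this (an in-memory two-row CSV stands in for the huge source.csv; the reset_index matters because chunks after the first do not start at index 0, while json_normalize always produces a 0-based index):

```python
import io
import json
import pandas as pd

# Toy stand-in for source.csv: one plain column plus a JSON "Metadata" column.
csv_text = 'id,Metadata\n1,"{""content_id"": ""a""}"\n2,"{""content_id"": ""b""}"\n'

out = io.StringIO()          # stand-in for output.csv
first_chunk = True
for df in pd.read_csv(io.StringIO(csv_text), chunksize=1,
                      converters={"Metadata": json.loads}):
    # Align the chunk's index with json_normalize's 0-based index,
    # join the flattened columns, and drop the raw JSON column.
    flat = pd.concat(
        [df.reset_index(drop=True), pd.json_normalize(df["Metadata"])],
        axis=1,
    ).drop("Metadata", axis=1)
    # Append each processed chunk; write the header only once.
    flat.to_csv(out, index=False, header=first_chunk)
    first_chunk = False

print(out.getvalue())
```

With a real file you would replace the StringIO objects with "source.csv" and an open("output.csv", "a") handle, and use a much larger chunksize.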