Python – Use Pandas to parse JSON columns with nested values in huge CSVs

I have a huge CSV file (3.5GB and growing daily) with ordinary columns plus a column called "Metadata" that holds nested JSON values. My script below is just meant to convert the JSON column into one ordinary column per key-value pair. I'm using Python 3 (Anaconda, on Windows).

import pandas as pd
import numpy as np
import csv
import datetime as dt
import json  # needed for the json.loads converter below

from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):

    ## parsing code comes here

    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')

And the column has JSON in the following format:

{  
   "content_id":"xxxx",
   "parental":"F",
   "my_custom_data":{  
      "GroupId":"NA",
      "group":null,
      "userGuid":"xxxxxxxxxxxxxx",
      "deviceGuid":"xxxxxxxxxxxxx",
      "connType":"WIFI",
      "channelName":"VOD",
      "assetId":"xxxxxxxxxxxxx",
      "GroupName":"NA",
      "playType":"VOD",
      "appVersion":"2.1.0",
      "userEnvironmentContext":"",
      "vodEncode":"H.264",
      "language":"English"
   }
}

The desired output is to have all of the above key-value pairs as columns. The data frame will contain other non-JSON columns, to which I need to add the columns parsed from the above JSON. I tried json_normalize, but I'm not sure how to apply it to a Series object and then break the result up into multiple columns.

Solution

Use json_normalize() directly on the series, then use pandas.concat() to merge the new data frame with the existing one, concatenating along axis=1 so the new columns sit beside the old ones:

pd.concat([df, json_normalize(df['Metadata'])], axis=1)

If you no longer need the old column that contains the JSON data structure, you can add .drop('Metadata', axis=1).
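As a minimal runnable sketch of the concat-and-drop step, using made-up rows that mirror the question's Metadata structure (on current pandas, json_normalize is available as pd.json_normalize; the Series is converted to a list first to avoid differences between pandas versions):

```python
import pandas as pd

# Toy frame standing in for the real CSV data; the nested dicts mirror
# the question's Metadata structure.
df = pd.DataFrame({
    'id': [1, 2],
    'Metadata': [
        {'content_id': 'a', 'my_custom_data': {'language': 'English'}},
        {'content_id': 'b', 'my_custom_data': {'language': 'French'}},
    ],
})

# Flatten the dicts into columns; nested keys get a dotted prefix.
flat = pd.json_normalize(df['Metadata'].tolist())

# Drop the raw JSON column and attach the flattened columns side by side.
result = pd.concat([df.drop('Metadata', axis=1), flat], axis=1)
print(list(result.columns))
# ['id', 'content_id', 'my_custom_data.language']
```

Because both frames share the same default index, axis=1 concatenation lines the rows up correctly.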

The columns generated for the my_custom_data nested dictionary will carry a my_custom_data. prefix. If all names in the nested dictionary are unique, you can remove the prefix with a DataFrame.rename() operation:

json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
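For example, on a single row shaped like the question's data (a sketch using len(prefix) instead of the literal 15, so the slice stays correct if the prefix ever changes):

```python
import pandas as pd

prefix = 'my_custom_data.'

# Flatten one record, then strip the nested-dict prefix from its columns.
flat = pd.json_normalize([
    {'content_id': 'a',
     'my_custom_data': {'language': 'English', 'connType': 'WIFI'}},
]).rename(columns=lambda n: n[len(prefix):] if n.startswith(prefix) else n)

print(list(flat.columns))
# ['content_id', 'language', 'connType']
```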

If you are using some other method to convert each dictionary value to a flat structure (for example, flatten_json), you can use Series.apply() to process each value, returning each resulting dictionary as a pandas.Series object:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)

You can then concatenate the result of the Series.apply() call (which will be a dataframe) back to your original dataframe:

pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
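A runnable sketch of this pattern, with a hypothetical flatten() helper standing in for the real flatten_json library (it only collapses one level of nesting into dotted keys):

```python
import pandas as pd

def flatten(d):
    # Hypothetical stand-in for flatten_json: collapse one level of
    # nesting into dotted keys.
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            for inner_key, inner_value in value.items():
                out[f'{key}.{inner_key}'] = inner_value
        else:
            out[key] = value
    return out

def some_conversion_function(dictionary):
    return pd.Series(flatten(dictionary))

df = pd.DataFrame({
    'id': [1, 2],
    'Metadata': [
        {'content_id': 'a', 'my_custom_data': {'language': 'English'}},
        {'content_id': 'b', 'my_custom_data': {'language': 'French'}},
    ],
})

# Series.apply() returns a DataFrame here (one column per flattened key),
# index-aligned with df, so axis=1 concatenation lines the rows up.
result = pd.concat(
    [df.drop('Metadata', axis=1), df['Metadata'].apply(some_conversion_function)],
    axis=1)
print(list(result.columns))
# ['id', 'content_id', 'my_custom_data.language']
```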
