Python – Use Pandas to parse JSON columns with nested values in huge CSVs

Use Pandas to parse JSON columns with nested values in huge CSVs… here is a solution to the problem.

Use Pandas to parse JSON columns with nested values in huge CSVs

I have a huge CSV file (3.5GB and getting bigger every day) with normal values and a column called “metadata” with nested JSON values. My script looks like this, just to convert a JSON column to a normal column for each key-value pair. I’m using Python3 (Anaconda; Windows)。

import pandas as pd
import numpy as np
import csv
import datetime as dt

from import json_normalize

for df in pd.read_csv("source.csv", engine='c', 

## parsing code comes here

with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')

And the column has JSON in the following format: format


The desired output is to have all of the above key-value pairs as columns. The data frame will contain other non-JSON columns to which I need to add columns that are parsed from the above JSON. I tried json_normalize, but I’m not sure how to apply json_normalize to a Series object and then convert (or break it up) into multiple columns.


Use json_normalize() directly against the series, and then use pandas.concat() Merge a new data frame with an existing data frame:

pd.concat([df, json_normalize(df['Metadata'])])

If you no longer need the old column that contains the JSON data structure, you can add .drop('Metadata', axis=1).

The columns generated for my_custom_data nested dictionaries will have the my_custom_data. prefix. If all names in a nested dictionary are unique, you can use the DataFrame.rename() operation removes the prefix. :

    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)

If you are using some other method to convert each dictionary value to a flat structure (for example, using flatten_json then you want to use Series.apply() processes each value, and then takes each generated dictionary as pandas. The Series() object returns:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd. Series(something_that_processes_dictionary_into_a_flat_dict)

You can then concatenate the result of the Series.apply() call (which will be a dataframe) back to your original dataframe:

pd.concat([df, df['Metadata'].apply(some_conversion_function)])

Related Problems and Solutions