Python – How to split a string value (such as '[title:item][title2:item]', etc.) into a dictionary with pandas

I’m trying to clean up some data in a data frame. In particular, one column looks like this:

0    [Bean status:Whole][Type of Roast:Medium][Coff...
1    [Type of Roast:Espresso][Coffee Type:Blend]
2    [Bean status:Whole][Type of Roast:Dark][Coffee...
3    [Bean status:Whole][Type of Roast:Light][Coffe...
4                                                  NaN
5    [Roaster:Little City][Type of Roast:Light][Cof...
Name: options, dtype: object

My goal is to split it into four columns and assign the corresponding values to the columns as follows:

     Roaster    Bean Status    Type of Roast    Coffee Type
0    NaN        Whole          Medium           Blend
1    NaN        NaN            Espresso         Blend
..
5    Littl...   Whole          Light            Single Origin

I tried df['options'].str.split('[', expand=True), but it didn’t work because the options aren’t always present or in the same position.

My idea is to split each string into a dictionary, store those dictionaries in a new data frame, and then join the two data frames together. However, I got stuck trying to turn the column into dictionaries. I tried the approach from https://www.fir3net.com/Programming/Python/python-split-a-string-into-a-dictionary.html like this:

roasts = {}
roasts = dict(x.split(':') for x in df['options'][0].split('[]'))
print(roasts)

I get this error:

ValueError: dictionary update sequence element #0 has length 4; 2 is required

I tried investigating what’s going on here by saving to a list:

s = ([x.split(':') for x in df['options'][0].split('[]')])
print(s)

[['[Bean status', 'Whole][Type of Roast', 'Medium][Coffee Type', 'Blend]']]

So I see that the code doesn’t split the string the way I want it to. I also tried splitting on a single bracket in different positions, but that didn’t give the right result either.

Is it possible to put this column in a dictionary, or do I have to resort to regular expressions?

Solution
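
As background, the attempt in the question fails because str.split treats its argument as one literal separator, not as a set of characters. The string contains '][' but never '[]', so nothing is split, and the following split(':') returns four fragments instead of a key/value pair. A minimal sketch, assuming the full string for row 0 is the one implied by the list printed in the question:

```python
import re

s = '[Bean status:Whole][Type of Roast:Medium][Coffee Type:Blend]'

# str.split takes a literal separator, not a character set: '[]' never
# occurs in s (only '][' does), so the string comes back in one piece.
unsplit = s.split('[]')
print(unsplit)

# Splitting that single piece on ':' gives four fragments, which dict()
# cannot interpret as a (key, value) pair.
fragments = unsplit[0].split(':')
print(len(fragments))

# Extracting the text between each pair of brackets gives clean
# 'key:value' items instead.
items = re.findall(r'\[(.*?)\]', s)
pairs = dict(item.split(':', 1) for item in items)
print(pairs)
```

The non-greedy group (.*?) stops at the first closing bracket, so each bracketed item is captured separately.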

Using AmiTavory’s sample data

import pandas as pd

df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))

A combination of re.findall and str.split

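import re
import pandas as pd

# For each row, pull out the text between each pair of brackets,
# split every 'key:value' item on ':', and build one dict per row.
pd.DataFrame([
    dict(
        x.split(':')
        for x in re.findall(r'\[(.*?)\]', v)
    )
    for v in df.options
])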

  Bean status Coffee Type Type of Roast
0       Whole         NaN        Medium
1         NaN       Blend      Espresso
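
The column in the question also contains NaN (row 4), and re.findall raises a TypeError on a non-string value. The following is one possible sketch that skips such rows and joins the parsed columns back onto the original frame. The helper name parse_options and the extra NaN row are assumptions added for illustration, and every bracketed item is assumed to contain at least one ':'.

```python
import re

import numpy as np
import pandas as pd

# Sample frame; the NaN row mimics row 4 of the question's column.
df = pd.DataFrame({'options': [
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]',
    np.nan,
]})


def parse_options(value):
    """Turn '[key:value][key:value]' into a dict; non-strings give {}."""
    if not isinstance(value, str):
        return {}
    # maxsplit=1 keeps any further ':' inside the value.
    return dict(item.split(':', 1) for item in re.findall(r'\[(.*?)\]', value))


# One dict per row; keys missing from a row become NaN in that row.
parsed = pd.DataFrame([parse_options(v) for v in df['options']], index=df.index)

# Attach the new columns to the original frame by index.
result = df.join(parsed)
print(result)
```

Passing index=df.index keeps the parsed rows aligned with the original frame even when its index is not the default 0..n-1 range.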
