How to split a string value (such as '[title:item][title2:item]', etc.) into a dictionary with pandas
I’m trying to clean up some data in a data frame. In particular, one column looks like this:
0 [Bean status:Whole][Type of Roast:Medium][Coff...
1 [Type of Roast:Espresso][Coffee Type:Blend]
2 [Bean status:Whole][Type of Roast:Dark][Coffee...
3 [Bean status:Whole][Type of Roast:Light][Coffe...
4 NaN
5 [Roaster:Little City][Type of Roast:Light][Cof...
Name: options, dtype: object
My goal is to split it into four columns and assign the corresponding values to the columns as follows:
Roaster Bean Status Type of Roast Coffee Type
0 NaN Whole Medium Blend
1 NaN NaN Espresso Blend
..
5 Littl... Whole Light Single Origin
I tried df.str.split('[', expand=True), but it didn’t work because the options aren’t always present or in the same position.
My idea is to split each string into a dictionary, store those dictionaries in a new data frame, and then concatenate the two data frames. However, I got lost trying to turn the column into a dictionary. Following https://www.fir3net.com/Programming/Python/python-split-a-string-into-a-dictionary.html, I tried this:
roasts = {}
roasts = dict(x.split(':') for x in df['options'][0].split('[]'))
print(roasts)
I get this error :
ValueError: dictionary update sequence element #0 has length 4; 2 is required
I tried investigating what’s going on here by saving to a list :
s = ([x.split(':') for x in df['options'][0].split('[]')])
print(s)
[['[Bean status', 'Whole][Type of Roast', 'Medium][Coffee Type', 'Blend]']]
So I see that the code isn’t splitting the string the way I want. I have also tried substituting a single bracket in each of those positions, but that doesn’t give the right result either.
Is it possible to put this column in a dictionary, or do I have to resort to regular expressions?
Solution
Using AmiTavory’s sample data
df = pd.DataFrame(dict(options=[
    '[Bean status:Whole][Type of Roast:Medium]',
    '[Type of Roast:Espresso][Coffee Type:Blend]'
]))
A combination of re.findall and str.split
import re
import pandas as pd
pd.DataFrame([
    dict(
        x.split(':')
        for x in re.findall(r'\[(.*?)\]', v)
    )
    for v in df.options
])
Bean status Coffee Type Type of Roast
0 Whole NaN Medium
1 NaN Blend Espresso
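The attempt in the question fails because str.split treats its argument as one literal separator, so split('[]') looks for the two-character substring "[]", which never occurs in the data. The real column also contains NaN, which re.findall cannot handle. Below is a minimal sketch of one way to cover both points and attach the parsed columns back to the original frame. The helper name parse_options, the PAIR pattern, and the sample rows are illustrative assumptions, not part of the answer above, and the sketch assumes keys never contain ':' or ']'.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample modelled on the question, including a missing row.
df = pd.DataFrame({"options": [
    "[Bean status:Whole][Type of Roast:Medium][Coffee Type:Blend]",
    "[Type of Roast:Espresso][Coffee Type:Blend]",
    np.nan,
    "[Roaster:Little City][Type of Roast:Light][Coffee Type:Single Origin]",
]})

# Captures the key and the value of every [key:value] group.
PAIR = re.compile(r"\[([^:\]]+):([^\]]*)\]")


def parse_options(value):
    """Return a dict of the [key:value] pairs; {} for NaN or other non-strings."""
    if not isinstance(value, str):
        return {}
    return {key.strip(): val.strip() for key, val in PAIR.findall(value)}


# One dict per row; keys missing from a row become NaN in that row.
parsed = pd.DataFrame([parse_options(v) for v in df["options"]], index=df.index)

# Attach the new columns to the original frame on the shared index.
result = df.join(parsed)
print(result.drop(columns="options"))
```

Passing index=df.index keeps the parsed rows aligned with the original ones, so the join works even if the frame has been filtered or re-indexed beforehand.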