Re-categorize a column in a pandas DataFrame
I’m trying to build a simple classification model for data stored in a pandas DataFrame, train. To make the model more efficient, I created a list of column names, category_cols, that I know store categorical data. I convert these columns to the category dtype as follows:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)
My target variable, material, is categorical, with 64 unique labels. However, some of these labels appear only once in train, which is too few to train the model well. I therefore want to filter out any observations in train that have these rare material labels. This answer (https://stackoverflow.com/questions/29836836/how-do-i-filter-a-pandas-dataframe-based-on-value-counts) provides a useful groupby + filter combination:
print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))
min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45
This works well in that it does filter out the observations with rare material labels. However, something in the category dtype seems to retain all the previous values of material, even after they have been filtered out. This becomes a problem when I try to create dummy variables, even if I rerun the same categorization step:
filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)
I expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies includes columns for labels that do not appear in filtered. I assume this has something to do with how the category dtype works. If so, can someone explain how to re-categorize a column? Or, if that’s not possible, how can I remove the unnecessary columns from the filtered dummies?
Thanks!
Solution
You can use Series.cat.remove_unused_categories:
Usage
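# inplace=True is no longer accepted by recent pandas versions; assign the result back instead
df['category'] = df['category'].cat.remove_unused_categories()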
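As a minimal sketch of why rerunning astype('category') does not help, the snippet below uses a small made-up Series (not the asker's data). Casting an already-categorical Series to 'category' again keeps its existing categories, whereas remove_unused_categories returns a new Series restricted to the values that actually occur.

```python
import pandas as pd

# Made-up example data; 'd' is the label we filter out.
s = pd.Series(list("aabbccd"), dtype="category")
kept = s[s != "d"]  # 'd' no longer occurs in the values

# Casting an already-categorical Series to 'category' keeps its dtype,
# including the now-unused category 'd'.
recast = kept.astype("category")
stale = list(recast.cat.categories)

# remove_unused_categories returns a new Series with only observed categories.
trimmed = kept.cat.remove_unused_categories()
fresh = list(trimmed.cat.categories)

print(stale)
print(fresh)
```

Because the method returns a new Series rather than modifying the original, the result has to be assigned back to the column.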
Example
import pandas as pd
df = pd.DataFrame({'label': list('aabbccd'),
                   'value': [1] * 7})
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
6     d      1
Convert label to the category dtype:
df['label'] = df.label.astype('category')
print(df.label)
0 a
1 a
2 b
3 b
4 c
5 c
6 d
Name: label, dtype: category
Categories (4, object): [a, b, c, d]
Filter the DataFrame to remove the label d:
df = df[df.label.ne('d')]
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
Delete unused categories
df['label'] = df.label.cat.remove_unused_categories()
print(df.label)
0 a
1 a
2 b
3 b
4 c
5 c
Name: label, dtype: category
Categories (3, object): [a, b, c]
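Applied to the situation in the question, the same step should bring the number of dummy columns in line with the labels that survive the filter. The sketch below uses a small hypothetical stand-in for train (the labels and the value column are invented for illustration) and passes observed=True to groupby, which is an assumption on my part to keep the grouping limited to categories that are present.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train`: two common labels, one rare.
train = pd.DataFrame({
    "material": ["wood"] * 6 + ["steel"] * 6 + ["gold"],
    "value": range(13),
})
train["material"] = train["material"].astype("category")

# Same groupby + filter combination as in the question.
min_count = 5
filtered = train.groupby("material", observed=True).filter(
    lambda x: len(x) > min_count
)

# The rare label is gone from the rows but still listed as a category,
# so get_dummies still creates a column for it.
before = pd.get_dummies(filtered["material"]).shape

# Drop the unused category and assign the result back to the column.
filtered = filtered.assign(
    material=filtered["material"].cat.remove_unused_categories()
)
after = pd.get_dummies(filtered["material"]).shape

print(before)
print(after)
```

With several categorical columns, the same call can be repeated for each name in category_cols.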