Python – Recategorize columns in a pandas DataFrame

I’m trying to build a simple classification model for data stored in a pandas DataFrame, train. To make the model more efficient, I created a list of column names called category_cols that I know store categorical data. I convert these columns to the category dtype as follows:

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)

My target variable, material, is categorical with 64 unique labels. However, some of these labels appear only once in train, which is too few to train the model well. Therefore, I want to drop any observations in train that have these rare material labels. This answer (https://stackoverflow.com/questions/29836836/how-do-i-filter-a-pandas-dataframe-based-on-value-counts) provides a useful groupby + filter combination:

print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))

min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45

This works well: it does drop the observations with rare material labels. However, something in the category dtype seems to retain all the previous values of material, even after they have been filtered out. This becomes an issue when I create dummy variables, even if I rerun the same categorization step:

filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)

I expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies includes columns for labels that no longer appear in filtered. I think this has something to do with how the category dtype works. If so, can someone explain how to recategorize a column? Or, if that’s not possible, how can I remove the unnecessary columns from the filtered dummies?

Thanks!

Solution

You can use Series.cat.remove_unused_categories:

Usage

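# remove_unused_categories returns a new Series, so assign the result back
# (the inplace argument was removed in pandas 2.0)
df['category'] = df['category'].cat.remove_unused_categories()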

Example

import pandas as pd

df = pd.DataFrame({'label': list('aabbccd'),
                   'value': [1] * 7})
print(df)

  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
6     d      1

Convert label to the category dtype

df['label'] = df.label.astype('category')
print(df.label)

0    a
1    a
2    b
3    b
4    c
5    c
6    d
Name: label, dtype: category
Categories (4, object): [a, b, c, d]

Filter the DataFrame to remove the rows with label d

df = df[df.label.ne('d')]
print(df)

  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
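
Filtering drops the rows, but the category dtype still lists d among its categories, which is why pd.get_dummies keeps producing a column for it. A minimal sketch that rebuilds the same frame and checks this (the name kept is used only for illustration):

```python
import pandas as pd

# Rebuild the example frame and filter out label 'd', as in the steps above
df = pd.DataFrame({'label': list('aabbccd'), 'value': [1] * 7})
df['label'] = df['label'].astype('category')
kept = df[df['label'].ne('d')]

# The remaining values no longer contain 'd' ...
print(sorted(kept['label'].unique()))

# ... but the dtype still declares it as a category
print(list(kept['label'].cat.categories))

# get_dummies builds one column per declared category, used or not
print(pd.get_dummies(kept['label']).shape)
```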

Delete unused categories

# Assign the result back; the inplace argument was removed in pandas 2.0
df = df.assign(label=df.label.cat.remove_unused_categories())
print(df.label)

0    a
1    a
2    b
3    b
4    c
5    c
Name: label, dtype: category
Categories (3, object): [a, b, c]
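
Applied to the question, the same call goes between the groupby + filter step and pd.get_dummies. The sketch below is only an illustration: the small train frame, its weight column, and the material values are made up here, and observed=True is passed to groupby so that only labels actually present are grouped.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame
train = pd.DataFrame({
    'material': ['wood'] * 7 + ['steel'] * 6 + ['glass'] * 2 + ['clay'],
    'weight': range(16),
})
train['material'] = train['material'].astype('category')

# Keep only the labels that occur more than min_count times
min_count = 5
filtered = (train.groupby('material', observed=True)
                 .filter(lambda x: len(x) > min_count))

# Drop the categories that no longer occur, then build the dummies
filtered = filtered.assign(
    material=filtered['material'].cat.remove_unused_categories())

print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
```

With several categorical columns, the same call can be applied to each one in turn, for example in a loop over category_cols.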
