Python – pd.get_dummies DataFrames are the same size with sparse=True as with sparse=False


pd.get_dummies DataFrames are the same size with sparse=True as with sparse=False

I have a DataFrame with multiple string columns that I want to convert to categorical (dummy) variables so that I can run some models and extract important features from them.

However, due to the number of unique values, the one-hot encoded data scales to a large number of columns, causing performance issues.
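As a rough illustration of the scaling (made-up columns, not my real data), the dummy-encoded width is the sum of the unique-value counts of the encoded columns:

import pandas as pd

# Toy frame: col1 has 3 unique values, col2 has 2, so get_dummies
# produces 3 + 2 = 5 columns.
toy = pd.DataFrame({'col1': ['x1', 'x2', 'x3', 'x1'],
                    'col2': ['y1', 'y1', 'y2', 'y2']})
print(toy.nunique().sum())            # 5
print(pd.get_dummies(toy).shape[1])   # 5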

To solve this problem, I’m experimenting with the sparse=True parameter of get_dummies.

test1 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000))
test2 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000), sparse=True)

However, when I check the two results with info(), they take up essentially the same amount of memory. The sparse=True version does not seem to use any less space. Why is that?

test1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries,...
dtypes: uint8(2253)
memory usage: 21.6 MB

test2.info()

<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries, ...
dtypes: uint8(2253)
memory usage: 21.9 MB

Solution

I looked at the source of pandas’ get_dummies but have not found a bug so far. Here is a little experiment I did (the first half reproduces your problem with synthetic data).

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: a = ['a', 'b'] * 100000
   ...: A = ['A', 'B'] * 100000
   ...: 
   ...: df1 = pd.DataFrame({'a': a, 'A': A})
   ...: df1 = pd.get_dummies(df1)
   ...: df1.info()
   ...:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

In [2]: df2 = pd.DataFrame({'a': a, 'A': A})
   ...: df2 = pd.get_dummies(df2, sparse=True)
   ...: df2.info()
   ...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

Same result as yours so far (df1 and df2 take the same amount of memory), but if I explicitly convert df2 to a sparse representation using to_sparse with fill_value=0:

In [3]: df2 = df2.to_sparse(fill_value=0)
   ...: df2.info()
   ...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 390.7 KB

Now the memory usage is halved: with fill_value=0, the zeros (half of the values in each dummy column) are no longer stored explicitly.
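A quick way to confirm what is actually stored (assuming the pre-1.0 SparseDataFrame API used above): the density property reports the fraction of values kept explicitly.

# density = ratio of explicitly stored values to total values; here it
# should be about 0.5, since only the 1s are kept and 0 is the implicit
# fill value.
print(df2.density)
print(df2.memory_usage().sum())  # roughly half of the dense total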

In conclusion, I’m not sure why get_dummies(sparse=True) doesn’t actually compress the data even though it returns a SparseDataFrame, but there is a workaround. The behaviour was discussed on GitHub in “get_dummies with sparse doesn’t convert numeric to sparse”, but the conclusion still seems to be up in the air.
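As a side note beyond the pandas version used in this answer: SparseDataFrame and to_sparse were removed in pandas 1.0. On modern pandas, sparse data lives in ordinary DataFrames with a per-column SparseDtype, get_dummies(sparse=True) should show the smaller footprint directly, and the explicit-conversion workaround becomes an astype call. A hedged sketch, not tested against every version:

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b'] * 100000,
                   'A': ['A', 'B'] * 100000})

# get_dummies(sparse=True) now returns sparse-dtype columns,
# e.g. Sparse[uint8, 0] or Sparse[bool, False] depending on version.
dummies = pd.get_dummies(df, sparse=True)
print(dummies.dtypes.unique())
print(dummies.memory_usage(deep=True).sum())

# Modern analogue of to_sparse(fill_value=0): convert a dense frame
# column-wise to a sparse dtype.
dense = pd.get_dummies(df)
sparse = dense.astype(pd.SparseDtype('uint8', 0))
print(sparse.memory_usage(deep=True).sum())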
