Python – Pandas to_csv adds a 'b' prefix after calling .astype('|S') on a column

Following the advice of this article on reducing Pandas DataFrame memory usage, I convert columns with .astype('|S'), as shown below:

data_frame['COLUMN1'] = data_frame['COLUMN1'].astype('|S')
data_frame['COLUMN2'] = data_frame['COLUMN2'].astype('|S')

Doing this reduces the DataFrame’s memory usage by 20-40% with no negative effect on how the columns are processed. However, when I write the file with .to_csv():

data_frame.to_csv(filename, sep='\t', encoding='utf-8')

the columns converted with .astype('|S') are written with a b prefix and single quotes:

b'00001234'  b'Source'

Removing the .astype('|S') calls and writing the CSV again gives the expected output:

00001234  Source

Searching for this issue turned up some GitHub issues, but I don’t think they are relevant (and they appear to have been fixed already): “to_csv and bytes on Python 3” and “BUG: Fix default encoding for CSVFormatter.save”.

I’m using Python 3.6.4 and Pandas 0.22.0, and the behavior is the same on both macOS and Windows. How can I write these columns without the b prefix and single quotes?

Solution

The b prefix marks a Python 3 bytes literal: the value is a bytes object rather than a Unicode string. To remove the prefix, decode each bytes object with its decode method before saving to the CSV file:

data_frame['COLUMN1'] = data_frame['COLUMN1'].apply(lambda s: s.decode('utf-8'))
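
To see where the prefix comes from, here is a minimal sketch using only Python's standard csv module rather than Pandas itself. It assumes the same mechanism is at work: a bytes value is turned into text with str(), which produces its literal form including the b and the quotes. The names write_row, raw and decoded are made up for this illustration.

```python
import csv
import io


def write_row(values):
    """Write one tab-separated row to an in-memory buffer and return the text."""
    buffer = io.StringIO()
    # csv.writer converts non-string values with str() before writing them.
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow(values)
    return buffer.getvalue()


# bytes objects, standing in for the values held by the '|S' columns
raw = [b"00001234", b"Source"]
# the same values decoded to ordinary str objects
decoded = [value.decode("utf-8") for value in raw]

bytes_row = write_row(raw)
str_row = write_row(decoded)

print(repr(bytes_row))  # shows the b'...' form for each field
print(repr(str_row))    # shows the plain text for each field
```

In Pandas itself, data_frame['COLUMN1'].str.decode('utf-8') should be a vectorized equivalent of the apply call above. Decoding turns the values back into ordinary strings, so it may be preferable to decode only a copy right before writing if the memory saving matters for the rest of the processing.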
