Pandas to_csv output is prefixed with b'...' after calling .astype('|S') on a column
I'm following the advice of this article to reduce Pandas DataFrame memory usage. I use .astype('|S') on some columns,
as shown below:
data_frame['COLUMN1'] = data_frame['COLUMN1'].astype('|S')
data_frame['COLUMN2'] = data_frame['COLUMN2'].astype('|S')
Doing this on a DataFrame reduces memory usage by 20-40% without negatively affecting how the columns are processed. However, when I use .to_csv()
to write the file:
data_frame.to_csv(filename, sep='\t', encoding='utf-8')
the columns converted with .astype('|S')
are written with the b prefix and single quotes:
b'00001234' b'Source'
Removing the .astype('|S')
call and writing to CSV gives the expected behavior:
00001234 Source
Some Google searches on this issue did turn up GitHub issues, but I don't think they're relevant (and they appear to have been fixed already): "to_csv and bytes on Python 3" and "BUG: Fix default encoding for CSVFormatter.save".
I'm using Python 3.6.4 and Pandas 0.22.0. I tested that the behavior is the same on both macOS and Windows. Any suggestions on how to output the columns without the b prefix and single quotes?
Solution
The 'b' prefix indicates a Python 3 bytes literal, which represents a bytes object rather than a Unicode string. So, if you want to remove the prefix, you can decode each bytes object with its decode method before saving to a CSV file:
data_frame['COLUMN1'] = data_frame['COLUMN1'].apply(lambda s: s.decode('utf-8'))
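If several columns were converted, decoding them one by one gets repetitive. The following is a minimal sketch of one way to decode every bytes column just before writing. The helper name decode_bytes_columns and the extra COLUMN3 are made up for illustration, and the bytes values are written as literals to stand in for what .astype('|S') produced in the question. It writes to an in-memory buffer with index=False to keep the output short, which differs from the original to_csv call.

```python
import io

import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df in which every all-bytes column is decoded to str.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        non_null = col.dropna()
        # Only touch columns whose non-null values are all bytes objects.
        if len(non_null) > 0 and non_null.map(lambda v: isinstance(v, bytes)).all():
            out[name] = col.str.decode(encoding)
    return out


# Bytes literals stand in for columns produced by .astype('|S').
data_frame = pd.DataFrame(
    {"COLUMN1": [b"00001234"], "COLUMN2": [b"Source"], "COLUMN3": [1]}
)

buffer = io.StringIO()
decode_bytes_columns(data_frame).to_csv(buffer, sep="\t", index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the helper returns a decoded copy, the original DataFrame can keep its bytes columns in memory and only the copy passed to to_csv holds ordinary strings.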