How does Pandas.read_csv type conversion work?
Using pandas.read_csv
with parse_dates
options and a custom date parser, I found that Pandas has its own idea of the type of data it’s reading.
Example csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual date cleaner is here, but what I did boiled down to:
import pandas as pd
def dateclean(date):
return str(int(date)) # Note: we return A STRING
df = pd.read_csv(
'my.csv',
parse_dates=['birth_date'],
date_parser=dateclean,
engine='python'
)
print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
When I specify str
, I get float64
type, even. Also, taking out the first row in the CSV, the one where birth_date is empty, I get the type int
. The solution is simple:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine that it’s useful for Pandas to say “Hey man, you thought you were reading strings, but in reality they’re numbers”. But what was the reason for vetoing me when I told it not to do so?
Solution
Using parse_dates
/date_parser
seems a bit complicated to me, unless you want to generalize your import on many date columns. I think you can have more control with the converters
parameter, where you can install the dateclean()
function. You can also try using the dtype
parameter.
The problem with the original dateclean()
function is that it fails on the
“” value because int("")
throws a value error
. Pandas seems to resort to standard import when it encounters this issue, but it explicitly fails due to converters
.
Here’s the code that demonstrates the fix:
import pandas as pd
from pathlib import Path
doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)
def dateclean(date):
try:
return str(int(date))
except ValueError:
return ''
df = pd.read_csv(
'my.csv',
parse_dates=['birth_date'],
date_parser=dateclean,
engine='python'
)
df2 = pd.read_csv(
'my.csv',
converters = {'birth_date': dateclean}
)
print(df2.birth_date)
Hope this helps.