Python – How does Pandas.read_csv type conversion work?

How does Pandas.read_csv type conversion work?… here is a solution to the problem.

How does Pandas.read_csv type conversion work?

Using pandas.read_csv with parse_dates options and a custom date parser, I found that Pandas has its own idea of the type of data it’s reading.

Example csv:

"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"

The actual date cleaner is here, but what I did boiled down to:

import pandas as pd

def dateclean(date):
    return str(int(date)) # Note: we return A STRING

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

print(df.birth_date)       

Output:

0       NaN
1    1625.0
2    1533.0
Name: birth_date, dtype: float64

When I specify str, I get float64 type, even. Also, taking out the first row in the CSV, the one where birth_date is empty, I get the type int. The solution is simple:

return '"{}"'.format(int(date))

Is there a better way?

In data analysis, I can imagine that it’s useful for Pandas to say “Hey man, you thought you were reading strings, but in reality they’re numbers”. But what was the reason for vetoing me when I told it not to do so?

Solution

Using parse_dates/date_parser seems a bit complicated to me, unless you want to generalize your import on many date columns. I think you can have more control with the converters parameter, where you can install the dateclean() function. You can also try using the dtype parameter.

The problem with the original dateclean() function is that it fails on the “” value because int("") throws a value error. Pandas seems to resort to standard import when it encounters this issue, but it explicitly fails due to converters.

Here’s the code that demonstrates the fix:

import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""

Path('my.csv').write_text(doc)

def dateclean(date):
    try: 
       return str(int(date)) 
    except ValueError:
       return '' 

df = pd.read_csv(
        'my.csv', 
        parse_dates=['birth_date'],
        date_parser=dateclean, 
        engine='python'
        )

df2 = pd.read_csv(
        'my.csv', 
        converters = {'birth_date': dateclean}
        )

print(df2.birth_date)  

Hope this helps.

Related Problems and Solutions