Python – Pandas reads malformed CSV

Pandas reads malformed CSV… here is a solution to the problem.

Pandas reads malformed CSV

I received a CSV file where , is the delimiter used to separate the fields, but unfortunately appended the sign (German notation) for the decimal point.

Therefore, some rows will have different numbers of columns. Oddly enough, excel parses/reads files nicely. Is it also possible to read such files in Pindas? So far, I’ve only gotten something like

Error tokenizing data. C error: Expected 97 fields in line 3, saw 98

Edit

Here’s a minimal example:

pd.read_csv(os.path.expanduser('~/Downloads/foo.csv'), sep=',', decimal=',')

~/Downloads/foo.csv The file content is

first, number, third
some, 1, other
foo, 1.5, bar
baz, 1,5, some

When I load data in R

See spec(...) for full column specifications.
Warnung: 1538 parsing failures.
row col   expected      actual
  1  -- 93 columns 97 columns 
  2  -- 93 columns 98 columns 
  3  -- 93 columns 97 columns 
  4  -- 93 columns 102 columns
  5  -- 93 columns 99 columns 

Does pandas have this model of tolerance?

Solution

Make sure that there are no quote delimiters in your file that you should declare to read_csv.

If your file format is incorrect, there is no mathematical deterministic algorithm that can determine whether a series of characters with commas is two fields or only one comma-separated number.

You will have to write a preprocessor that uses a temporary algorithm close to the actual condition of the file to clean up malformed data. This can be bad, i.e. I’m assuming that a number followed by a comma followed by 3 digits is actually the same field and any other variation of these fixes.

You may also encounter a situation where even then you are not sure, then you have no choice but to go to the data source and request a data repair in another file format.

To remove the wrong line and load other lines, these parameters in the documentation will help:

error_bad_lines : boolean, default True Lines with too many fields
(e.g. a csv line with too many commas) will by default cause an
exception to be raised, and no DataFrame will be returned. If False,
then these “bad lines” will dropped from the DataFrame that is
returned. (Only valid with C parser)

warn_bad_lines : boolean, default
True If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output. (Only valid with C
parser).

Related Problems and Solutions