Linux – How to format with multiple commas in a string field . Date fields for CSV files

How to format with multiple commas in a string field . Date fields for CSV files… here is a solution to the problem.

How to format with multiple commas in a string field . Date fields for CSV files

I have a . CSV file (file.csv whose data is all enclosed in double quotation marks. An example of a file format is as follows:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

The 9th field is a date field in the format “DD-MMM-YY”. I had to convert it to YYYY/MM/DD format. I’m trying the following code, but it doesn’t work.

awk -F, '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 print
 }' OFS="," file.csv > temp_file.csv

After executing the above code, the output of the file temp_file.csv is shown below.

column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

As far as I can tell, the problem is with commas in double quotes, since my code also takes them into account… Please make suggestions on the following questions:

1) Is there any difference between all values in all fields in double quotes? If they differ in any way, how do I remove them from all values except the string with a comma in it?
2) Make any modifications to my code so that I can format the 9th field formatted as “DD-MMM-YYYY” as YYYY/MM/DD

Solution

I highly recommend using the correct CSV parser. For example, using Text::CSV_XS will do the job in the right and sensible way in Perl. For example this single line:

perl -MText::CSV_XS -E'$csv=Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1}); @m=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC); @m{@m}=(1 .. @m); while(my $row=$csv->getline(ARGV)){($d,$m,$y)=split("-",$row->[8]); $row->[8]=sprintf"%02d/%02d/%04d",$d,$m{$m},$y if $m{$m}; $csv->print(STDOUT, $row)}' file.csv > temp_file.csv

Related Problems and Solutions