CsvSplit
July 30, 2019
I used a finite-state machine to do the splitting. The possible states are START at the beginning of a new field, INFIELD when within a non-quoted field, INQUOTES when within a quoted field, and MAYBEDOUBLE when the current state is INQUOTES and a double-quote is seen. Here is the program, complete with a test harness:
$ gawk '
> BEGIN { while (getline str > 0) printfields(str) }
>
> function printfields(str) {
>     for (k=1; k<=csvsplit(str,arr); k++)
>         print k, arr[k] }
>
> function csvsplit(str, arr,    i,j,n,s,fs,qt) {
>     # split comma-separated fields into arr
>     # return number of fields in arr
>     # in Europe, change fs from "," to ";"
>     # fields surrounded by double-quotes may
>     # contain commas; doubled double-quotes
>     # represent a single embedded quote
>     delete arr
>     s = "START"; n = 0; fs = ","; qt = "\""
>     for (i = 1; i <= length(str); i++) {
>         if (s == "START") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = "" }
>             else if (substr(str,i,1) == qt) {
>                 j = i+1; s = "INQUOTES" }
>             else { j = i; s = "INFIELD" } }
>         else if (s == "INFIELD") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j)
>                 j = 0; s = "START" } }
>         else if (s == "INQUOTES") {
>             if (substr(str,i,1) == qt) {
>                 s = "MAYBEDOUBLE" } }
>         else if (s == "MAYBEDOUBLE") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j-1)
>                 gsub(qt qt, qt, arr[n])
>                 j = 0; s = "START" }
>             # a second quote is a doubled quote, so the
>             # field continues and may still contain commas
>             else if (substr(str,i,1) == qt) {
>                 s = "INQUOTES" } } }
>     if (s == "INFIELD" || s == "INQUOTES") {
>         arr[++n] = substr(str,j) }
>     else if (s == "MAYBEDOUBLE") {
>         arr[++n] = substr(str,j,length(str)-j)
>         gsub(qt qt, qt, arr[n]) }
>     else if (s == "START") { arr[++n] = "" }
>     return n }
> '
abc,123,"de""fgh",456,"ij,klm"
1 abc
2 123
3 de"fgh
4 456
5 ij,klm
"ij,klm","de""fgh"
1 ij,klm
2 de"fgh
CTRL-D
As in the past, I had trouble with WordPress and comparison operators, so look at the ideone source if that doesn’t come out right. You can run the program at https://ideone.com/5VvGoH.
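To watch the four states in action, it can help to print the state reached after each character. The following is only a sketch: csvtrace is a hypothetical helper name, it uses plain awk rather than gawk, and it assumes that a second double-quote seen in MAYBEDOUBLE returns the machine to INQUOTES, which is one reading of how a doubled quote is resolved.

```shell
# Hypothetical helper: print each character of $1 with the state reached after it.
csvtrace() {
    printf '%s\n' "$1" | awk '
    {
        s = "START"; fs = ","; qt = "\""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (s == "START") {
                if (c == fs) s = "START"            # empty field
                else if (c == qt) s = "INQUOTES"    # opening quote
                else s = "INFIELD" }
            else if (s == "INFIELD") {
                if (c == fs) s = "START" }
            else if (s == "INQUOTES") {
                if (c == qt) s = "MAYBEDOUBLE" }
            else if (s == "MAYBEDOUBLE") {
                if (c == fs) s = "START"            # the quote closed the field
                else if (c == qt) s = "INQUOTES" }  # assumed: doubled quote
            print c, s
        }
    }'
}
```

Calling csvtrace 'a,"b""c"' should print the states INFIELD, START, INQUOTES, INQUOTES, MAYBEDOUBLE, INQUOTES, INQUOTES, MAYBEDOUBLE, one per character. The input ends in MAYBEDOUBLE, which is why the end-of-string handling needs a case for that state.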
The only problem is that newlines are allowed within a double-quoted field, at least by some programs as well as by RFC 4180, the nearest thing to a standard. So awk’s line-by-line model really doesn’t work without great pain.
That’s correct. If you need that functionality, the previous exercise linked in the task description provides it. But the current exercise provides a function that is useful in a large percentage of cases.
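For readers who want to stay in awk, one common workaround for embedded newlines is to glue physical lines together until the double-quotes balance, and only then split the record. The sketch below is not the solution from the previous exercise. csvrecords is a hypothetical helper name, the input is assumed to be well-formed CSV, and the embedded line break is shown as the marker <NL> purely so that each logical record prints on one line.

```shell
# Hypothetical helper: join physical lines into logical CSV records.
# A record is treated as complete once it holds an even number of double-quotes.
csvrecords() {
    awk '
    {
        # append this physical line to any unfinished record
        rec = (pending ? rec "\n" : "") $0
        tmp = rec
        # gsub returns the number of quotes it replaced
        pending = gsub(/"/, "", tmp) % 2
        if (!pending) {
            shown = rec
            gsub(/\n/, "<NL>", shown)   # display only; rec keeps the real newline
            print shown }
    }
    END { if (pending) print rec }      # unterminated quote: emit what is left
    '
}
```

For example, printf 'a,"b\nc",d\ne,f\n' | csvrecords should print a,"b<NL>c",d followed by e,f. A doubled quote adds two to the count, so it does not disturb the parity test.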