CsvSplit
July 30, 2019
I used a finite-state machine to do the splitting. The possible states are START at the beginning of a new field, INFIELD when within a non-quoted field, INQUOTES when within a quoted field, and MAYBEDOUBLE when a double-quote has just been seen inside a quoted field and it is not yet clear whether it closes the field or is the first half of a doubled quote. Here is the program, complete with a test harness:
$ gawk '
> BEGIN { while (getline str > 0) printfields(str) }
>
> function printfields(str) {
> n = csvsplit(str, arr)
> for (k=1; k<=n; k++)
> print k, arr[k] }
>
> function csvsplit(str, arr, i,j,n,s,fs,qt) {
> # split comma-separated fields into arr
> # return number of fields in arr
> # in Europe, change fs from "," to ";"
> # fields surrounded by double-quotes may
> # contain commas; doubled double-quotes
> # represent a single embedded quote
> delete arr
> s = "START"; n = 0; fs = ","; qt = "\""
> for (i = 1; i <= length(str); i++) {
> if (s == "START") {
> if (substr(str,i,1) == fs) {
> arr[++n] = "" }
> else if (substr(str,i,1) == qt) {
> j = i+1; s = "INQUOTES" }
> else { j = i; s = "INFIELD" } }
> else if (s == "INFIELD") {
> if (substr(str,i,1) == fs) {
> arr[++n] = substr(str,j,i-j)
> j = 0; s = "START" } }
> else if (s == "INQUOTES") {
> if (substr(str,i,1) == qt) {
> s = "MAYBEDOUBLE" } }
> else if (s == "MAYBEDOUBLE") {
> if (substr(str,i,1) == fs) {
> arr[++n] = substr(str,j,i-j-1)
> gsub(qt qt, qt, arr[n])
> j = 0; s = "START" }
> # anything else (normally a second quote) resumes the quoted field
> else { s = "INQUOTES" } } }
> if (s == "INFIELD" || s == "INQUOTES") {
> arr[++n] = substr(str,j) }
> else if (s == "MAYBEDOUBLE") {
> arr[++n] = substr(str,j,length(str)-j)
> gsub(qt qt, qt, arr[n]) }
> else if (s == "START") { arr[++n] = "" }
> return n }
> '
abc,123,"de""fgh",456,"ij,klm"
1 abc
2 123
3 de"fgh
4 456
5 ij,klm
"ij,klm","de""fgh"
1 ij,klm
2 de"fgh
CTRL-D
As in the past, I had trouble with WordPress and comparison operators, so look at the ideone source if that doesn’t come out right. You can run the program at https://ideone.com/5VvGoH.
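For readers who want to watch the state machine move, here is a minimal sketch that prints the state reached after each character of a line. The `trace_states` name is made up for this illustration, and sending MAYBEDOUBLE back to INQUOTES on anything other than a separator is an assumption about how a doubled quote ought to be handled:

```shell
# trace_states is a hypothetical helper, not part of the program above.
trace_states() {
    awk '
    {
        s = "START"
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (s == "START") {
                if (c == ",") s = "START"          # empty field
                else if (c == "\"") s = "INQUOTES"
                else s = "INFIELD"
            } else if (s == "INFIELD") {
                if (c == ",") s = "START"
            } else if (s == "INQUOTES") {
                if (c == "\"") s = "MAYBEDOUBLE"
            } else if (s == "MAYBEDOUBLE") {
                # assumption: a separator ends the field, anything else
                # (normally a second quote) resumes the quoted field
                if (c == ",") s = "START"
                else s = "INQUOTES"
            }
            print c, s
        }
    }'
}

printf '%s\n' 'a,"b""c"' | trace_states
```

For that input the states run INFIELD, START, INQUOTES, INQUOTES, MAYBEDOUBLE, INQUOTES, INQUOTES, MAYBEDOUBLE. Ending the line in MAYBEDOUBLE means the last quote closed the field.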
The only problem is that newlines are allowed within a double-quoted field, at least by some programs as well as by RFC 4180, the nearest thing to a standard. So awk’s line-by-line model really doesn’t work without great pain.
That’s correct. If you need that functionality, the previous exercise linked in the task description provides it. But the current exercise provides a function that is useful in a large percentage of cases.
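A rough sketch of one way to glue such lines back together in awk before splitting them, assuming well-formed input in which a record is incomplete exactly when it holds an odd number of double-quotes (a doubled quote counts twice, so it does not disturb the parity). The `join_records` and `nquotes` names are made up for illustration, and the code is not taken from the previous exercise:

```shell
# join_records is a hypothetical name; the odd-quote-count test is an
# assumption that holds only for well-formed input.
join_records() {
    awk '
    # gsub returns how many double-quotes it replaced in the copy t
    function nquotes(s,   t) { t = s; return gsub(/"/, "", t) }
    {
        rec = $0
        # an odd count means a quoted field is still open, so append the
        # next physical line along with the newline that awk stripped
        while (nquotes(rec) % 2 == 1 && (getline line) > 0)
            rec = rec "\n" line
        printf "record %d: [%s]\n", ++n, rec
    }'
}

printf '%s\n' 'a,"line one' 'line two",b' 'c,d' | join_records
```

The three physical lines come out as two records, the first spanning two lines. Each joined record could then be handed to csvsplit, which treats the embedded newline like any other character.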