CsvSplit

July 30, 2019

I used a finite-state machine to do the splitting. The possible states are START at the beginning of a new field, INFIELD when within a non-quoted field, INQUOTES when within a quoted field, and MAYBEDOUBLE when the current state is INQUOTES and a double-quote is seen. Here is the program, complete with a test harness:

$ gawk '
> BEGIN { while (getline str > 0) printfields(str) }
>
> function printfields(str) {
>     n = csvsplit(str, arr)
>     for (k=1; k<=n; k++)
>         print k, arr[k] }
>
> function csvsplit(str, arr,     i,j,n,s,fs,qt) {
>     # split comma-separated fields into arr
>     # return number of fields in arr
>     # in Europe, change fs from "," to ";"
>     # fields surrounded by double-quotes may
>     #   contain commas; doubled double-quotes
>     #   represent a single embedded quote
>     delete arr
>     s = "START"; n = 0; fs = ","; qt = "\""
>     for (i = 1; i <= length(str); i++) {
>         if (s == "START") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = "" }
>             else if (substr(str,i,1) == qt) {
>                 j = i+1; s = "INQUOTES" }
>             else { j = i; s = "INFIELD" } }
>         else if (s == "INFIELD") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j)
>                 j = 0; s = "START" } }
>         else if (s == "INQUOTES") {
>             if (substr(str,i,1) == qt) {
>                 s = "MAYBEDOUBLE" } }
>         else if (s == "MAYBEDOUBLE") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j-1)
>                 gsub(qt qt, qt, arr[n])
>                 j = 0; s = "START" }
>             # anything else: still inside the quoted field
>             else { s = "INQUOTES" } } }
>     if (s == "INFIELD" || s == "INQUOTES") {
>         arr[++n] = substr(str,j) }
>     else if (s == "MAYBEDOUBLE") {
>         arr[++n] = substr(str,j,length(str)-j)
>         gsub(qt qt, qt, arr[n]) }
>     else if (s == "START") { arr[++n] = "" }
>     return n }
> '
abc,123,"de""fgh",456,"ij,klm"
1 abc
2 123
3 de"fgh
4 456
5 ij,klm
"ij,klm","de""fgh"
1 ij,klm
2 de"fgh
CTRL-D
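
To watch the machine move between states, a small tracing sketch can help. It is not part of the program above: csv_trace is a hypothetical helper name, it uses plain awk rather than gawk, and it assumes that any character other than the field separator seen in MAYBEDOUBLE sends the machine back to INQUOTES. Each output line shows the state the machine is in just before it consumes a character.

```shell
# Hypothetical helper (not from the original program): trace the states.
csv_trace() {
    printf '%s\n' "$1" | awk '
    {
        s = "START"; fs = ","; qt = "\""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            printf "%s %s\n", s, c        # state before consuming c
            if (s == "START") {
                if (c == fs) s = "START"
                else if (c == qt) s = "INQUOTES"
                else s = "INFIELD"
            } else if (s == "INFIELD") {
                if (c == fs) s = "START"
            } else if (s == "INQUOTES") {
                if (c == qt) s = "MAYBEDOUBLE"
            } else if (s == "MAYBEDOUBLE") {
                # assumption: a non-separator means we are still quoted
                if (c == fs) s = "START"
                else s = "INQUOTES"
            }
        }
        print s, "(end)"                  # state after the last character
    }'
}

csv_trace 'a,"b""c"'
```

For that input the trace ends in MAYBEDOUBLE, which is why the end-of-line handling in csvsplit has a separate branch for that state.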

As in the past, I had trouble with WordPress and comparison operators, so look at the ideone source if that doesn’t come out right. You can run the program at https://ideone.com/5VvGoH.
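
Reading one line at a time means a quoted field cannot contain a newline, which RFC 4180 permits. One possible workaround, sketched here under the assumption that double-quotes are balanced within every complete record, is to keep appending input lines while the record so far contains an odd number of double-quotes. The name csv_join_records and the <NL> marker are made up for illustration; a real harness would pass rec to csvsplit instead of printing it.

```shell
# Hypothetical helper: glue physical lines into logical CSV records.
csv_join_records() {
    awk '
    {
        rec = $0
        tmp = rec
        # gsub returns the number of substitutions, so this counts quotes;
        # an odd count means a quoted field is still open
        while (gsub(/"/, "", tmp) % 2 == 1 && (getline line) > 0) {
            rec = rec "\n" line
            tmp = rec
        }
        n++
        gsub(/\n/, "<NL>", rec)   # make the embedded newline visible
        print n ": " rec
    }'
}

printf 'a,"x\ny",b\nc,d\n' | csv_join_records
# 1: a,"x<NL>y",b
# 2: c,d
```

Counting quotes works because a doubled quote adds two to the total, so only an unclosed field leaves the count odd.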

2 Responses to “CsvSplit”

  1. John Cowan said

    The only problem is that newlines are allowed within a double-quoted field, at least by some programs as well as by RFC 4180, the nearest thing to a standard. So awk’s line-by-line model really doesn’t work without great pain.

  2. programmingpraxis said

    That’s correct. If you need that functionality, the previous exercise linked in the task description provides it. But the current exercise provides a function that is useful in a large percentage of cases.
