CsvSplit

July 30, 2019

I used a finite-state machine to do the splitting. It has four states: START at the beginning of a new field, INFIELD within a non-quoted field, INQUOTES within a quoted field, and MAYBEDOUBLE after a double-quote is seen while in INQUOTES, when that quote may either close the field or be the first half of a doubled quote. Here is the program, complete with a test harness:

$ gawk '
> BEGIN { while (getline str > 0) printfields(str) }
>
> function printfields(str) {
>     n = csvsplit(str, arr)
>     for (k = 1; k <= n; k++)
>         print k, arr[k] }
>
> function csvsplit(str, arr,     i,j,n,s,fs,qt) {
>     # split comma-separated fields into arr
>     # return number of fields in arr
>     # in Europe, change fs from "," to ";"
>     # fields surrounded by double-quotes may
>     #   contain commas; doubled double-quotes
>     #   represent a single embedded quote
>     delete arr
>     s = "START"; n = 0; fs = ","; qt = "\""
>     for (i = 1; i <= length(str); i++) {
>         if (s == "START") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = "" }
>             else if (substr(str,i,1) == qt) {
>                 j = i+1; s = "INQUOTES" }
>             else { j = i; s = "INFIELD" } }
>         else if (s == "INFIELD") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j)
>                 j = 0; s = "START" } }
>         else if (s == "INQUOTES") {
>             if (substr(str,i,1) == qt) {
>                 s = "MAYBEDOUBLE" } }
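>         else if (s == "MAYBEDOUBLE") {
>             if (substr(str,i,1) == fs) {
>                 arr[++n] = substr(str,j,i-j-1)
>                 gsub(qt qt, qt, arr[n])
>                 j = 0; s = "START" }
>             else if (substr(str,i,1) == qt) {
>                 # second quote of a doubled pair:
>                 #   we are still inside the quoted field
>                 s = "INQUOTES" } } }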
>     if (s == "INFIELD" || s == "INQUOTES") {
>         arr[++n] = substr(str,j) }
>     else if (s == "MAYBEDOUBLE") {
>         arr[++n] = substr(str,j,length(str)-j)
>         gsub(qt qt, qt, arr[n]) }
>     else if (s == "START") { arr[++n] = "" }
>     return n }
> '
abc,123,"de""fgh",456,"ij,klm"
1 abc
2 123
3 de"fgh
4 456
5 ij,klm
"ij,klm","de""fgh"
1 ij,klm
2 de"fgh
CTRL-D

As in the past, I had trouble with WordPress and comparison operators, so look at the ideone source if that doesn’t come out right. You can run the program at https://ideone.com/5VvGoH.
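For readers who want to exercise the four states on edge cases, here is a minimal sketch of the same state machine written a little differently: it builds each field character by character instead of slicing with substr. It is not the program above. The names csv_fields and csvsplit2 are hypothetical, and the sketch assumes a POSIX awk is available as awk.

```sh
#!/bin/sh
# csv_fields: hypothetical helper that prints "index field" for every
# field of every input line, using the four states described in the post.
csv_fields() {
    awk '
    function csvsplit2(str, arr,    i, c, n, s, cur, fs, qt) {
        split("", arr)                      # portable way to empty arr
        s = "START"; n = 0; cur = ""; fs = ","; qt = "\""
        for (i = 1; i <= length(str); i++) {
            c = substr(str, i, 1)
            if (s == "START") {
                if (c == fs) { arr[++n] = "" }
                else if (c == qt) { s = "INQUOTES" }
                else { cur = c; s = "INFIELD" }
            } else if (s == "INFIELD") {
                if (c == fs) { arr[++n] = cur; cur = ""; s = "START" }
                else { cur = cur c }
            } else if (s == "INQUOTES") {
                if (c == qt) { s = "MAYBEDOUBLE" }
                else { cur = cur c }
            } else {                        # MAYBEDOUBLE
                if (c == qt) { cur = cur qt; s = "INQUOTES" }
                else if (c == fs) { arr[++n] = cur; cur = ""; s = "START" }
                # any other character after a closing quote is ignored
            }
        }
        arr[++n] = cur                      # last field, possibly empty
        return n
    }
    { n = csvsplit2($0, f); for (k = 1; k <= n; k++) print k, f[k] }
    '
}

# A quoted field in which a doubled quote is followed by a comma.
printf '%s\n' 'a,"b"",c",d' | csv_fields
```

The sample line is the interesting case: after the doubled quote the machine must return to INQUOTES, so b",c comes out as a single field between a and d.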

Posted by programmingpraxis

Filed in Exercises


2 Responses to “CsvSplit”

John Cowan said
July 30, 2019 at 6:23 PM
The only problem is that newlines are allowed within a double-quoted field, at least by some programs as well as by RFC 4180, the nearest thing to a standard. So awk’s line-by-line model really doesn’t work without great pain.

programmingpraxis said
July 30, 2019 at 6:27 PM
That’s correct. If you need that functionality, the previous exercise linked in the task description provides it. But the current exercise provides a function that is useful in a large percentage of cases.
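One way to see what the embedded-newline case involves is the rough sketch below. It is not the method of the previous exercise. It glues physical lines together until the number of double-quotes read so far is even, which assumes well-formed input where quotes appear only around fields or as doubled pairs. The name join_records and the <NL> display marker are hypothetical.

```sh
#!/bin/sh
# join_records: hypothetical helper that merges physical lines into
# logical CSV records and prints one numbered record per output line.
join_records() {
    awk '
    {
        # Append this physical line to the record being assembled.
        rec = (inq ? rec "\n" : "") $0
        # An odd number of double-quotes means a quoted field is still open.
        tmp = rec
        inq = (gsub(/"/, "", tmp) % 2 == 1)
        if (!inq) { emit(rec) }
    }
    END { if (inq) { emit(rec) } }   # unterminated quote: emit what was read

    function emit(r) {
        gsub(/\n/, "<NL>", r)        # display only: make embedded newlines visible
        print ++count ": " r
    }
    '
}

# Three physical lines, two logical records.
printf 'a,"line one\nline two",b\nc,d\n' | join_records
```

Each completed record could then be handed to a splitter such as csvsplit. A stray quote inside an unquoted field would defeat the parity check, so this only illustrates the idea.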
