Comma-Separated Values

March 17, 2009

The comma-separated values file format is commonly used to exchange tabular information, such as spreadsheets, between different programs and operating systems. Though common, the csv format has never been formally defined, and there are variations. One definition is at http://www.rfc-editor.org/rfc/rfc4180.txt.

We will define a csv file as containing ascii text, with records terminated by a line-termination sequence and fields containing zero or more characters separated by commas; a line-termination sequence is a carriage-return character, or a line-feed character, or both characters in either order. Leading and trailing space in unquoted fields is preserved. Fields may be surrounded by double-quote characters (ascii \042) may contain any characters; such fields may contain newlines, literal commas (ascii \054), and double-quote characters represented as two successive double-quotes. The field separator is normally a comma, but may be changed to an arbitrary character (often a semi-colon) in those European countries that use a comma instead of a decimal point.

The examples below show some of the kinky situations that may arise; for readability, the three-character sequence TAB represents a single tab character (ascii \011):

1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
3,123,456,789,numbers
4, abc,def , ghi ,strings with whitespace
5, "abc","def" , "ghi" ,quoted strings with whitespace
6, 123,456 , 789 ,numbers with whitespace
7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
8, -123, +456, 1E3,more numbers with whitespace
9,123 456,123"456, 123 456 ,strange numbers
10,abc",de"f,g"hi,embedded quotes
11,"abc""","de""f","g""hi",quoted embedded quotes
12,"","" "",""x"",doubled quotes
13,"abc"def,abc"def","abc" "def",strange quotes
14,,"", ,empty fields
15,abc,"def
ghi",jkl,embedded newline
16,abc,"def",789,multiple types of fields

Each of those records has five fields, shown below with the vertical bar character as a field separator:

1|abc|def ghi|jkl|unquoted character strings
2|abc|def ghi|jkl|quoted character strings
3|123|456|789|numbers
4| abc|def | ghi |strings with whitespace
5| "abc"|def | "ghi" |quoted strings with whitespace
6| 123|456 | 789 |numbers with whitespace
7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
8| -123| +456| 1E3|more numbers with whitespace
9|123 456|123"456| 123 456 |strange numbers
10|abc"|de"f|g"hi|embedded quotes
11|abc"|de"f|g"hi|quoted embedded quotes
12|| ""|x""|doubled quotes
13|abcdef|abc"def"|abc "def"|strange quotes
14||| |empty fields
15|abc|def
ghi|jkl|embedded newline
16|abc|def|789|multiple types of fields

Write a function that retrieves records from a csv file. Be sure to provide a way to handle user-specified field-separators.

Pages: 1 2

Follow

Get every new post delivered to your Inbox.

Join 645 other followers