Comma-Separated Values

March 17, 2009

The comma-separated values file format is commonly used to exchange tabular information, such as spreadsheets, between different programs and operating systems. Though common, the csv format has never been formally defined, and there are variations. One definition is at

We will define a csv file as containing ascii text, with records terminated by a line-termination sequence and fields containing zero or more characters separated by commas; a line-termination sequence is a carriage-return character, or a line-feed character, or both characters in either order. Leading and trailing space in unquoted fields is preserved. Fields may be surrounded by double-quote characters (ascii \042) may contain any characters; such fields may contain newlines, literal commas (ascii \054), and double-quote characters represented as two successive double-quotes. The field separator is normally a comma, but may be changed to an arbitrary character (often a semi-colon) in those European countries that use a comma instead of a decimal point.

The examples below show some of the kinky situations that may arise; for readability, the three-character sequence TAB represents a single tab character (ascii \011):

1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
4, abc,def , ghi ,strings with whitespace
5, "abc","def" , "ghi" ,quoted strings with whitespace
6, 123,456 , 789 ,numbers with whitespace
7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
8, -123, +456, 1E3,more numbers with whitespace
9,123 456,123"456, 123 456 ,strange numbers
10,abc",de"f,g"hi,embedded quotes
11,"abc""","de""f","g""hi",quoted embedded quotes
12,"","" "",""x"",doubled quotes
13,"abc"def,abc"def","abc" "def",strange quotes
14,,"", ,empty fields
ghi",jkl,embedded newline
16,abc,"def",789,multiple types of fields

Each of those records has five fields, shown below with the vertical bar character as a field separator:

1|abc|def ghi|jkl|unquoted character strings
2|abc|def ghi|jkl|quoted character strings
4| abc|def | ghi |strings with whitespace
5| "abc"|def | "ghi" |quoted strings with whitespace
6| 123|456 | 789 |numbers with whitespace
7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
8| -123| +456| 1E3|more numbers with whitespace
9|123 456|123"456| 123 456 |strange numbers
10|abc"|de"f|g"hi|embedded quotes
11|abc"|de"f|g"hi|quoted embedded quotes
12|| ""|x""|doubled quotes
13|abcdef|abc"def"|abc "def"|strange quotes
14||| |empty fields
ghi|jkl|embedded newline
16|abc|def|789|multiple types of fields

Write a function that retrieves records from a csv file. Be sure to provide a way to handle user-specified field-separators.


Pages: 1 2

6 Responses to “Comma-Separated Values”

  1. Connochaetes said

    A quick and dirty Ruby hack:

    def separated_vals(filename, separator = ‘,’)
    lines =
    sep = Regexp.compile(“\\s*\\”+separator+”\\s*”) do |file_handle|
    lines << file_handle.gets.split(sep)
    return lines

  2. programmingpraxis said

    I don’t think that properly handles all the cases.

  3. Connochaetes said

    You’re quite right; it doesn’t quite work.

    Here’s a better version:

    def separated_vals(filename, separator = ‘,’)
    lines = do |file_handle|
    while not file_handle.eof? do
    lines << file_handle.gets.strip.split(separator)
    return lines

    This works for all test cases with one omission – it keeps quotes around quoted strings. Now, instead of hacking further, I think I’ll sit down and write a proper state machine :)

  4. FalconNL said

    Haskell (parses every test case correctly):

    import Control.Applicative ((<*), (*>), (<*>), (<$>))
    import Text.Parsec

    main = readFile “csv.txt” >>= print . parseCSV ‘,’

    parseCSV sep = parse (sepBy (line sep) (many1 $ oneOf “\r\n”)) “”
    line sep = sepBy (field sep) (char sep)
    field sep = option “” dropQuotes <+> (ws <+> (keepQuotes <|> plain sep)) <+> ws
    plain sep = many (noneOf $ sep:”\r\n”)
    dropQuotes = try (between’ “\”\”” quoted) <|> between’ “\”” quoted
    keepQuotes = (string “\”” <+> quoted <+> string “\””)
    quoted = many (noneOf “\”” <|> (head <$> try (string “\”\””)))
    ws = many (oneOf ” \t”)
    a <+> b = (++) <$> a <*> b
    between’ s p = string s *> p <* string s [/sourcecode]

  5. […] Reddit or Stack Overflow or someplace about handling CSV files with awk. We’ve done that in a previous exercise, but today I decided to handle CSV files in a different way. Specifically, I wrote an awk function […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: