Text File Databases: Part 1

October 19, 2010

A great deal of data is stored in plain ASCII text files consisting of records separated by newlines, each record consisting of multiple fields, and it is useful to have a function library for dealing with such files. This exercise looks at functions for reading the data; the next exercise will look at functions for processing it.

We will consider four common types of text file databases. A file with fixed-length data fields has records of a fixed number of characters, each record containing fields that are likewise in fixed positions; the data may be preceded by a fixed-length header. A file with character-delimited fields has variable-length records, each with fields separated by a single-character delimiter; the delimiter is often a tab or vertical bar. A particular type of variable-length delimited text database is known as comma-separated values, where the delimiter is a comma and fields may be surrounded by double-quote characters so that a comma within a quoted field loses its meaning as a field separator; in that case, a literal double-quote character may appear within a quoted field as two double-quote characters in succession. The fourth type that we will consider is the name-value record, where each record consists of multiple fields, one field per line, and records are separated from each other by blank lines; each field consists of a type-name and a value separated by a delimiter. This format is often used for databases that have many optional fields, such as bibliographic databases.
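To make the quoting rule concrete, here is a minimal sketch in Haskell of splitting a single record into fields, once for a character-delimited record and once for a comma-separated record. The names splitDelim and splitCsv are made up for this illustration, the input is assumed to be one record with its line marker already removed, and malformed input (such as an unterminated quote) is simply tolerated rather than reported.

```haskell
-- Split a record on a single-character delimiter (tab, vertical bar, ...).
-- splitDelim is a hypothetical name used only for this sketch.
splitDelim :: Char -> String -> [String]
splitDelim d s = case break (== d) s of
  (f, _ : rest) -> f : splitDelim d rest
  (f, [])       -> [f]

-- Split a comma-separated record, honoring double-quoted fields.
-- splitCsv is likewise a hypothetical name.
splitCsv :: String -> [String]
splitCsv s = case field s of
  (f, ',' : rest) -> f : splitCsv rest
  (f, _)          -> [f]
  where
    -- A field is either quoted or bare; text after a closing quote
    -- and before the next comma is kept as part of the field.
    field ('"' : t) = let (f, rest)      = quoted t
                          (extra, rest') = break (== ',') rest
                      in (f ++ extra, rest')
    field t         = break (== ',') t
    -- Inside quotes, "" is a literal quote and a lone " ends the field.
    quoted ('"' : '"' : t) = let (f, rest) = quoted t in ('"' : f, rest)
    quoted ('"' : t)       = ("", t)
    quoted (c : t)         = let (f, rest) = quoted t in (c : f, rest)
    quoted []              = ("", [])  -- unterminated quote: keep what we have
```

With these definitions, splitDelim '|' "a|b||c" gives ["a","b","","c"], and splitCsv applied to the record a,"b,c","say ""hi""" gives the three fields a, b,c and say "hi".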

We want reader functions for each of these file formats that all return a single record each time they are called, or an end-of-file marker when the input is exhausted, and advance the file pointer to the beginning of the next record. The return value should be a list or array, whichever is convenient, containing the value of one field in each element, except for the name-value record, which should return a list of name/value pairs.
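One way to picture this contract, sketched here in Haskell under some simplifying assumptions, is a function from the remaining input to either nothing (the end-of-file marker) or a record paired with the rest of the input, which plays the role of the advanced file pointer. The Reader type and the function names below are invented for this illustration, and line markers are assumed to be a bare LF to keep the sketch short.

```haskell
import Data.List (unfoldr)

-- A reader returns Nothing when the input is exhausted, or Just the next
-- record together with the input that follows it. (Hypothetical type.)
type Reader a = String -> Maybe (a, String)

-- Read one line; assumes LF-only line markers for brevity.
line :: Reader String
line "" = Nothing
line s  = case break (== '\n') s of
  (l, _ : rest) -> Just (l, rest)
  (l, [])       -> Just (l, [])   -- final line without a trailing marker

-- Fixed-length reader: cut one line into fields of the given widths.
fixedLength :: [Int] -> Reader [String]
fixedLength widths s = do
  (l, rest) <- line s
  Just (cut widths l, rest)
  where
    cut [] _       = []
    cut (w : ws) l = let (f, l') = splitAt w l in f : cut ws l'

-- Name-value reader: one name/value field per line, returned as pairs;
-- a blank line or the end of the input ends the record.
nameValue :: Char -> Reader [(String, String)]
nameValue sep s
  | null s    = Nothing
  | otherwise = Just (go s)
  where
    go t = case line t of
      Nothing         -> ([], "")
      Just ("", rest) -> ([], rest)
      Just (l, rest)  ->
        let (name, value)   = break (== sep) l
            (fields, rest') = go rest
        in ((name, drop 1 value) : fields, rest')

-- Call a reader repeatedly until it signals the end of the input.
readAll :: Reader a -> String -> [a]
readAll = unfoldr
```

For example, readAll (fixedLength [2,3]) "abcde\nfghij" yields [["ab","cde"],["fg","hij"]]. Threading the remaining input through explicitly is only one design choice; a version that works on a file handle would follow the same shape.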

Different operating systems have different methods of signalling the end of a line. For maximum portability, your functions should accept the end of a line indicated by a carriage return, a line feed, or both characters in either order. You should be prepared to accept any type of line marker because the data may come from any source; for instance, your computer running MS Windows with a CRLF line marker may fetch data from a Linux computer with a bare LF for the line marker. You should also accept the final line in the file whether or not it has a trailing line marker.
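As a rough illustration of these rules, the following Haskell sketch splits text into lines, treating a carriage return, a line feed, or the two together in either order as a single line marker, and accepting a final line with or without one. The name splitLines is invented here. Note one inherent ambiguity: a line feed immediately followed by a carriage return is read as one marker, not as a blank line between two markers.

```haskell
-- Split text into lines, accepting CR, LF, CRLF, or LFCR as the line
-- marker. splitLines is a hypothetical name used only for this sketch.
splitLines :: String -> [String]
splitLines "" = []
splitLines s  = l : splitLines (dropMarker rest)
  where
    -- Take everything up to the first CR or LF.
    (l, rest) = break (`elem` "\r\n") s
    -- Drop one line marker: a two-character pair if present, else one char.
    dropMarker ('\r' : '\n' : t) = t
    dropMarker ('\n' : '\r' : t) = t
    dropMarker (_ : t)           = t
    dropMarker []                = []   -- final line had no marker
```

For instance, splitLines "a\r\nb\nc\rd\n\re" gives ["a","b","c","d","e"], and splitLines "a\n\nb" keeps the blank line, giving ["a","","b"].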

Your task is to write functions to read one record from each of the four file types described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

2 Responses to “Text File Databases: Part 1”

Programming Praxis – Text File Databases: Part 1 « Bonsai Code said
October 19, 2010 at 1:05 PM
[…] today’s Programming Praxis exercise our goal is to read data from four different types of text file […]

Remco Niemeijer said

October 19, 2010 at 1:05 PM

My Haskell solution (see http://bonsaicode.wordpress.com/2010/10/19/programming-praxis-text-file-databases-part-1/ for a version with comments):

import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

fixedLength :: [Int] -> Parser [String]
fixedLength fields = foldr (\n p -> (:) <$> count n anyChar <*> p)
                           (return []) fields <* eol

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol

readDB :: Parser a -> FilePath -> IO (Either ParseError [a])
readDB record = fmap (parse (manyTill record eof) "") . readFile
