Grep-CSV
April 9, 2019
Regular readers of this blog know that, in my day job, I frequently process input files from vendors; almost always, they were created in Excel and arrive in CSV format. Sometimes I have to peek inside the files, looking for invalid data, and I have commonly used grep for that task. Sometimes grep gives me unwanted records, because there is a match in some field that is not the field of interest, and I just ignore the extra records. But the other day I had a mess, with lots of unwanted records, so I used awk to parse out the fields and find the records of interest.
I realized as I was performing that task that it would be useful to have a version of grep that understood the CSV file format. So I wrote grep-csv that takes a field number (counting from 1, like awk) and a regular expression and returns the matching rows of a CSV file.
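To illustrate the difference, here is a minimal Ruby sketch contrasting a whole-line match with a field-aware one. The sample data and the helper names `line_matches` and `field_matches` are made up for illustration. The sketch assumes Ruby's standard `csv` library and that no field contains an embedded newline.

```ruby
require 'csv'

# Hypothetical sample data: "Smith" appears in field 1 of one record
# and in field 3 of another.
SAMPLE = <<~DATA
  Smith,Anna,"12 Oak Street"
  Jones,Bob,"4 Smith Avenue"
  Brown,Cy,"9 Elm Road"
DATA

# Plain grep behaviour: keep every line in which the pattern matches anywhere.
def line_matches(text, pattern)
  text.lines.select { |line| line.match?(pattern) }
end

# Field-aware behaviour: keep lines whose field_no-th field (counting from 1)
# matches the pattern. Assumes each record sits on a single line.
def field_matches(text, field_no, pattern)
  text.lines.select do |line|
    fields = CSV.parse_line(line) || []   # guard against a nil result
    fields[field_no - 1].to_s.match?(pattern)
  end
end

# Whole-line matching picks up both the Smith and the Jones records.
puts line_matches(SAMPLE, /Smith/).length
# Field-aware matching on field 1 keeps only the Smith record.
puts field_matches(SAMPLE, 1, /Smith/).join
```

The Jones record is the kind of unwanted match described above: the pattern occurs in the address field, not in the name field being checked.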
Your task is to write a grep-csv program. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Quickie one in Ruby.
require 'csv'

# Return the lines of csv_str whose col-th field (counting from 1)
# matches regex_str. Each line is parsed as one CSV record.
def grep_csv(csv_str, col, regex_str)
  regex = Regexp.new(regex_str)
  csv_str
    .lines
    .select { |line| CSV.parse_line(line)[col-1] =~ regex }
    .join
end

puts grep_csv(File.read(ARGV[0]), ARGV[1].to_i, ARGV[2])

Given that the example.csv file looks like this:

When we run the program like so:
The output is:
A Haskell version. It uses the Cassava library for parsing and printing CSV
files, pcre-light along with the pcre-heavy front-end for regular expressions,
and optparse-applicative for argument parsing.
The program allows skipping a “header” record, and handles UTF-8 content and
fields that span multiple lines. (In the example below, we match on a word
that appears in the second line of a field.)
The data are lines from poems, taken from UTF-8 SAMPLER.
import qualified Data.ByteString.Lazy as LB
import qualified Data.ByteString.UTF8 as UB
import Data.Csv.Incremental (encode, encodeRecord)
import Data.Csv.Streaming (HasHeader(..), Records(..), decode)
import Data.Functor ((<&>))
import Data.Maybe (maybe)
import qualified Data.Text as T
import Data.Vector ((!?), Vector)
import Options.Applicative
import System.IO (hPutStrLn, stderr)
import Text.Read (readEither)
import Text.Regex.PCRE.Heavy ((=~), Regex, compileM)
import Text.Regex.PCRE.Light (utf8)

data Matcher = Matcher Int Regex

type Record = Vector T.Text

stream :: (String -> IO ()) -> (Record -> IO ()) -> Records Record -> IO ()
stream bad good (Cons res recs) = either bad good res *> stream bad good recs
stream bad _ (Nil res _) = maybe (pure ()) bad res

matchRecord :: Matcher -> Record -> IO ()
matchRecord (Matcher n regex) record =
  case record !? n <&> (=~ regex) of
    Just True -> write record
    _ -> pure ()
  where
    write = LB.putStr . encode . encodeRecord

badRecord :: String -> IO ()
badRecord = hPutStrLn stderr

argParser :: Parser (Matcher, HasHeader)
argParser = (\hdr n re -> (Matcher n re, hdr))
  <$> flag NoHeader HasHeader
        (long "skip-header"
         <> short 's'
         <> help "Skip the first, header record in the CSV file")
  <*> option (eitherReader fieldParser)
        (long "field"
         <> short 'f'
         <> metavar "N"
         <> help "The 1-based field number to match")
  <*> option (eitherReader regexParser)
        (long "regex"
         <> short 'r'
         <> metavar "RE"
         <> help "The regular expression to match against the field")
  where
    fieldParser f = readEither f >>= \n ->
      if n < 1
        then Left "the field number, N, must be >= 1"
        else Right (n-1)
    regexParser re = compileM (UB.fromString re) [utf8]

main :: IO ()
main = do
  (matcher, hdr) <-
    customExecParser (prefs $ showHelpOnEmpty <> showHelpOnError)
      $ info (argParser <**> helper)
             (fullDesc <> header "A program to print matching CSV lines.")
  LB.getContents >>= (stream badRecord (matchRecord matcher) . decode hdr)
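For comparison, the two features mentioned above (skipping a header record and matching inside a field that spans several lines) can also be sketched in Ruby. This is only a rough sketch, not a port of the Haskell program: the function name `grep_csv_records`, the `skip_header:` keyword, and the sample data are all hypothetical, and it reads the whole input into memory rather than streaming it.

```ruby
require 'csv'

# Return the records of csv_text whose field_no-th field (counting from 1)
# matches pattern, re-encoded as CSV. Parsing the whole text at once, rather
# than line by line, keeps quoted fields that span several lines intact.
def grep_csv_records(csv_text, field_no, pattern, skip_header: false)
  rows = CSV.parse(csv_text)
  rows = rows.drop(1) if skip_header   # drop the header record if requested
  rows
    .select { |row| row[field_no - 1].to_s.match?(pattern) }
    .map { |row| CSV.generate_line(row) }
    .join
end

# Hypothetical sample: the last record's second field spans two lines.
SAMPLE_POEMS = <<~DATA
  id,text
  1,"single line"
  2,"first line
  second line with target"
DATA

# Matches on a word that appears in the second line of the field.
puts grep_csv_records(SAMPLE_POEMS, 2, /target/, skip_header: true)
```

Splitting the input on newlines first would cut the multi-line record in two, so here the CSV parser decides where each record ends.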