Grep-CSV

April 9, 2019

Regular readers of this blog know that. in my day job, I frequently process input files from vendors; almost always, they were created in Excel and arrive in CSV format. Sometimes I have to peek inside the files, looking for invalid data, and I have commonly used grep for that task. Sometimes grep gives me unwanted records, because there is a match in some field that is not the field of interested, and I just ignore the extra records. But the other day I had a mess, with lots of unwanted records, so I used awk to parse out the fields and find the records of interest.

I realized as I was performing that task that it would be useful to have a version of grep that understood the CSV file format. So I wrote grep-csv that takes a field number (counting from 1, like awk) and a regular expression and returns the matching rows of a CSV file.

Your task is to write a grep-csv program. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

2 Responses to “Grep-CSV”

  1. V said

    Quikie one in Ruby.

    require 'csv'
    
    def grep_csv(csv_str, col, regex_str)
      regex = Regexp.new(regex_str)
      csv_str
        .lines
        .select { |line| CSV.parse_line(line)[col-1] =~ regex }
        .join
    end
    
    puts grep_csv(File.read(ARGV[0]), ARGV[1].to_i, ARGV[2])
    
    

    Given the example.csv file look like this:

    Charles,Dickens,Great Expectations,1861
    Mark,Twain,The Adventures of Tom Sawyer,1876
    William,Shakespeare,Julius Caesar,1599
    Isaac,Newton,Philosophiae Naturalis Principia Mathematica,1687  
    

    When we run the program like so:

    ruby grep-csv.rb example.csv 2 "e.*s"
    

    The output is:

    Charles,Dickens,Great Expectations,1861
    William,Shakespeare,Julius Caesar,1599
    
  2. Globules said

    A Haskell version. It uses the Cassava library for parsing and printing CSV
    files, pcre-light along with the pcre-heavy front-end for regular expressions,
    and optparse-applicative for argument parsing.

    The program allows skipping a “header” record, and handles UTF-8 content and
    fields that span multiple lines. (In the example below, we match on a word
    that appears in the second line of a field.)

    The data are lines from poems, taken from UTF-8 SAMPLER.

    import qualified Data.ByteString.Lazy as LB
    import qualified Data.ByteString.UTF8 as UB
    import           Data.Csv.Incremental (encode, encodeRecord)
    import           Data.Csv.Streaming (HasHeader(..), Records(..), decode)
    import           Data.Functor ((<&>))
    import           Data.Maybe (maybe)
    import qualified Data.Text as T
    import           Data.Vector ((!?), Vector)
    import           Options.Applicative
    import           System.IO (hPutStrLn, stderr)
    import           Text.Read (readEither)
    import           Text.Regex.PCRE.Heavy ((=~), Regex, compileM)
    import           Text.Regex.PCRE.Light (utf8)
    
    data Matcher = Matcher Int Regex
    type Record  = Vector T.Text
    
    stream :: (String -> IO ()) -> (Record -> IO ()) -> Records Record -> IO ()
    stream bad good (Cons res recs) = either bad good res *> stream bad good recs
    stream bad _    (Nil  res _)    = maybe (pure ()) bad res
    
    matchRecord :: Matcher -> Record -> IO ()
    matchRecord (Matcher n regex) record =
      case record !? n <&> (=~ regex) of
        Just True -> write record
        _         -> pure ()
      where write = LB.putStr . encode . encodeRecord
        
    badRecord :: String -> IO ()
    badRecord = hPutStrLn stderr
    
    argParser :: Parser (Matcher, HasHeader)
    argParser = (\hdr n re -> (Matcher n re, hdr))
             <$> flag NoHeader HasHeader
                 (long "skip-header" <>
                  short 's' <>
                  help "Skip the first, header record in the CSV file")
             <*> option (eitherReader fieldParser)
                 (long "field" <>
                  short 'f' <>
                  metavar "N" <>
                  help "The 1-based field number to match")
             <*> option (eitherReader regexParser)
                 (long "regex" <>
                  short 'r' <>
                  metavar "RE" <>
                  help "The regular expression to match against the field")
      where fieldParser f = readEither f >>= \n ->
              if n < 1 then Left "the field number, N, must be >= 1"
                       else Right (n-1)
            regexParser re = compileM (UB.fromString re) [utf8]
    
    main :: IO ()
    main = do
      (matcher, hdr) <- customExecParser (prefs $ showHelpOnEmpty <>
                                                  showHelpOnError)
                      $ info (argParser <**> helper)
                        (fullDesc <>
                         header "A program to print matching CSV lines.")
      LB.getContents >>= 
        (stream badRecord (matchRecord matcher) . decode hdr)
    
    $ cat grepcsv.csv
    Language,Author,Title,Sample Lines
    Anglo-Saxon,???,???,ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
    Middle English,Laȝamon,Brut,"An preost wes on leoden, Laȝamon was ihoten
    He wes Leovenaðes sone -- liðe him be Drihten."
    Middle High German,Wolfram von Eschenbach,Tagelied,Sîne klâwen durh die wolken sint geslagen
    Greek,Odysseas Elytis,???,Τη γλώσσα μου έδωσαν ελληνική
    Russian,Alexander Pushkin,Bronze Horseman,"На берегу пустынных волн
    Стоял он, дум великих полн,"
    Georgian,Shota Rustaveli,ვეფხისტყაოსანი,ვეპხის ტყაოსანი შოთა რუსთაველი
    $ ./grepcsv -h
    A program to print matching CSV lines.
    
    Usage: grepcsv [-s|--skip-header] (-f|--field N) (-r|--regex RE)
    
    Available options:
      -s,--skip-header         Skip the first, header record in the CSV file
      -f,--field N             The 1-based field number to match
      -r,--regex RE            The regular expression to match against the field
      -h,--help                Show this help text
    $ ./grepcsv -s -f 4 --regex "ample|ᚱᚩᚠᚢ|вел..их\B" < grepcsv.csv
    Anglo-Saxon,???,???,ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
    Russian,Alexander Pushkin,Bronze Horseman,"На берегу пустынных волн
    Стоял он, дум великих полн,"
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: