Grep-CSV

April 9, 2019

I used the CSV-processing code from the essay on text-file databases, and the regular-expression matcher from a previous exercise, to build this simple grep-csv that reads from standard input and writes to standard output:

; Write each CSV record from standard input whose n-th (1-based) field matches regex.
(define (grep-csv n regex)
  (for-each-port
    (filter-port read-csv-record
      (lambda (line) (trex regex (list-ref line (- n 1)))))
    (lambda (line) (write-csv-record line))))

Given an input like this:

Charles,Dickens,Great Expectations,1861
Mark,Twain,The Adventures of Tom Sawyer,1876
William,Shakespeare,Julius Caesar,1599
Isaac,Newton,Philosophiae Naturalis Principia Mathematica,1687

We get an output like this:

> (grep-csv 2 "e.*s")
Charles,Dickens,Great Expectations,1861
William,Shakespeare,Julius Caesar,1599

Note that the titles The Adventures of Tom Sawyer and Philosophiae Naturalis Principia Mathematica both match the pattern, but those records are not returned because the match is in the wrong field; plain grep, which tests the whole line, would return them as unwanted records.
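The difference between matching a whole line and matching a single field can be sketched in Ruby, the language of one of the solutions in the comments. This is only an illustration, not the code linked from this post: the names RECORDS, PATTERN, line_matches and field_matches are made up for the example, and it assumes Ruby's csv library is installed.

```ruby
require 'csv'

# The four sample records from the post, one per line.
RECORDS = [
  "Charles,Dickens,Great Expectations,1861",
  "Mark,Twain,The Adventures of Tom Sawyer,1876",
  "William,Shakespeare,Julius Caesar,1599",
  "Isaac,Newton,Philosophiae Naturalis Principia Mathematica,1687"
].freeze

PATTERN = /e.*s/

# What plain grep does: test the pattern against the entire line.
def line_matches(lines, regex)
  lines.select { |line| regex.match?(line) }
end

# What grep-csv does: test the pattern against one 1-based field only.
def field_matches(lines, col, regex)
  lines.select { |line| regex.match?(CSV.parse_line(line)[col - 1].to_s) }
end

puts line_matches(RECORDS, PATTERN).size   # prints 4: every line matches somewhere
puts field_matches(RECORDS, 2, PATTERN)    # prints only the Dickens and Shakespeare records
```

Matched against whole lines, the pattern keeps all four records; restricted to the second field, it keeps only the two shown in the output above.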

It is handy to have CSV-splitting and regular-expression-matching code available when needed. You can see all of that code and run the program at https://ideone.com/fGHhRb.

Posted by programmingpraxis

Filed in Exercises

2 Responses to “Grep-CSV”

V said

April 10, 2019 at 1:37 AM

Quickie one in Ruby.

require 'csv'

def grep_csv(csv_str, col, regex_str)
  regex = Regexp.new(regex_str)
  csv_str
    .lines
    .select { |line| CSV.parse_line(line)[col-1] =~ regex }
    .join
end

puts grep_csv(File.read(ARGV[0]), ARGV[1].to_i, ARGV[2])

Given an example.csv file that looks like this:

Charles,Dickens,Great Expectations,1861
Mark,Twain,The Adventures of Tom Sawyer,1876
William,Shakespeare,Julius Caesar,1599
Isaac,Newton,Philosophiae Naturalis Principia Mathematica,1687

When we run the program like so:

ruby grep-csv.rb example.csv 2 "e.*s"

The output is:

Charles,Dickens,Great Expectations,1861
William,Shakespeare,Julius Caesar,1599

Globules said

April 22, 2019 at 12:01 AM

A Haskell version. It uses the Cassava library for parsing and printing CSV
files, pcre-light along with the pcre-heavy front-end for regular expressions,
and optparse-applicative for argument parsing.

The program allows skipping a “header” record, and handles UTF-8 content and
fields that span multiple lines. (In the example below, we match on a word
that appears in the second line of a field.)

The data are lines from poems, taken from UTF-8 SAMPLER.

import qualified Data.ByteString.Lazy as LB
import qualified Data.ByteString.UTF8 as UB
import           Data.Csv.Incremental (encode, encodeRecord)
import           Data.Csv.Streaming (HasHeader(..), Records(..), decode)
import           Data.Functor ((<&>))
import           Data.Maybe (maybe)
import qualified Data.Text as T
import           Data.Vector ((!?), Vector)
import           Options.Applicative
import           System.IO (hPutStrLn, stderr)
import           Text.Read (readEither)
import           Text.Regex.PCRE.Heavy ((=~), Regex, compileM)
import           Text.Regex.PCRE.Light (utf8)

data Matcher = Matcher Int Regex
type Record  = Vector T.Text

stream :: (String -> IO ()) -> (Record -> IO ()) -> Records Record -> IO ()
stream bad good (Cons res recs) = either bad good res *> stream bad good recs
stream bad _    (Nil  res _)    = maybe (pure ()) bad res

matchRecord :: Matcher -> Record -> IO ()
matchRecord (Matcher n regex) record =
  case record !? n <&> (=~ regex) of
    Just True -> write record
    _         -> pure ()
  where write = LB.putStr . encode . encodeRecord
    
badRecord :: String -> IO ()
badRecord = hPutStrLn stderr

argParser :: Parser (Matcher, HasHeader)
argParser = (\hdr n re -> (Matcher n re, hdr))
         <$> flag NoHeader HasHeader
             (long "skip-header" <>
              short 's' <>
              help "Skip the first, header record in the CSV file")
         <*> option (eitherReader fieldParser)
             (long "field" <>
              short 'f' <>
              metavar "N" <>
              help "The 1-based field number to match")
         <*> option (eitherReader regexParser)
             (long "regex" <>
              short 'r' <>
              metavar "RE" <>
              help "The regular expression to match against the field")
  where fieldParser f = readEither f >>= \n ->
          if n < 1 then Left "the field number, N, must be >= 1"
                   else Right (n-1)
        regexParser re = compileM (UB.fromString re) [utf8]

main :: IO ()
main = do
  (matcher, hdr) <- customExecParser (prefs $ showHelpOnEmpty <>
                                              showHelpOnError)
                  $ info (argParser <**> helper)
                    (fullDesc <>
                     header "A program to print matching CSV lines.")
  LB.getContents >>= 
    (stream badRecord (matchRecord matcher) . decode hdr)

$ cat grepcsv.csv
Language,Author,Title,Sample Lines
Anglo-Saxon,???,???,ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
Middle English,Laȝamon,Brut,"An preost wes on leoden, Laȝamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten."
Middle High German,Wolfram von Eschenbach,Tagelied,Sîne klâwen durh die wolken sint geslagen
Greek,Odysseas Elytis,???,Τη γλώσσα μου έδωσαν ελληνική
Russian,Alexander Pushkin,Bronze Horseman,"На берегу пустынных волн
Стоял он, дум великих полн,"
Georgian,Shota Rustaveli,ვეფხისტყაოსანი,ვეპხის ტყაოსანი შოთა რუსთაველი
$ ./grepcsv -h
A program to print matching CSV lines.

Usage: grepcsv [-s|--skip-header] (-f|--field N) (-r|--regex RE)

Available options:
  -s,--skip-header         Skip the first, header record in the CSV file
  -f,--field N             The 1-based field number to match
  -r,--regex RE            The regular expression to match against the field
  -h,--help                Show this help text
$ ./grepcsv -s -f 4 --regex "ample|ᚱᚩᚠᚢ|вел..их\B" < grepcsv.csv
Anglo-Saxon,???,???,ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
Russian,Alexander Pushkin,Bronze Horseman,"На берегу пустынных волн
Стоял он, дум великих полн,"
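For comparison, multi-line fields can also be handled in Ruby by filtering parsed records instead of physical lines. The following is a rough sketch under stated assumptions, not a replacement for either solution in this thread: grep_csv_records and SAMPLE are hypothetical names, the sample data is invented for the example, and the whole input is assumed to fit in memory. A filter that splits the text into lines first would probably fail on the second line of SAMPLE, because the quoted field is not closed on that line.

```ruby
require 'csv'

# Filter parsed records (not physical lines) on a 1-based column and
# return the matches as CSV text. Quoting is regenerated by the library,
# so the output may not be byte-for-byte identical to the input.
def grep_csv_records(csv_str, col, regex)
  CSV.parse(csv_str)
     .select { |row| regex.match?(row[col - 1].to_s) }
     .map(&:to_csv)
     .join
end

# Invented sample: the second record has a field spanning two lines.
SAMPLE = <<~CSV_TEXT
  Title,Lines
  Poem A,"first line
  second line with dum"
  Poem B,single line
CSV_TEXT

# Matches on a word that appears in the second line of the field.
puts grep_csv_records(SAMPLE, 2, /dum/)
```

Parsing the whole input with CSV.parse keeps the two-line field together as one record, so the match on its second line selects the complete record.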
