Regular Expressions, Part 2

September 18, 2009

The matcher is similar to the Rob Pike’s. The top-level function, rx-match?, calls match-here if the regular expression starts with a bol element, or calls match if the regular expression is not anchored to the beginning of the input text:

(define (rx-match? rx text) (cond ((null? rx) #t) ((equal? (car rx) '(bol)) (match-here (cdr rx) (string->list text))) (else (match rx (string->list text)))))

Match tries to find a match anywhere in the input text, calling match-here at each input character until it either finds a match or runs out of input:

(define (match rx text) (cond ((null? rx) #t) ((null? text) (match-here rx text)) (else (or (match-here rx text) (match rx (cdr text))))))

Match-here considers a subset of the input text starting at a particular character; it handles eol directly, calls match-star to handle closures, and calls match-one to match any other rx-element:

(define (match-here rx text) (if (null? rx) #t (case (caar rx) ((eol) (null? text)) ((clo) (match-star rx text)) (else (and (pair? text) (match-one (car rx) (car text)) (match-here (cdr rx) (cdr text)))))))

Match-star searches for closures in a manner similar to Pike’s matcher, calling match-one to test each rx-element:

(define (match-star rx text) (cond ((match-here (cdr rx) text) #t) ((and (pair? text) (match-one (cdar rx) (car text))) (match-star rx (cdr text))) (else #f)))

Match-one matches a single rx-element against a single character of the input text:

(define (match-one rx text) (case (car rx) ((any) #t) ((lit) (char=? (cadr rx) text)) ((ccl) (member text (cdr rx))) ((ncl) (not (member text (cdr rx))))))

Here is the matcher in action on some of the patterns of the prior exercise:

> (rx-match? ((ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (clo ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)) "hello") #f > (rx-match? ((bol) (any) (clo any) (eol)) "hello") #t > (rx-match? ((lit #\h) (lit #\e) (lit #\l) (lit #\l) (lit #\o)) "hello") #t > (rx-match? ((bol) (clo lit #\space) (lit #\h) (lit #\e) (lit #\l) (lit #\l) (lit #\o) (clo lit #\space) (eol)) "hello") #t > (rx-match? ((bol) (ncl #\x) (clo any) (ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (clo lit #\space) (lit x) (eol)) "hello") #f

You can run the program at http://programmingpraxis.codepad.org/eUxD9mwZ.

Posted by programmingpraxis

Filed in Exercises

7 Comments »

7 Responses to “Regular Expressions, Part 2”

kbob said
September 18, 2009 at 2:17 PM
I hope you’re planning on moving on to NFA-based regular expression matching, because the naive, exponential-time regex engine this series is developing is severely suboptimal. Automata, NFA and DFA, are beautiful, useful, and not difficult.

Remco Niemeijer said

September 18, 2009 at 8:32 PM

My Haskell solution (see http://bonsaicode.wordpress.com/2009/09/18/programming-praxis-regular-expressions-part-2/ for a version with comments):

import Data.List

data Elem = Lit Char | Esc Char | Any | Set Bool [Elem] deriving Show
data Chunk = Elem Elem | BoL | EoL | Star Elem deriving Show

matchElem :: Elem -> Char -> Bool
matchElem e c = case e of Lit l    -> l == c
                          Esc s    -> show c == '\\':[s]
                          Set b es -> b == any (`matchElem` c) es
                          Any      -> True

matchHere :: [Chunk] -> String -> Bool
matchHere (Elem r:rs) (x:xs) = matchElem r x && matchHere rs xs
matchHere (Star e:r)  xs     = matchStar e r xs
matchHere [EoL]       xs     = null xs
matchHere r           _      = null r

matchStar :: Elem -> [Chunk] -> String -> Bool
matchStar _ r xs     | matchHere r xs = True
matchStar e r (x:xs) = matchElem e x && matchStar e r xs
matchStar _ _ _      = False

match :: [Chunk] -> String -> Bool
match (BoL:r) = matchHere r
match r       = any (matchHere r) . tails

Remco Niemeijer said
September 18, 2009 at 8:37 PM
On a side note, the ‘HOWTO: Posting source code’ page no longer shows the actual tags required. You might want to fix that.

Also, the source code plugin now seems to htmlEncode characters like & and >, which screws up function types and such.
Posts about Programming from google blogs as of September 18, 2009 « tryfly.com said
September 18, 2009 at 11:47 PM
[…] […]
programmingpraxis said
September 19, 2009 at 1:18 AM
kbob: This blog is getting programmers to write code, so that they can maintain and improve their coding skills. This blog is not about computer science or the comparative study of algorithms. It really doesn’t matter to me that I have the fastest algorithm that solves the particular programming problem I am discussing. I do have a two-part series regex->nfa, nfa->dfa on my idea list, but that doesn’t mean I’ll ever write it.
programmingpraxis said
September 19, 2009 at 2:25 AM
Remco: I fixed the ‘sourcecode’ tags on the HOWTOpage. And I sent a bug report to the WordPress support center. Thanks for pointing out the problems.
programmingpraxis said
September 20, 2009 at 12:14 PM
Sheri, from the WordPress Support Department, reports that the bug in the ‘sourcecode’ feature has now been fixed. I reported the bug on Friday evening and it was fixed by Sunday morning. Well done, WordPress!

Phil

S	M	T	W	T	F	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Programming Praxis