Regular Expressions, Part 1
September 15, 2009
We represent a regular expression in its intermediate form as a list of rx-elements:
'(bol)
— beginning of line'(eol)
— end of line'(any)
— any (unspecified) character'(lit
c)
— the literal character c'(ccl
c . . .)
— the character class containing the literal characters c . . .'(ncl
c . . .)
— the negated character class not containing any of the literal characters c . . .'(clo
x)
— a closure, consisting of any number (zero or more) of the enclosed rx-element, which may be a literal character, an unspecified character, or a character class
The examples from the prior page are encoded as shown below:
[0-9][0-9]* — ((ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (clo ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9))
^..*$ — ((bol) (any) (clo any) (eol))
hello — ((lit #\h) (lit #\e) (lit #\l) (lit #\l) (lit #\o))
^ *hello *$ — ((bol) (clo lit #\space) (lit #\h) (lit #\e) (lit #\l) (lit #\l) (lit #\o) (clo lit #\space) (eol))
^[^x].*[0-9] *x$ — ((bol) (ncl #\x) (clo any) (ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (clo lit #\space) (lit x) (eol))
The parser loops through the regular expression, using the list-match
macro from the Standard Prelude to identify rx-elements:
(define (make-rx regexp)
(let loop ((first? #t) (rs (string->list regexp)) (zs '()))
(list-match rs
(() (reverse zs))
((#$) (reverse (cons '(eol) zs)))
((#\^ . rest) first? (loop #f rest (cons '(bol) zs)))
((#\\ c \#* . rest)
(loop #f rest (cons (list 'clo 'lit
(case c ((#\n) #\newline) ((#\t) #\tab) (else c))) zs)))
((#\\ c . rest)
(loop #f rest (cons (list 'lit
(case c ((#\n) #\newline) ((#\t) #\tab) (else c))) zs)))
((#\[ #\^ . rest)
(list-match (get-class rest)
((class) (reverse (cons (cons 'ncl class) zs)))
((class rest)
(if (and (pair? rest) (char=? (car rest) #\*))
(loop #f (cdr rest) (cons (cons 'clo (cons 'ncl class)) zs))
(loop #f rest (cons (cons 'ncl class) zs))))))
((#\[ . rest)
(list-match (get-class rest)
((class) (reverse (cons (cons 'ccl class) zs)))
((class rest)
(if (and (pair? rest) (char=? (car rest) #\*))
(loop #f (cdr rest) (cons (cons 'clo (cons 'ccl class)) zs))
(loop #f rest (cons (cons 'ccl class) zs))))))
((#\. #\* . rest) (loop #f rest (cons (list 'clo 'any) zs)))
((#\. . rest) (loop #f rest (cons '(any) zs)))
((c #\* . rest) (loop #f rest (cons (list 'clo 'lit c) zs)))
((c . rest) (loop #f rest (cons (list 'lit c) zs)))
(else (error 'make-rx "unrecognized regular expression")))))
First?
is #t
initially then #f
forever; it is used to distinguish the caret that indicates a beginning-of-line rx-element from any other caret. Most of the other pieces are unremarkable, except the code for the two kinds of character classes; both call get-class
to extract the characters in the class and point to the next character after the class. Closures are checked explicitly after the literal character, unspecified character, and the two character classes.
Get-class
returns two values in a list, the list of characters that form the class, and the remaining regular-expression text after the class; it uses range
from the Standard Prelude:
(define (get-class class)
(define (char-range a b)
(map integer->char
(range (char->integer a) (+ (char->integer b) 1))))
(let loop ((cs class) (zs '()))
(list-match cs
((#\] . rest) (pair? zs) (list zs rest))
((#\] . rest) (loop rest (cons #\] zs)))
((#\ c . rest) (loop rest (cons c zs)))
((a #\- b . rest)
(or (and (char-numeric? a) (char-numeric? b) (char<? a b))
(and (char-upper-case? a) (char-upper-case? b) (char<? a b))
(and (char-lower-case? a) (char-lower-case? b) (char<? a b)))
(loop rest (append (char-range a b) zs)))
((c . rest) (loop rest (cons c zs)))
(else (error 'get-class "unrecognized class element")))))
Here is the parser in action on the sample regular expressions shown above:
> (make-rx "[0-9][0-9]*")
((ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)
(clo ccl #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9))
> (make-rx "^..*$")
((bol) (any) (clo any) (eol))
> (make-rx "hello")
((lit #\h) (lit #\e) (lit #\l) (lit #\l) (lit #\o))
> (make-rx "^ *hello *$")
((bol) (clo lit #\space) (lit #\h) (lit #\e) (lit #\l)
(lit #\l) (lit #\o) (clo lit #\space) (eol))
> (make-rx "^[^x].*[0-9] *x$")
((bol) (ncl #\x) (clo any) (ccl #\0 #\1 #\2 #\3 #\4 #\5
#\6 #\7 #\8 #\9) (clo lit #\space) (lit x) (eol))
You can run the parser at http://programmingpraxis.codepad.org/A4XmTLTu.
[…] Praxis – Regular Expressions, Part 1 By Remco Niemeijer In today’s Programming Praxis problem our task is to write a parser for simple regular expressions. Since […]
My Haskell solution (see http://bonsaicode.wordpress.com/2009/09/15/programming-praxis-regular-expressions-part-1/ for a version with comments):
Updated code because I overlooked the bit of text between the two bulleted lists:
In your listing here, the “#\\”s (corresponding to lines 55 and 58 of the codepad version) have been mangled into “#\”.
Jamie: Thanks for pointing that out. It’s fixed now.
I showed my compilers class how to build lexers using the derivative of regular expressions:
http://matt.might.net/articles/implementation-of-regular-expression-matching-in-scheme-with-derivatives/
The implementation for regular languages is remarkably compact.
Brzozowski’s regular-expression matcher based on derivatives was on my to-do list. Now I’ll have to take it off.
A scratchy and long implementation in C