Beautiful Code
September 11, 2009
We find it more convenient in Scheme to work with lists of characters than strings, which require arithmetic. The top-level function trex
converts the regular expression and search text to lists, then calls the matcher:
(define (trex regex text)
(match (string->list regex) (string->list text)))
Matching begins at the match
function. If the regular expression is null, all previous elements of the regular expression matched successfully, and match
returns #t
to indicate a successful match. If the search text is exhausted before the end of the regular expression, the match fails, and match
returns #f
. If the match is anchored at the beginning of the text, match
calls match-here
to est if the regular expression matches the beginning of the text. Otherwise, match
calls itself recursively at each position through the search text until if finds a match or exhausts the text.
(define (match regex text)
(cond ((null? regex) #t)
((null? text) #f)
((char=? (car regex) #\^)
(match-here (cdr regex) text))
(else (or (match-here regex text)
(match regex (cdr text))))))
Match-here
tests if the regular expression matches at the current position in the input text. If the regular expression is exhausted, the match succeeds, and match-here
returns #t
. If the regular expression is at least two characters long, and the second character is an asterisk, match-here
calls match-star
to match the closure. If the regular expression ends with a dollar sign, match-here
returns #t
if the text is empty (the null list) and #f
otherwise. If the input text is not exhausted and the next character in the regular expression is a dot, which matches any character, or is the same as the next character in the text, match-here
advances both the regular expression and the text to the next character and calls itself recursively. Otherwise, the match fails, and match-here
returns #f
.
(define (match-here regex text)
(cond ((null? regex) #t)
((and (pair? (cdr regex))
(char=? (cadr regex) #\*))
(match-star (car regex) (cddr regex) text))
((and (char=? (car regex) #\$)
(null? (cdr regex)))
(null? text))
((and (pair? text)
(or (char=? (car regex) #\.)
(char=? (car regex) (car text))))
(match-here (cdr regex) (cdr text)))
(else #f)))
Match-star
implements closure, which is repetition of a pattern element, using a lazy strategy; it matches the least possible number of repetitions of the pattern element that lead to an overall match of the regular expression. The first cond
clause tests for a null match and returns #t
immediately. If the input text is not exhausted and the next character in the regular expression is a dot, which matches any character, or is the same as the next character in the text, match-star
calls itself recursively to match the remaining text. Otherwise, match-star
returns #f
to indicate failure.
(define (match-star c regex text)
(cond ((match-here regex text) #t)
((and (pair? text)
(or (char=? (car text) c)
(char=? c #\.)))
(match-star c regex (cdr text)))
(else #f)))
As Kernighan and Pike point out, the match
, match-here
and match-star
procedures are mutually recursive, and it is instructive to trace the recursion on complicated regular expressions.
You can run the code at http://programmingpraxis.codepad.org/vv72J3XE, where you will also find a comprehensive test suite.
[…] Praxis – Beautiful Code By Remco Niemeijer Today’s Programming Praxis is about beautiful code. Specifically, it concerns a bit of C code that can […]
My Haskell solution (see http://bonsaicode.wordpress.com/2009/09/11/programming-praxis-beautiful-code/ for a version with comments):
Assuming strings as lists of characters (They’re usually atoms, so they should be transformed):
Prolog solution.
[…] Beautiful Code « Programming Praxis. […]
My Common Lisp solution:
My Clojure solution. Clojure uses Java strings, which are immutable, so taking substring would be too expensive. Instead, I made match-here and match-* internal functions, and give them indices rather than strings to work with.
Solution, on GitHub