Beautiful Code

September 11, 2009

We find it more convenient in Scheme to work with lists of characters than strings, which require arithmetic. The top-level function trex converts the regular expression and search text to lists, then calls the matcher:

(define (trex regex text)
  (match (string->list regex) (string->list text)))

Matching begins at the match function. If the regular expression is null, all previous elements of the regular expression matched successfully, and match returns #t to indicate a successful match. If the search text is exhausted before the end of the regular expression, the match fails, and match returns #f. If the match is anchored at the beginning of the text, match calls match-here to est if the regular expression matches the beginning of the text. Otherwise, match calls itself recursively at each position through the search text until if finds a match or exhausts the text.

(define (match regex text)
  (cond ((null? regex) #t)
        ((null? text) #f)
        ((char=? (car regex) #\^)
          (match-here (cdr regex) text))
        (else (or (match-here regex text)
                  (match regex (cdr text))))))

Match-here tests if the regular expression matches at the current position in the input text. If the regular expression is exhausted, the match succeeds, and match-here returns #t. If the regular expression is at least two characters long, and the second character is an asterisk, match-here calls match-star to match the closure. If the regular expression ends with a dollar sign, match-here returns #t if the text is empty (the null list) and #f otherwise. If the input text is not exhausted and the next character in the regular expression is a dot, which matches any character, or is the same as the next character in the text, match-here advances both the regular expression and the text to the next character and calls itself recursively. Otherwise, the match fails, and match-here returns #f.

(define (match-here regex text)
  (cond ((null? regex) #t)
        ((and (pair? (cdr regex))
              (char=? (cadr regex) #\*))
          (match-star (car regex) (cddr regex) text))
        ((and (char=? (car regex) #\$)
              (null? (cdr regex)))
          (null? text))
        ((and (pair? text)
              (or (char=? (car regex) #\.)
                  (char=? (car regex) (car text))))
          (match-here (cdr regex) (cdr text)))
        (else #f)))

Match-star implements closure, which is repetition of a pattern element, using a lazy strategy; it matches the least possible number of repetitions of the pattern element that lead to an overall match of the regular expression. The first cond clause tests for a null match and returns #t immediately. If the input text is not exhausted and the next character in the regular expression is a dot, which matches any character, or is the same as the next character in the text, match-star calls itself recursively to match the remaining text. Otherwise, match-star returns #f to indicate failure.

(define (match-star c regex text)
  (cond ((match-here regex text) #t)
        ((and (pair? text)
              (or (char=? (car text) c)
                  (char=? c #\.)))
          (match-star c regex (cdr text)))
        (else #f)))

As Kernighan and Pike point out, the match, match-here and match-star procedures are mutually recursive, and it is instructive to trace the recursion on complicated regular expressions.

You can run the code at http://programmingpraxis.codepad.org/vv72J3XE, where you will also find a comprehensive test suite.

Pages: 1 2

7 Responses to “Beautiful Code”

  1. […] Praxis – Beautiful Code By Remco Niemeijer Today’s Programming Praxis is about beautiful code. Specifically, it concerns a bit of C code that can […]

  2. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2009/09/11/programming-praxis-beautiful-code/ for a version with comments):

    import Data.List
    
    match :: String -> String -> Bool
    match ('^':r) = matchHere r
    match r       = or . map (matchHere r) . tails
    
    matchHere :: String -> String -> Bool
    matchHere (c:'*':r) xs  = matchStar c r xs
    matchHere "$"       xs  = null xs
    matchHere (r:rs) (x:xs) = (r == '.' || r == x) && matchHere rs xs
    matchHere r      _      = null r
    
    matchStar :: Char -> String -> String -> Bool
    matchStar _ r xs     | matchHere r xs = True
    matchStar c r (x:xs) = (c == '.' || c == x) && matchStar c r xs
    matchStar _ _ _      = False
    
  3. Tordek said

    Assuming strings as lists of characters (They’re usually atoms, so they should be transformed):

    Prolog solution.

    match(['^'|Rs], T)          :- matchHere(Rs, T).
    match(R, T)                 :- matchHere(R, T).
    match(R, [_|Ts])            :- matchHere(R, Ts).
    
    matchHere([], _).
    matchHere(['$'], []).
    matchHere([R,'*'|Rs], T)    :- matchStar(R, Rs, T).
    matchHere([R|Rs], [R|Ts])   :- matchHere(Rs, Ts).
    matchHere(['.'|Rs], [_|Ts]) :- matchHere(Rs, Ts).
    
    matchStar(_, Rs, Ts)        :- matchHere(Rs, Ts).
    matchStar(R, Rs, [R|Ts])    :- matchHere(Rs, Ts).
    matchStar('.', Rs, [_|Ts])  :- matchHere(Rs, Ts).
    matchStar(R, Rs, [R|Ts])    :- matchStar(R, Rs, Ts).
    matchStar('.', Rs, [_|Ts])  :- matchStar('.', Rs, Ts).

  4. My Common Lisp solution:

    (defun empty (seq)
      (declare (inline))
      (= 0 (length seq)))
    
    (defun match (re text)
      "search for re anywhere in text"
      (if (and (not (empty re)) (char= #\^ (elt re 0)))
          (matchhere (subseq re 1) text)
          ;; must look at empty string
          (cond ((matchhere re text))
    	    ((empty text)
    	     nil)
    	    ((match re (subseq text 1))))))
    
    (defun matchhere (re text)
      "search for re at beginning of text"
      (cond ((empty re))
    	((and (< 1 (length re)) (char= #\* (elt re 1)))
    	 (matchstar (elt re 0) (subseq re 2) text))
    	((and (char= #\$ (elt re 0)) (= 1 (length re)))
    	 (empty text))
    	((and (not (empty text))
    	      (or (char= #\. (elt re 0)) (char= (elt re 0) (elt text 0))))
    	 (matchhere (subseq re 1) (subseq text 1)))))
    
    (defun matchstar (c re text)
      "search for c*re at beginning of text"
      (cond ((matchhere re text))
    	((and (not (empty text))
    	      (or (char= c (elt text 0)) (char= #\. c)))
    	 (matchstar c re (subseq text 1)))))
    
  5. Michel said

    My Clojure solution. Clojure uses Java strings, which are immutable, so taking substring would be too expensive. Instead, I made match-here and match-* internal functions, and give them indices rather than strings to work with.

    Solution, on GitHub

  6. Lautaro Pecile said
    from operator import truth as not_empty
    
    is_empty = lambda s: not not_empty(s)
    
    def iterate(f, string):
        s = string
        while not_empty(s):
            yield f(s)
            s = s[1:]
    
    def match (re, text):    
        if re.startswith('^'):
            return matchhere(re[1:], text)
        return any(iterate(lambda t: matchhere(re, t), text))
            
    
    def matchhere(re, text):
        if is_empty(re):
            return True
        if re.startswith('*', 1):
            return matchstar(re[0], re[2:], text)
        if re.startswith('$'):
            return is_empty(text)
        if (not_empty(text) and (re.startswith('.') or re[0] == text[0])):
            return matchhere(re[1:], text[1:])
        return False
        
    def matchstar(c, re, text):
        while True:                
            if matchhere(re, text):
                return True
            text = text[1:]
            if is_empty(text) or not (text.startswith(c) or c == '.'):
                break
        return False
    
    

Leave a comment