Regular Expressions, Part 1
September 15, 2009
In the previous exercise we studied a simple regular-expression matcher. In this exercise, and the one that follows, we will write our own matcher for a somewhat larger set of regular expressions, based on those provided in the books Software Tools and Software Tools in Pascal by Brian W. Kernighan and P. J. Plauger:
- c — the literal character c
- . — any character
- ^ and $ — the beginning or end of the search text
- […] and [^…] — the set (or the negation of the set) of characters between brackets
- x* — zero or more repetitions of the regular expression element x, which may be a literal character c, the unspecified character dot, or a character class
Anywhere a single character may appear, an escaped character may appear. An escape sequence consists of a backslash followed by a single character. If the character is n, the escape sequence represents a newline. If the character is t, the escape sequence represents a horizontal tab. If the character is anything else, it represents itself literally.
A character class consists of characters, escaped characters, or character ranges. A character range consists of a lower-case letter followed by a hyphen followed by a larger lower-case letter, or the same sequence using upper-case letters, or a digit followed by a hyphen followed by a larger digit.
Here are some sample regular expressions:
- [0-9][0-9]* — matches an integer
- ^..*$ — matches a non-empty line of text
- hello — matches the string hello anywhere in the search text
- ^ *hello *$ — matches a search text containing the string hello, possibly surrounded by space characters, but nothing else
- ^[^x].*[0-9] *x$ — matches a string that starts with any character except a literal x, followed by any number (including zero) of characters, followed by a single digit, followed by any number (including zero) of space characters, and ends with a literal x
Pike’s matcher works by interpreting the regular expression directly from its input. That simple method won’t work for the more complicated regular expressions we are considering here, because it is too time-consuming to re-interpret each regular expression element each time it appears. Instead, we will pre-compile the regular expression into an intermediate form, using a parser, then match the input text using the pre-compiled intermediate form. The intermediate form consists of the following elements, arranged in an array or linked list: the beginning-of-line and end-of-line markers, the any character and literal characters, character classes and negated character classes, and closures of the any character, a literal character, or either of the two types of character classes.
Your task today is to determine a suitable intermediate form, then write a parser that takes a regular expression and outputs the corresponding intermediate form; in the next exercise, you will write the corresponding matcher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.