Regular Expressions, Part 1

September 15, 2009

In the previous exercise we studied a simple regular-expression matcher. In this exercise, and the one that follows, we will write our own matcher for a somewhat larger set of regular expressions, based on those provided in the books Software Tools and Software Tools in Pascal by Brian W. Kernighan and P. J. Plauger:

  • c — the literal character c
  • . — any character
  • ^ and $ — the beginning or end of the search text
  • […] and [^…] — the set (or the negation of the set) of characters between brackets
  • x* — zero or more repetitions of the regular expression element x, which may be a literal character c, the unspecified character dot, or a character class

Anywhere a single character may appear, an escaped character may appear. An escape sequence consists of a backslash followed by a single character. If the character is n, the escape sequence represents a newline. If the character is t, the escape sequence represents a horizontal tab. If the character is anything else, it represents itself literally.

A character class consists of characters, escaped characters, or character ranges. A character range consists of a lower-case letter followed by a hyphen followed by a larger lower-case letter, or the same sequence using upper-case letters, or a digit followed by a hyphen followed by a larger digit.

Here are some sample regular expressions:

  • [0-9][0-9]* — matches an integer
  • ^..*$ — matches a non-empty line of text
  • hello — matches the string hello anywhere in the search text
  • ^ *hello *$ — matches a search text containing the string hello, possibly surrounded by space characters, but nothing else
  • ^[^x].*[0-9] *x$ — matches a string that starts with any character except a literal x, followed by any number (including zero) of characters, followed by a single digit, followed by any number (including zero) of space characters, and ends with a literal x

Pike’s matcher works by interpreting the regular expression directly from its input. That simple method won’t work for the more complicated regular expressions we are considering here, because it is too time-consuming to re-interpret each regular expression element each time it appears. Instead, we will pre-compile the regular expression into an intermediate form, using a parser, then match the input text using the pre-compiled intermediate form. The intermediate form consists of the following elements, arranged in an array or linked list: the beginning-of-line and end-of-line markers, the any character and literal characters, character classes and negated character classes, and closures of the any character, a literal character, or either of the two types of character classes.

Your task today is to determine a suitable intermediate form, then write a parser that takes a regular expression and outputs the corresponding intermediate form; in the next exercise, you will write the corresponding matcher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pages: 1 2

8 Responses to “Regular Expressions, Part 1”

  1. […] Praxis – Regular Expressions, Part 1 By Remco Niemeijer In today’s Programming Praxis problem our task is to write a parser for simple regular expressions. Since […]

  2. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2009/09/15/programming-praxis-regular-expressions-part-1/ for a version with comments):

    import Control.Applicative ((<$>), (*>), (<*), (<*>))
    import Text.Parsec
    import Text.Parsec.String
    
    data Elem = Lit Char | Any | Set Bool [Char] deriving Show
    data Chunk = Elem Elem | BoL | EoL | Star Elem deriving Show
    
    regex :: Parser [Chunk]
    regex = (++) <$> bol <*> many chunk where
        bol = option [] (const [BoL] <$> char '^')
        chunk = choice [Star <$> try (element <* char '*'),
                        const EoL <$> try (char '$' <* eof),
                        Elem <$> element]
        element = choice [const Any <$> char '.',
                          Set False . expandSet <$> set "[^",
                          Set True . expandSet <$> set "[",
                          Lit <$> anyChar]
        set s = try (string s *> many1 (noneOf "]") <* char '&#93;')
        expandSet (a:'-':b:xs) = &#91;a..b&#93; ++ expandSet xs
        expandSet (x:xs) = x : expandSet xs
        expandSet _ = &#91;&#93;
    
    parseRegex :: String -> Either ParseError [Chunk]
    parseRegex = parse regex ""
    
  3. Remco Niemeijer said

    Updated code because I overlooked the bit of text between the two bulleted lists:

    import Control.Applicative ((<$>), (*>), (<*), (<*>))
    import Data.Char
    import Text.Parsec
    import Text.Parsec.String
    
    data Elem = Lit Char | Esc Char | Any | Set Bool [Elem] deriving Show
    data Chunk = Elem Elem | BoL | EoL | Star Elem deriving Show
    
    regex :: Parser [Chunk]
    regex = (++) <$> bol <*> many chunk where
        bol = option [] (const [BoL] <$> char '^')
        chunk = choice [Star <$> try (element <* char '*'),
                        const EoL <$> try (char '$' <* eof),
                        Elem <$> element]
        element = choice [esc <$> try (char '\\' *> anyChar),
                          const Any <$> char '.',
                          Set False . expandSet <$> set "[^",
                          Set True . expandSet <$> set "[",
                          Lit <$> noneOf "]"]
        esc c = if elem c "nt" then Esc c else Lit c
        set s = try (string s *> many1 element <* char '&#93;')
        expandSet (Lit a:Lit '-':Lit b:xs)
            | validRange a b = map Lit &#91;a..b&#93; ++ expandSet xs
        expandSet (x:xs) = x : expandSet xs
        expandSet _ = &#91;&#93;
        validRange a b = b > a && ((isLower a && isLower b) ||
                                   (isUpper a && isUpper b) ||
                                   (isDigit a && isDigit b))
        
    parseRegex :: String -> Either ParseError [Chunk]
    parseRegex = parse regex ""
    
  4. Jamie Hope said

    In your listing here, the “#\\”s (corresponding to lines 55 and 58 of the codepad version) have been mangled into “#\”.

  5. programmingpraxis said

    Jamie: Thanks for pointing that out. It’s fixed now.

  6. Matt Might said

    I showed my compilers class how to build lexers using the derivative of regular expressions:

    http://matt.might.net/articles/implementation-of-regular-expression-matching-in-scheme-with-derivatives/

    The implementation for regular languages is remarkably compact.

  7. programmingpraxis said

    Brzozowski’s regular-expression matcher based on derivatives was on my to-do list. Now I’ll have to take it off.

  8. Vikas Tandi said

    A scratchy and long implementation in C

    #include <stdio.h>
    #include <ctype.h>
    
    static int copy_code(int code, char *s, int i);
    static int copy_character(int c, char *s, int i);
    static int copy_special_characters(int code, char *s, int i);
    
    static char codes[][4] = {"bol", "eol", "any", "lit", "ccl", "ncl",	"clo"};
    enum{ BOL = 0, EOL, ANY, LIT, CCL, NCL, CLO};
    
    static char special_characters[][8] = {"space", "newline", "tab"};
    enum{SPACE, NEWLINE, TAB};
    
    enum{VALID_REGEXP = 0, INVALID_REGEXP};
    
    int encode_regular_expression(char *regexp, char *encode)
    {
    	/* i is the index of current character in regexp ang j is the current character being output */
    	int i, j, escaped;
    	char c;
    
    	i = j = escaped = 0;
    	encode[j++] = '(';
    	/* parse through the regular expression */
    	for(;;)
    	{
    		/* read the character in regexp in c */
    		c = regexp[i++];
    
    		if(c == '\0')
    			break;
    
    		if(c == '\\')								/* escape character */
    		{
    			escaped = 1;
    			continue;
    		}
    
    		encode[j++] = '(';
    
    		if(!escaped && c == '^' && i == 1)					/* begining of line character '^' is not escaped and located at the start of regexp */
    			j = copy_code(BOL, encode, j);
    		else if(!escaped && c == '$' && regexp[i] == '\0')		/* end of line character '$' is not escaped and located at the end of regexp */
    			j = copy_code(EOL, encode, j);
    		else if(!escaped && c == '.')							/* any character */
    		{
    			if(regexp[i] == '*')
    			{
    				j = copy_code(CLO, encode, j);
    				encode[j++] = ' ';
    				i++;
    			}
    
    			j = copy_code(ANY, encode, j);
    		}
    		else if(!escaped && c == '[')							/* begining of character class */
    		{
    			int k;
    
    			/* check for closure and validity of character class */
    			for(k = i; regexp[k] != '\0' && regexp[k] != ']'; k++);
    
    			/* if no ending bracket, return invalid regexp */
    			if(regexp[k] == '\0')
    				return INVALID_REGEXP;
    
    			/* check for closure */
    			if(regexp[k+1] == '*')
    			{
    				j = copy_code(CLO, encode, j);
    				encode[j++] = ' ';
    			}
    
    			/* check for negated character */
    			if(regexp[i] == '^')
    			{
    				j = copy_code(NCL, encode, j);
    				i++;
    			}
    			else
    				j = copy_code(CCL, encode, j);
    
    			while(regexp[i] != '\0' && regexp[i] != ']')
    			{
    				char start, end;
    
    				if(regexp[i+1] != '-')
    				{
    					j = copy_character(regexp[i], encode, j);
    					i++;
    					continue;
    				}
    
    				if(!isalnum(regexp[i]) && !isalnum(regexp[i+2]))
    					return INVALID_REGEXP;
    
    				if(isalpha(regexp[i])  && !isalpha(regexp[i+2]))
    					return INVALID_REGEXP;
    
    				if(isdigit(regexp[i])  && !isdigit(regexp[i+2]))
    					return INVALID_REGEXP;
    
    				if(regexp[i] > regexp[i+2])
    					return INVALID_REGEXP;
    
    				/*now expand the given range */
    				for(start = regexp[i], end = regexp[i+2]; start <= end; start++)
    					j = copy_character(start, encode, j);
    
    				i = i + 3;
    			}
    
    			i++;
    			if(regexp[i] == '*')
    				i++;
    		}
    		else							/* literal character */
    		{
    			if(regexp[i] == '*')
    			{
    				j = copy_code(CLO, encode, j);
    				encode[j++] = ' ';
    				i++;
    			}
    
    			j = copy_code(LIT, encode, j);
    
    			/* handle special characters */
    			if(escaped && c == 'n')
    				j = copy_special_characters(NEWLINE, encode, j);
    			else if(escaped && c == 't')
    				j = copy_special_characters(TAB, encode, j);
    			else if(c == ' ')
    				j = copy_special_characters(SPACE, encode, j);
    			else
    				j = copy_character(c, encode, j);
    		}
    		encode[j++] = ')';
    		encode[j++] = ' ';
    	}
    	encode[j-1] = ')';
    	encode[j] = '\0';
    	return VALID_REGEXP;
    }
    
    static int copy_code(int code, char *s, int i)
    {
    	int j;
    
    	for(j = 0; codes[code][j] != '\0'; j++)
    		s[i++] = codes[code][j];
    
    	return i;
    }
    
    static int copy_special_characters(int code, char *s, int i)
    {
    	int j;
    
    	s[i++] = ' ';
    	s[i++] = '#';
    	s[i++] = '\\';
    
    	for(j = 0; special_characters[code][j] != '\0'; j++)
    		s[i++] = special_characters[code][j];
    
    	return i;
    }
    
    static int copy_character(int c, char *s, int i)
    {
    	s[i++] = ' ';
    	s[i++] = '#';
    	s[i++] = '\\';
    	s[i++] = c;
    
    	return i;
    }
    

Leave a comment