Wednesday, May 21, 2008

Making Regexes Readable

Regular expressions are extremely powerful. They have a tendency, however, to grow and turn into unreadable messes. What have people done to try to tame them?

Perl is often at the forefront of regex technology. With the /x modifier, it allows multiline regexes with ignored whitespace and comments. That's nice, and it's a great step in the right direction. If your regex grows much beyond a handful of lines, though, you'll still have a mess.

What is it that makes large programs readable? More than anything, subroutines do it. I really want to be able to create something analogous to subroutines in my regex. I'd like to be able to create a small, understandable regex that defines part of a complicated regex. Then I'd like the complex regex to be able to refer to the smaller one by name.

Once again, we can look at Perl. Well, we can almost look to Perl. Perl allows you to define something called an overloaded constant. It looks as though these can define things like a new escape sequence that's usable in a regex. I won't claim that I understand it, but this page talks about it some. It seems to do the right thing, but I can't find many people who use it, so I suspect it has problems. I'm going to guess that the new escape sequence is visible to all regular expressions in scope. That would make it awkward to use safely.

Python, Ruby, and .NET don't have the features that I'd want. They tend to have fairly conventional regex libraries. It looks as though I'll have to look elsewhere.

Boost.Xpressive takes a completely different approach to regular expressions. This is an impressive C++ expression template library written by Eric Niebler. It allows you to create conventional regexes from strings, but it also lets you write static regexes directly as C++ expressions.

This approach goes a long way toward making complex regexes readable, but it's not without problems.

Here's an example: /\$\d+\.\d\d/ is a Ruby regular expression to match dollar amounts such as "$3.12". It's a very simple regex, and a static xpressive regex gets a lot more verbose:

sregex dollars = '$' >> +_d >> '.' >> _d >> _d;

Remember, this is C++. A lot of the operators that conventional regexes use aren't available. For example, a prefix + operator is used instead of a postfix one. A conventional regex expresses "this, then that" by simply writing one element after the other, but C++ has no operator for that, so >> takes its place. The result is a fairly messy syntax.

However, you can do some really great things with this. You can, for example, use a regex inside another regex.

sregex yen = +_d >> '¥';
sregex currency = dollars | yen;

You can start to see that, while simple regexes are worse looking, the ability to combine individual, named regexes together allows complex regexes to look much cleaner.

I'm not convinced that Boost.Xpressive is the answer. C++'s limitations show through the library's API too easily. However, if I ever have to create an extremely complex regex that will require extensive maintenance later, I'm unaware of any viable alternatives.

Ideally, some other language will take this idea and make it cleaner.

This post was originally published on the PC-Doctor blog.
