Monday, April 23, 2007

Regular Expressions - how good theory is ignored in popular software

You think the regular expression implementation in Java, Perl, Python, PHP, Ruby and PCRE (which is a C library) should have been refined many many times and thus highly optimized? Think again.

http://swtch.com/~rsc/regexp/regexp1.html

The title of the article is "Regular Expression Matching Can Be Simple And Fast", but what's more interesting is the subtitle - "(but is slow in Java, Perl, PHP, Python, Ruby, ...)". Slow, how slow? Look at the first graph of the article, for some pattern matching inputs, Perl 5.8.7's built-in regular expression matching is millions times slower than a 40-year-old algorithm.

How can that happen? Well... it could be argued that the expression used in the example is a pathological case. But is it a pathological problem in theory? i.e. not belonging to P, or belonging to P with a very large exponent? Well, obviously not. Otherwise, the 40-year-old algorithm wouldn't be able to perform the matching quickly as well.

What actually happened here was this... all the popular programming language developers (Java, Python, Perl, PHP, etc.) copied/borrowed their implementation from a popular extended regular expression matching algorithm that was known to be "fast enough", but not known to be provably fast. 40+ years of theories of finite automata went into the trash bin when programmers (including the guy who invented the correct algorithm 40 years ago!) needed to release softwares fast and neglected to spend time to think about the mathematics behind.

The regular expression engine that the article's author described was only a very simple one, however. Can it be expanded to processing modern extended regular expressions without going into the same performance hell of Perl, PCRE, Python, etc.? The author gave some justifications that it could, but he was very light on the details. Even if he has missed out some details that makes his proposal infeasible, however, it still stands that the regex engines we're using every day are far from optimal.

No comments: