All About Programming: Regular Expression Matching Can Be Simple And Fast

Introduction

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics:



Time to match `a?`ⁿ`a`ⁿ against `a`ⁿ

Let's use superscripts to denote string repetition, so that a?³a³ is shorthand for a?a?a?aaa. The two graphs plot the time required by each approach to match the regular expression a?ⁿaⁿ against the string aⁿ.
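To make the notation concrete, here is a minimal sketch that builds the pattern a?ⁿaⁿ and the string aⁿ for a given n, then times a match with Python's standard `re` module. The helper names `build_case` and `time_match` are made up for this illustration, and the values of n are kept small on purpose so the script finishes quickly. Timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import time


def build_case(n):
    """Return (pattern, text) for a?^n a^n matched against a^n."""
    pattern = "a?" * n + "a" * n   # n optional a's followed by n required a's
    text = "a" * n                 # the input string: n a's
    return pattern, text


def time_match(n):
    """Match the whole string and return (matched, elapsed_seconds)."""
    pattern, text = build_case(n)
    regex = re.compile(pattern)
    start = time.perf_counter()
    matched = regex.fullmatch(text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    # Small n only: larger values can take a very long time to finish.
    for n in (3, 10, 14, 18):
        matched, elapsed = time_match(n)
        print(f"n={n:2d} matched={matched} time={elapsed:.6f}s")
```

For n = 3, `build_case` returns the pattern `a?a?a?aaa` and the string `aaa`, the same example used in the text above.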

Notice that Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. The Perl graph plots time in seconds, while the Thompson NFA graph plots time in microseconds: the Thompson NFA implementation is a million times faster than Perl when running on a minuscule 29-character string. The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10¹⁵ years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages. A more detailed graph later in this article presents data for other implementations.)
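As a rough illustration of why tracking many possibilities at once stays fast on this input, the sketch below matches a?ⁿaⁿ by keeping a set of reachable positions in the pattern and advancing the whole set one input character at a time. It is hand-specialized to this one pattern and is not the article's Thompson NFA implementation. The function name `match_optional_then_required` and the state numbering are assumptions made for this example.

```python
def match_optional_then_required(n, text):
    """Match a?^n a^n against text by tracking a set of states.

    State i means "i pattern elements have been passed". Elements
    0..n-1 are the optional a?, elements n..2n-1 are the required a,
    and state 2n means the whole pattern has been matched.
    """
    def closure(states):
        # An optional element (index < n) may be skipped without input.
        result = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < n and s + 1 not in result:
                result.add(s + 1)
                stack.append(s + 1)
        return result

    current = closure({0})
    for ch in text:
        if ch != "a":
            return False  # every pattern element only accepts 'a'
        # Each state below 2n consumes one 'a' and moves to the next state.
        current = closure({s + 1 for s in current if s < 2 * n})
        if not current:
            return False  # no state survived: too many characters
    return 2 * n in current


if __name__ == "__main__":
    print(match_optional_then_required(3, "aaa"))
    print(match_optional_then_required(100, "a" * 100))
```

The set never holds more than 2n + 1 states, so each input character costs work proportional to n, and the whole match costs work proportional to n² rather than growing exponentially with n.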

Read full article from Regular Expression Matching Can Be Simple And Fast
