@samjfrank's devblog

Programmer, circuit builder, gamedev, and webdev.
Graduate of computer science and computer engineering from WUSTL. Currently working at @Uber.

One interesting problem I’ve run into recently is that of regular expression mapping. A prominent example is web micro-frameworks like Flask and Bottle, which use Regular Expressions (or some similar grammar definition) to define “routes” to functions. When an HTTP request comes in, the URL is passed on to some “routing” method, which runs down a list of the regular expressions and tries to match the URL to a route, and then call the appropriate function.

And obvious issue with this approach is that is takes O(n) time to check a route, where n is the number of possible routes. This is fine for smaller applications, but when you have many defined routes it can get quite unruly. Other, more vast datasets may require some kind of regex -> value mapping as well, and obviously the O(n) approach continues to be subpar. 

So, I’ve come up with a little technique to get around this limitation. I’m not sure if this technique is widespread, but it works for me (with some caveats of course).

The fundamental basis for this technique is the understanding that regular expressions represent a language grammar. As such, they are always convertible into a Finite State Automata by definition. Sadly, the vast majority of regex implementations do not employ actual FSM conversion to save some cycles. However, for this technique FSM conversion is a requirement. 

The trick is, because each regular expression can be converted into an FSM, we can simply combine all of our routing regular expressions into a singular FSM and keep track of which accept states map to what values (where the hash table comes in). Then, when we want to check a path (or other input) to find the correct value, we simply run through our FSM, find the correct accept state, and return the associated value with the state. Now our complexity is O(k) where k is the length of the input string (always the complexity for running an FSM)! 

Okay, so there are, sadly, still some practical limitations with this technique:

  • You have to generate the route FSM before hand, and anytime you need to add a new route, you must re-generate it
  • It’s expensive to generate the routing FSM
  • For small applications, n > k quite often
  • Modern regex implementations do not represent themselves as an FSM, so you must develop your own FSM solution that you can hack the routing into (or use the one I’ll eventually develop)

Here’s the basic algorithm:

  • Use Thompson’s construction to convert each regex -> NFA
  • Use Subset construction to convert each NFA -> FSM
  • Assign state values using a common increment (so each state in each FSM has a unique ID), and create entires for the accept states mapping to their respective values
  • Combine all of the FSMs into a singular FSM
  • When matching a key, run through the FSM until you hit an accept state, look up the ID of that accept state and return the corresponding value

I’ll hopefully have a demo up soon!

via devblog on 7/27/2015