How We Match Regular Expressions

How We Match Regular Expressions

This blog post is a quick summary of a big project; we want to go into a lot more detail (and you can always look at the code), but it makes sense for us to get the fundamentals out there. This post is aimed at developers who are looking to understand how Hyperscan works or get involved in the Hyperscan source base; you certainly don’t need to understand Hyperscan internals to make use of the library to match regular expressions.

Hyperscan is an automata-based (e.g. NFA/DFA) style approach rather than a back-tracking approach. The automata-based approach yields advantages and disadvantages: on the plus side, an automata-based approach is amenable to streaming and handling of multiple regular expressions. On the minus side, automata-based regular expression matching cannot handle some regular expression constructs easily, or at all – backreferences and arbitrary lookaround asserts are notable regular expression features that we do not support.

Hyperscan makes use of many different techniques to try to make the regular expression matching task tractable for large numbers of regular expressions. We have not found a single, elegant automata approach that handles arbitrary regular expressions in arbitrary number – although we are still looking! Instead, we have a lot of optimizations in order to try to extract the best possible performance whether we have one simple regular expression or tens of thousands.

Some of these techniques include:

  • Discovery of literal (fixed string) factors and decomposition of regular expressions into smaller chunks (which we call “engines”) separated by these literal factors.
  • These engines can be of many different types:
    • Deterministic Finite Automata (DFA)
    • Bit-parallel Glushkov Non-deterministic Finite Automata (NFA) engines
    • Custom engines for special cases (such as large bounded repeats).
      • These engines can take many different roles:
        • “Prefix” engines that precede our literal factors
        • “Suffix” engines that follow our literal factors
        • “Infix” engines that lie between two literal factors
        • “Outfix” engines that aren’t connected at all with literal factors (when no satisfactory factors can be found in a regular expression)
      • These engines can often run lazily or not at all to reduce overhead.
    • We merge smaller DFA/NFA engines into larger ones, where this can be done without performance loss.
    • SIMD “acceleration” of automata based scanning: where we can substitute relatively simple SIMD tests of our input for complex automata execution, we do it.
    • We use Intel SIMD instructions to handle larger-scale NFA and literal matching tasks: having 128 or 256 (or more) bits for a bit-parallel automaton is often helpful.
    • … and many more short-cuts to attempt to avoid doing expensive automata calculations that we ultimately won’t need.

Hyperscan is very much a work in progress — from 2008, we have been rapidly iterating the product and supporting customers. There are many inelegant bits to our code — sometimes because the real world seems to dictate inelegant solutions (e.g. corner cases), sometimes because we haven’t had a chance to clean things up.

Many parts of our codebase are earmarked for cleanup and streamlining, and if something looks peculiar to you, please call it out — it might look peculiar because it is peculiar and be amenable to a simple fix. Bear in mind the following, though:

Hyperscan is trying to accomplish a lot of goals (some of which are contradictory):

  • We’d like to be fast — providing high levels of performance under normal use cases and in boundary conditions (e.g. short writes, floods of single characters, worst case inputs)
  • We’d like to have a small bytecode (our read-only pattern database)
  • We’d like to have a small stream state if we’re running in streaming mode (our ‘per stream’ cost — if we use 192 bytes for a pattern set then it’s going to take 1920 Mbytes to support 10M streams, of course)

We also have some constraints:

  • Our run-time must be written in C, as some data-plane environments don’t support C++
  • We cannot make arbitrary requests for memory in our run-time; we can only work within our bytecode (read-only and fixed at compile time), scratch space and stream state space.
  • Our bytecode must be serializable to a flat representation and needs to be able to be moved from memory location to memory location (i.e. no pointers internally)
  • Our data structures — especially stream states and inputs — may be allocated ‘next door’ to parts of memory that cannot be read, so we must be careful to not use large data reads outside our input or stream state areas.
  • As a rule, we prefer to support constructs only if they can be implemented efficiently in streaming mode as well as block mode, and prefer to take an all-or-nothing approach to supporting whole classes of constructs.

Parts of the regular expression task are unsolved in Hyperscan due to a lack of demand thus far in our target markets — but are not necessarily unsolvable in the Hyperscan framework. We think that there are a lot of possibilities to improve Hyperscan’s support of some regular expression constructs (lookaround asserts, particularly) and regular expression functionality (capturing subexpressions and accurate simulation of the behavior of quantifiers in backtracking engines).