A header-only regex Engine with zero-cost abstraction and strictly O(n) time complexity.
- ✔️ Strict
$O(|s|)$ matching complexity, where$|s|$ denotes the length of the subject string. - ✔️ Compile-time construction of automata for constant string regular expressions, leaving no runtime overhead.
- ✔️ Single-header file; just
includeto use. - ✔️ Supports capture groups and replacement based on capture groups.
- ✔️ Uses Brzozowski derivatives with type-level functional metaprogramming to ensure fast compilation and almost always produce minimal automata, and it’s cool!
- ✔️ Supports all visible ASCII characters as the alphabet.
- ✔️ Supports standard regular expressions (concatenation, alternation
|, Kleene star*, parentheses()); supports+and?; supports wildcard.; supports character classes (like[^a-z012]); supports escapes (\n,\t,\d,\s,\w,\x[HEX],\[,\],\*, ...); supports quantifiers ({n},{n,},{,m},{n,m}). - ❌️ Does not support zero-width assertions.
- ❌️ Does not support backreferences.
- ❌️ Capture disambiguation rules do not conform to POSIX or Perl standards.
=== Match ===
pattern: a* pattern_len: 2 str: aaaaaaaa....aaaaaaaa str_len: 100000 time: 127us
pattern: a*b pattern_len: 3 str: aaaaaaaa....aaaaaaab str_len: 100001 time: 127us
pattern: (ab)* pattern_len: 5 str: abababab....abababab str_len: 100000 time: 134us
pattern: (a?b)+ pattern_len: 6 str: abababab....abababab str_len: 100000 time: 128us
pattern: [a-z]+ pattern_len: 6 str: mmmmmmmm....mmmmmmmm str_len: 100000 time: 131us
pattern: [a-z0-9]+ pattern_len: 9 str: a1c3e5g7....w3y5a7c9 str_len: 100000 time: 127us
pattern: (a|b)*c pattern_len: 7 str: aaaaaaaa....aaaaaaaa str_len: 10000 time: 12us
pattern: (a|b)*a(a|b)*a pattern_len: 14 str: bbbbbbbb....bbbbbbba str_len: 10001 time: 13us
pattern: (a*)* pattern_len: 5 str: aaaaaaaa....aaaaaaaa str_len: 10000 time: 12us
pattern: (a*(b*)*)* pattern_len: 10 str: bbbbbbbb....bbbbbbbb str_len: 10000 time: 12us
pattern: (a|b)*a(a|b)*b pattern_len: 14 str: aaaaaaaa....bbbbbbbb str_len: 20001 time: 12us
pattern: ([a-z])*z pattern_len: 9 str: xxxxxxxx....xxxxxxxx str_len: 10000 time: 12us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000000 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000001 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000010 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000011 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000100 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000101 str_len: 6 time: 0us
pattern: ((((0*1)0)....((10*1)0)* pattern_len: 321 str: 000110 str_len: 6 time: 0us
=== Replace ===
pattern: a+Xb+ pattern_len: 5 replace_rule: $0mid$0 str: aaaaaaa....bbbbbbb str_len: 200001 time: 952us
pattern: (x*)end pattern_len: 7 replace_rule: $1-END str: xxxxxxx....xxxxend str_len: 10003 time: 75us
pattern: start(x*)end pattern_len: 12 replace_rule: $1 str: startxx....xxxxend str_len: 10008 time: 78us
pattern: (a|b)+ pattern_len: 6 replace_rule: $0 str: aaaaaaa....aaaaaaa str_len: 300000 time: 2892us
pattern: (a+)+b pattern_len: 6 replace_rule: $0 str: aaaaaaa....aaaaaab str_len: 10001 time: 62us
pattern: (a?)+b pattern_len: 6 replace_rule: $0 str: aaaaaaa....aaaaaab str_len: 10001 time: 272us
pattern: (a|b)+c pattern_len: 7 replace_rule: $0 str: aaaaaaa....aaaaaac str_len: 10001 time: 161us
pattern: (ab|a)*c pattern_len: 8 replace_rule: $0 str: aaaaaaa....aaaaaac str_len: 10001 time: 167us
pattern: (a+)(a+)b pattern_len: 9 replace_rule: $1|$2 str: aaaaaaa....aaaaaab str_len: 10001 time: 336us
pattern: (a|ab)+b pattern_len: 8 replace_rule: OK str: aaaaaaa....aaaaaab str_len: 10001 time: 157us
pattern: (a*)b pattern_len: 5 replace_rule: $1B str: aaaaaaa....aaaaaab str_len: 10001 time: 61us
For very long strings, catastrophic backtracking cases, very long patterns, and very complex patterns the engine still reliably achieves O(n) performance; the same test cases on backtracking engines almost always blow up (for example the famous Cloudflare incident).
Compiling all of these (many more than shown; see tests/) complex patterns is controllable and acceptable:
> make clean && time make -j30
real 0m13.845s
user 1m19.445s
sys 0m3.850sPrototype:
template<onre::impl::FixedString Pattern>
inline bool onre::match(std::string_view str) noexcept;
template<onre::impl::FixedString Pattern>
std::string onre::replace(std::string_view replace_rule, std::string_view str) noexcept;onre::match will return whether str can be matched by Pattern; onre::replace will return the string obtained after performing replacements according to replace_rule if str can be matched by Pattern. The rule for replace_rule is: $N denotes the $0 denotes the string itself; $$ denotes $.
If onre::replace cannot match, the result is undefined; if the replacement rule is invalid, onre::replace returns an empty string. For scenarios where a match might fail, you should first use onre::match to check whether it matches.
Usage example:
#include "onre.hpp"
#include <string>
void f() {
bool result = onre::match<"((ab)*|c*|b)(@\\.)?">("abab");
std::string replaced = onre::replace<"ab(.*)ab">("$0, $1, $$", "ababab");
// result = true; replaced = "ababab, ab, $"
}onre::match and onre::replace are thread-safe.
Instantiating onre::match or onre::replace anywhere in the code will trigger compile-time expansion and increase compile time even if the instance can never be executed at runtime. The same pattern instantiated multiple times within one translation unit is instantiated only once; different translation units will each instantiate it separately, so moving complex patterns into a single translation unit can greatly reduce compile time.
#include "onre.hpp"
bool is_valid_email(std::string email) {
// only compiled once
return onre::match<"[-a-zA-Z0-9.]+@([-a-zA-Z0-9]+.)+[-a-zA-Z0-9]+">(email);
}For programs that use very complex expressions, the compile options should include large bracket/template depth limits -fbracket-depth=[A BIG NUMBER] -ftemplate-depth=[A BIG NUMBER] to allow deeper compile-time recursion.
Unlike POSIX (IEEE Std 1003.1-2001) leftmost-longest semantics, ONRE follows a rightmost-longest disambiguation rule. Concretely, captures inside closures match the last possible substring that completes the match. For example, when matching ((a*)(a*)b)* against aaababaaaab, the results are:
| Capture | POSIX | ONRE |
|---|---|---|
$1 (((a*)(a*)b)) |
aaab |
aaaab |
$2 (first (a*)) |
aaa |
aaaa |
$3 (second (a*)) |
Equivalently:
.-- $1 --.
| |
+-$2--. |
| | |
V V V POSIX
a a a b a b a a a a b ---------
^ ^ ^ ONRE
| | |
+-- $2 --` |
| |
`-----$1----`
Other examples:
onre::replace<"((a*)b)*">("$2", "aabb") => "" // aab-()-b
onre::replace<"(a|ab)+b">("$1", "abab") => "a" // ab-(a)-bFor a regular expression
—that is, after consuming
where
with recursive definition:
Based on this, we can construct a deterministic finite automaton (DFA) for expression
We recursively compute the derivatives of the regular expression until no new expressions are generated, completing the construction.
Consider
If there are two actions, we define their concatenation
Formally, a slot configuration is an
If a set
-
$\omega \in A, \text{where } \omega(S) = S$ ; -
$\text{set}(i, x) \in A, \forall i \in \lbrace 0, 1, ..., N-1 \rbrace, x \in \mathbb N, \text{where } \text{set}(i, x)(s_0, s_1, ..., s_i, ..., s_{N-1}) = (s_0, s_1, ..., x, ..., s_{N-1})$ ; -
$\alpha \cdot \beta \in A, \forall \alpha, \beta \in A, \text{where } \alpha \cdot \beta (T) = \beta(\alpha(T)) $ ,
then
If we allow a symbol <i> to appear in a regular expression, representing a zero-width (non-consuming) match that executes the action
Formally, given a natural number
-
$\emptyset \in E$ ; -
$\epsilon \in E$ ; - $ \langle i \rangle \in E, \forall i \in \lbrace 0, 1, 2, ..., N-1 \rbrace $;
-
$a \in E, \forall a \in \Sigma$ ; -
$R|S \in E, \forall R, S \in E$ ; -
$RS \in E, \forall R, S \in E$ ; -
$R^\ast \in E, \forall R$ ,
then
Each extended regular expression defines a language, with
At the start of matching, we initialize the slot state as
For example, the POSIX regular expression a(a*)a, which captures the substring excluding the first and last characters, can be viewed as the extended expression
For an extended regular expression
For an extended regular expression
Building upon the NFA, if we assign an action to every transition—executed before consuming the character and performing the state transition—we obtain a Tagged Nondeterministic Finite Automaton (TNFA). The accepting states also have associated accept actions, executed when the automaton accepts a string.
Formally, a TNFA is a 7-tuple
We can construct a TNFA equivalent to an extended regular expression
Note that
Backtracking simulation of a TNFA has exponential complexity—no better than a backtracking regex engine (and potentially slower). To constrain complexity, we use an active-state NFA simulation method and introduce a heuristic arbitration mechanism to resolve ambiguities.
Conceptually, given a TNFA and an accepted string, every path from the start to an accepting state corresponds to one possible slot configuration—the output of the extended regular expression for that string.
For instance, in (a*)(a*) (i.e., a*, and
Now, consider the algorithm:
The TNFA has bool is_active[S] and number slots[S][N],
where slots[i] stores the slot configuration of state
Initially:
is_active <- all false
is_active[0] <- true
slots[i][j] <- all -1
Then, for each consumed character c:
bool new_active[S] <- all false
number new_slots[S][N] <- all -1
for i = 0, 1, 2, ..., S-1
if ! is_active[i] continue
for R', a in delta(q_i)
dest_state = state idx of R'
new_slots_row <- a(slots[i])
if ! new_active[dest_state]
new_slots[dest_state] = new_slots_row
else
new_slots[dest_state] = arbitration(new_slots[dest_state], new_slots_row)
new_active[dest_state] = 1
if new_active all false
failed to match
active <- new_active
slots <- new_slots
If no characters remain:
number result_slots[N] = null
for q in Q
if ! is_active[q] continue
if ! is_accept(q) continue
for a in A_accept(q)
if result_slots == null
result_slots = a(slots[q])
else
result_slots = arbitration(result_slots, a(slots[q]))
if result_slots = null
failed to match
From the code, we see that the algorithm continually advances the active set and maintains a slot configuration for each state. During propagation, the destination state is marked active, and its slots are updated by executing the transition’s action. At convergence points (when multiple paths activate the same state), an arbitration function decides which configuration to keep, permanently discarding others.
Consider the correctness first. Each state’s slot configuration corresponds to some valid path from the start to that state. Therefore, any output slot configuration must correspond to a valid path and is thus consistent with the original extended regular expression—ensuring correctness.
On the arbitration function, even if the arbitration function always picks the first, the second, or randomly, the final slot configuration will still represent a possible match. Thus, when there is only one possible configuration (i.e., the expression is unambiguous), the output always correctly represents capture groups. For ambiguous expressions, because the TNFA structure does not fully mirror the original regex semantics, it is difficult to design an arbitration rule perfectly matching disambiguation criteria (e.g., greedy longest). However, we can approximate it through heuristic indicators and algorithms to favor the longest possible match.
make test -j20Verified lowest compiler versions:
Version >= 12
--std=c++20 or higher (if supported).
If recursion depth causes compile failures, add -fbracket-depth=[A BIG NUMBER] -ftemplate-depth=[A BIG NUMBER] to allow deeper compile-time recursion.
Version >= 12
--std=c++20 or higher (if supported).
If recursion depth causes compile failures, add -ftemplate-depth=[A BIG NUMBER] to allow deeper compile-time recursion.
Note that g++ compiles significantly slower than clang++ when compiling complex patterns.
❗ To keep compilation time acceptable, the current implementation does not use semantic equivalence when merging states; it uses syntactic equivalence instead. As a result it does not support some expressions that have multiple interpretations inside closures, examples include (ab|ababab|c)*, (aa|aaa)*, (a*|aa)*, or (a*aa)*. Instead, equivalent forms like (ab|c)*, |aaa*, a*, or (aa)* are preferred; otherwise compilation may fail. An Exact mode might be added in future to relax this, but compilation time will predictably become unacceptable.
🩹 The expressions that cause the issue are rarely used in normal circumstances, and because patterns are static they are not exploitable for attack vectors. Fixing this is not currently planned.