Releases: coregx/coregex
v0.12.21: Tagged start states, zero-alloc API, 5 patterns faster than Rust
Performance
DFA Engine
- Tagged start states (Rust `LazyStateID`) — prefilter skip-ahead only at the start state; fast transitions in the slow path (no `getState`/acceleration overhead)
- DFA multiline `$` fix — EndLine look-ahead (Rust determinize `mod.rs:131-212`)
- Dead-state prefilter restart in `searchEarliestMatch`
- Tiny NFA → UseDFA — patterns with <20 NFA states use the bidirectional DFA (was PikeVM, 7x faster)
Allocation
- 1100x fewer mallocs — flat buffer for `FindAllIndex`/`FindAllSubmatchIndex`
- Local `SearchState` cache — `atomic.Pointer`, survives GC
- Pool round-trip elimination in `FindAll`/`Count`
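The local scratch cache can be pictured with a small sketch. The names and layout below are assumptions, not coregex internals; the point is the `atomic.Pointer` handoff, which (unlike a `sync.Pool`) is not emptied by the garbage collector:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Sketch of a per-Regexp scratch cache held in an atomic.Pointer
// (hypothetical searchState layout, for illustration only).
type searchState struct {
	buf []int // reusable scratch for match positions
}

type cachedRe struct {
	cache atomic.Pointer[searchState]
}

// getState takes the cached scratch if present; a concurrent caller
// that loses the Swap race simply allocates a fresh one.
func (r *cachedRe) getState() *searchState {
	if s := r.cache.Swap(nil); s != nil {
		return s
	}
	return &searchState{buf: make([]int, 0, 64)}
}

// putState returns the scratch for the next call to reuse.
func (r *cachedRe) putState(s *searchState) {
	r.cache.Store(s)
}

func main() {
	r := &cachedRe{}
	s1 := r.getState()
	r.putState(s1)
	s2 := r.getState()
	fmt.Println(s1 == s2) // second call reused the cached scratch
}
```

`Swap(nil)` makes the take race-free without a mutex: two goroutines can call `getState` concurrently and at most one gets the cached value.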
New Public API (zero-alloc)
- `AllIndex(b []byte) iter.Seq[[2]int]` — zero-alloc iterator (Go proposal #61902)
- `AllStringIndex(s string) iter.Seq[[2]int]` — string version
- `All(b []byte) iter.Seq[[]byte]` — match content iterator
- `AllString(s string) iter.Seq[string]` — string version
- `AppendAllIndex(dst [][2]int, b []byte, n int) [][2]int` — buffer reuse (`strconv.Append*` pattern)
- `AppendAllStringIndex(dst [][2]int, s string, n int) [][2]int` — string version
Benchmarks (EPYC CI, 6MB input)
5 patterns faster than Rust (was 3 in v0.12.20):
| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 685x | 18.8x faster |
| Multiline PHP | 299x | 2.2x faster |
| Char class | 11x | 1.3x faster |
| Inner literal | 881x | 1.2x faster (NEW) |
| Version | 263x | 1.2x faster (NEW) |
Zero-Alloc API (new methods vs stdlib-compat)
| Method | Time / bytes (`errors` pattern, 33K matches) | Allocs |
|---|---|---|
| FindAllStringIndex (stdlib) | 8.2ms / 3890 KB | 19 mallocs |
| AllIndex (iter.Seq) | 5.9ms / 0 KB | 0 mallocs |
| AppendAllIndex | 5.5ms / 0 KB | 0 mallocs |
`emails` with `AppendAllIndex`: 2.0ms vs Rust 2.6ms — faster than Rust!
Fixed
- DFA multiline `$` — `(?m)hello$` now matches before `\n`
- `isMatchWithPrefilter` `pfSkip` — `zx+` on "zzx" now correct
Full Changelog
v0.12.20: Premultiplied StateIDs, break-at-match, Phase 3 elimination
Performance
DFA Core — Premultiplied + Tagged StateIDs
- Premultiplied StateIDs — `flatTrans[sid+classIdx]` eliminates the multiply from the DFA hot loop (Rust `LazyStateID`)
- Tagged StateIDs — match/dead/invalid flags in high bits; a single `IsTagged()` branch replaces 3 comparisons
- 4x loop unrolling in `searchFirstAt`, `searchAt`, `searchEarliestMatch`
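The premultiplied-ID idea can be sketched on a toy DFA. Names like `flatTrans` and `stride` are illustrative, not the library's actual internals: with plain IDs the hot loop computes `flatTrans[sid*stride+class]` on every byte, while premultiplied IDs store `sid*stride` up front so the lookup is one add and one index.

```go
package main

import "fmt"

const stride = 4 // number of byte classes in this toy DFA

// Toy DFA over classes {0..3}: state 0 loops on class 0, moves to
// state 1 on class 1; state 1 loops on everything. Transition targets
// are stored already multiplied by stride (premultiplied IDs).
var flatTrans = []uint32{
	// state 0 (premultiplied id 0)
	0 * stride, 1 * stride, 0 * stride, 0 * stride,
	// state 1 (premultiplied id 4)
	1 * stride, 1 * stride, 1 * stride, 1 * stride,
}

// next takes and returns premultiplied IDs: no multiply in the hot path.
func next(sid uint32, class byte) uint32 {
	return flatTrans[sid+uint32(class)]
}

func main() {
	sid := uint32(0) // premultiplied start state
	for _, c := range []byte{0, 0, 1, 2} {
		sid = next(sid, c)
	}
	fmt.Println(sid == 1*stride) // ended in state 1
}
```

The tagged-StateID trick composes with this: flags live in high bits of the same integer, so the loop stays on a single word per step.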
DFA Core — Rust-aligned Determinize
- 1-byte match delay — Rust determinize approach (`mod.rs:254-286`)
- Break-at-match — Rust `determinize::next` break semantics (`mod.rs:284`), replaces `filterStatesAfterMatch`
- Epsilon closure rewrite — add-on-pop DFS with reverse Split push, matching Rust sparse-set insertion order
- Phase 3 eliminated — bidirectional DFA reduced from 3-pass to 2-pass
- Verified against Rust `regex-automata` `find_fwd` — identical results on all test patterns
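The add-on-pop closure can be sketched on a toy NFA (the representation below is an assumption for illustration): a state is added to the closure when it is popped, not when pushed, and successors are pushed in reverse so the leftmost alternative pops first, giving the leftmost-preference insertion order a sparse set would record.

```go
package main

import "fmt"

// Toy NFA state: just ordered epsilon successors (a Split's
// alternatives). Illustrative shape, not the library's NFA.
type nfaState struct {
	next []int
}

// epsilonClosure is an add-on-pop DFS: states enter the closure in pop
// order, and successors are pushed in reverse so next[0] pops first.
func epsilonClosure(states []nfaState, start int) []int {
	var closure []int
	seen := make([]bool, len(states))
	stack := []int{start}
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[id] {
			continue
		}
		seen[id] = true
		closure = append(closure, id) // add on pop
		for i := len(states[id].next) - 1; i >= 0; i-- {
			stack = append(stack, states[id].next[i]) // reverse push
		}
	}
	return closure
}

func main() {
	// 0 = Split(1|2), 1 → 3; 2 and 3 are leaves.
	states := []nfaState{
		{next: []int{1, 2}},
		{next: []int{3}},
		{},
		{},
	}
	fmt.Println(epsilonClosure(states, 0)) // leftmost branch first
}
```

Insertion order matters because determinization keys DFA states on the ordered closure: two engines only produce identical DFAs if they enumerate the same NFA states in the same order.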
Meta Engine
- DFA direct FindAll path — skip meta prefilter layer
- Anchored FindAll fast paths — first-byte rejection, pool overhead skip
- `BreakAtMatch` config: `true` for forward DFA, `false` for reverse DFA
Prefilter
- Memmem: `Memchr(rareByte)` + verify (Rust `memchr::memmem` approach)
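A minimal sketch of the rare-byte idea, using `bytes.IndexByte` as a stand-in for the SIMD memchr (illustrative, not the library's code): scan for the needle's rarest byte, then verify the full needle around each hit.

```go
package main

import (
	"bytes"
	"fmt"
)

// memmemRare finds needle in haystack by scanning for the byte at
// needle[rareIdx] (chosen as the rarest byte) and verifying the whole
// needle at each candidate. Sketch only; memchr::memmem also uses a
// second rare byte and SIMD verification.
func memmemRare(haystack, needle []byte, rareIdx int) int {
	rare := needle[rareIdx]
	pos := 0
	for {
		i := bytes.IndexByte(haystack[pos:], rare)
		if i < 0 {
			return -1
		}
		// Needle start position implied by the rare-byte hit.
		start := pos + i - rareIdx
		if start >= 0 && start+len(needle) <= len(haystack) &&
			bytes.Equal(haystack[start:start+len(needle)], needle) {
			return start
		}
		pos += i + 1 // false positive: resume after the hit
	}
}

func main() {
	h := []byte("the quick brown fox jumps")
	// 'q' is rarer in typical text than the needle's other bytes.
	fmt.Println(memmemRare(h, []byte("quick"), 0))
}
```

Scanning for a rare byte keeps the fast `IndexByte` loop in charge almost all the time, so verification only runs on a handful of candidates.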
Benchmarks
Cross-language comparison on 6MB input, AMD EPYC (regex-bench):
| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 675x faster | 18.5x faster than Rust |
| Multiline PHP | 288x faster | 2.0x faster than Rust |
| Char class `[\w]+` | 11x faster | 1.3x faster than Rust |
| Inner literal | 668x faster | ~parity with Rust |
| | 506x faster | 1.8x slower |
| LangArena total (13 patterns) | 30x faster | 3.9x gap |
| LangArena total (13 patterns) | 30x faster | 3.9x gap |
No regressions vs v0.12.19 on any pattern.
Full Changelog
v0.12.19: zero-alloc captures, 95% less memory
What's Changed
Major memory optimization release — Rust-aligned architecture for PikeVM capture tracking and DFA cache management.
Performance
- Zero-alloc FindSubmatch via dual SlotTable — Rust-style flat SlotTable replaces per-thread COW captures. Stack-based epsilon closure with RestoreCapture frames. FindAllSubmatch (50K matches): 554MB → 26MB (-95%), 3.3x faster
- BoundedBacktracker visited limit — 256KB for UseNFA paths (Rust `visited_capacity` default). LangArena LogParser: 89MB → 25MB (-72%), RSS 353MB → 41MB (-88%)
- Byte-based DFA cache limit — `CacheCapacityBytes` (2MB default) replaces `MaxStates`. Matches Rust `hybrid_cache_capacity`
- Remove dual transition storage — `State.transitions` eliminated, `DFACache.flatTrans` only
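A flat slot table can be sketched like this (the names `slotTable`, `row`, and `copyRow` are hypothetical, not coregex's actual types): one shared buffer indexed by thread and slot replaces a per-thread capture slice, so forking a VM thread is a `copy` within the buffer instead of an allocation.

```go
package main

import "fmt"

// slotTable holds capture slots for all PikeVM threads in one flat
// buffer indexed by thread*stride+slot (illustrative layout).
type slotTable struct {
	stride int   // slots per thread: 2 * (1 + numCaptureGroups)
	flat   []int // len = maxThreads * stride, allocated once
}

func newSlotTable(maxThreads, stride int) *slotTable {
	f := make([]int, maxThreads*stride)
	for i := range f {
		f[i] = -1 // -1 marks "slot unset"
	}
	return &slotTable{stride: stride, flat: f}
}

// row returns thread t's capture slots as a view into the flat buffer.
func (st *slotTable) row(t int) []int {
	return st.flat[t*st.stride : (t+1)*st.stride]
}

// copyRow forks thread src's slots to dst — a memcpy inside the flat
// buffer, replacing a per-thread copy-on-write allocation.
func (st *slotTable) copyRow(dst, src int) {
	copy(st.row(dst), st.row(src))
}

func main() {
	st := newSlotTable(4, 4) // 4 threads, 1 capture group (4 slots each)
	st.row(0)[0] = 7         // thread 0 records a match start
	st.copyRow(1, 0)         // split: thread 1 inherits thread 0's slots
	fmt.Println(st.row(1)[0], st.row(1)[1])
}
```

Because the buffer is allocated once per search, the number of allocations stops scaling with the number of matches, which is where the -95% figure comes from.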
Memory Impact
| Workload | v0.12.18 | v0.12.19 | Improvement |
|---|---|---|---|
| LangArena FindAll (13 patterns, 7MB) | 89 MB alloc | 25 MB | -72% |
| LangArena RSS | 353 MB | 41 MB | -88% |
| FindAllSubmatch (5 patterns, 50K matches) | 554 MB | 26 MB | -95% |
Documentation
- New `docs/ARCHITECTURE.md` — engine architecture, memory model, thread safety
- Updated `docs/OPTIMIZATIONS.md` — added #10 Dual SlotTable Capture Tracking
Benchmark (regex-bench, AMD EPYC)
No regressions vs v0.12.18 on any pattern. 3 patterns faster than Rust (char_class 1.1x, ip 5.2x, multiline_php 1.2x).
Full Changelog: v0.12.18...v0.12.19
v0.12.18: Flat DFA + integrated prefilter — 3x from Rust
Major DFA architecture upgrade — flat transition table, integrated prefilter skip-ahead in DFA and PikeVM, 4x loop unrolling. Rust-aligned architecture.
35% faster than v0.12.14 baseline. 3x from Rust (was 9x).
Performance
- Flat DFA transition table (Rust approach) — replaced `stateList[id].transitions[class]` (2 pointer chases) with `flatTrans[sid*stride+class]` (1 flat array access). Applied to all 8 DFA search functions.
- DFA integrated prefilter skip-ahead — when the DFA returns to the start state, uses `prefilter.Find()` to skip ahead instead of byte-by-byte scanning. Reference: Rust `hybrid/search.rs:232-258`.
- PikeVM integrated prefilter skip-ahead — prefilter inside the NFA search loop (`pikevm.rs:1293`). Safe for partial-coverage prefilters.
- 4x loop unrolling in `searchFirstAt`.
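The start-state skip-ahead can be sketched with a plain substring search standing in for the prefilter and a closure standing in for the DFA (illustrative only, not the engine's code): whenever the scan is back at the start state, jump straight to the next candidate instead of stepping byte by byte.

```go
package main

import (
	"bytes"
	"fmt"
)

// searchWithSkip scans haystack; at the start state it lets the
// prefilter (here bytes.Index, standing in for memmem/Teddy) skip to
// the next candidate, then runs the matcher from there.
func searchWithSkip(haystack, literal []byte, match func([]byte) int) int {
	pos := 0
	for pos < len(haystack) {
		i := bytes.Index(haystack[pos:], literal)
		if i < 0 {
			return -1 // no more candidates anywhere
		}
		pos += i
		if n := match(haystack[pos:]); n > 0 {
			return pos
		}
		pos++ // candidate failed: resume from the next byte
	}
	return -1
}

func main() {
	// Toy "DFA": literal "err" must be followed by a digit.
	match := func(b []byte) int {
		if len(b) >= 4 && string(b[:3]) == "err" && b[3] >= '0' && b[3] <= '9' {
			return 4
		}
		return 0
	}
	fmt.Println(searchWithSkip([]byte("no error here, err7 found"), []byte("err"), match))
}
```

The win comes from non-matching regions: the SIMD prefilter chews through them at memory bandwidth, and the per-byte DFA loop only runs near candidates.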
Fixed
- NFA candidate loop: `partialCoverage` flag instead of `IsComplete()` guard.
- DFA prefilter skip for incomplete prefilters (memmem/Teddy prefix-only).
- 386 `int` overflow in flat table indexing (`safeOffset` with `sid >= DeadState` guard).
Benchmark (AMD EPYC, 6MB)
- 4 patterns at Rust parity or faster: `ip` (5.6x faster), `char_class` (1.0x, parity), `inner_literal` (~parity), `multiline_php` (~parity)
- No regressions vs v0.12.14 on any platform (EPYC, Xeon, M1, 386)
- Stdlib compat: 38/38 PASS
v0.12.17: Fix LogParser ARM64 regression
Fix LogParser 7x ARM64 regression reported by @kostya (#124).
Fixed
- Remove false DFA downgrade for `(?m)^` — the lazy DFA already handles multiline line anchors correctly via StartByteMap (identical to Rust). 4 LangArena patterns restored from NFA byte-by-byte to DFA. LangArena total: 2335ms → 185ms.
- Restore partial prefilter for `(?i)` alternation overflow — the literal extractor returned empty on overflow, killing the prefilter for the `suspicious` pattern (549 NFA states). Now trims to 3-byte prefixes (Rust approach). Also guards the NFA candidate loop with `IsComplete()` to prevent correctness bugs from partial coverage.
- Restore UseTeddy for `(?m)^` patterns — `WrapLineAnchor` makes Teddy safe for multiline anchors. `http_methods` on macOS ARM64: 89ms → 17ms.
Benchmark (AMD EPYC, 6MB)
- No regressions vs v0.12.14 baseline on all LangArena patterns
- `multiline_php` 1.3x faster than Rust
- `ip` 5.6x faster than Rust
- `http_methods` on macOS ARM64 restored to v0.12.14 level (17ms)
v0.12.16: Fix (?m)^ regression — WrapLineAnchor
Fix `(?m)^` multiline pattern regression reported by @kostya (LogParser regressed from 2s to 14s on M1).
Performance
- `WrapLineAnchor` for `(?m)^` patterns — `WrapIncomplete` (v0.12.15) forced NFA verification for every prefilter candidate on multiline line-anchor patterns. `WrapLineAnchor` checks the line-start position in O(1) (`pos==0 || haystack[pos-1]=='\n'`), keeping `IsComplete()=true` so Teddy returns matches directly without NFA.
- LangArena `methods`: 755ms → <1ms
- `multiline_php` on macOS ARM64: 63ms → 1.55ms (41x faster)
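The O(1) predicate itself is simple enough to show directly. This is just the `pos==0 || haystack[pos-1]=='\n'` check described above; the wrapper plumbing around it is omitted:

```go
package main

import "fmt"

// atLineStart reports whether a prefilter candidate at pos is a valid
// (?m)^ match start: the start of the haystack, or right after '\n'.
func atLineStart(haystack []byte, pos int) bool {
	return pos == 0 || haystack[pos-1] == '\n'
}

func main() {
	h := []byte("GET /a\nPOST /b")
	fmt.Println(atLineStart(h, 0)) // start of haystack
	fmt.Println(atLineStart(h, 7)) // right after '\n' ("POST")
	fmt.Println(atLineStart(h, 4)) // mid-line: rejected
}
```

Because the check is a single byte load, wrapping Teddy candidates with it costs almost nothing, unlike routing every candidate through full NFA verification.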
Fixed
- Stdlib compatibility test: 38/38 PASS, 0 SKIP (was 34/38 with 4 skipped in v0.12.15)
v0.12.15: Per-goroutine DFA cache + 7 correctness fixes
Per-goroutine DFA cache (Rust approach) + 7 correctness fixes + stdlib compatibility test.
Performance
- Per-goroutine DFA cache — immutable DFA + mutable DFACache pooled via `sync.Pool`. Eliminates all concurrent data races
- Pre-computed word boundary flags — 30% → 0.3% CPU for `\b` patterns
- Integrated prefilter+DFA loop — single scan instead of two-pass
- Strategy DFA caches in SearchState — eliminates per-call pool overhead in FindAll
Correctness (7 fixes)
- `.*` newline boundary in ReverseSuffix/ReverseSuffixSet
- Teddy ignoring anchors (`(?m)^GET|POST`)
- `(?m)^` multiline anchor DFA verification
- `.*` FindAll with DFA cache clear (`canMatchEmpty` → UseNFA)
- Partial prefilter on `(?i)` alternation overflow (Rust approach: no prefilter)
- `expandCaseFoldLiteral` incomplete foldSets
- Prefilter `IsComplete` with anchors (`WrapIncomplete`)
Testing
- Stdlib compatibility test: 38 patterns compared against `regexp` — 38/38 PASS
- Covers regex-bench + LangArena + edge cases (IsMatch, Find, FindAllIndex, Count)
Benchmark Results (AMD EPYC, 6MB input)
- 4 patterns faster than Rust (ip 5.5x, multiline_php 1.3x, char_class 1.2x, anchored_php ~same)
- anchored: 2.0x vs Rust (restored from 19x regression)
- http_methods: 2.2x vs Rust (correct results, was returning false positives)
- No regressions vs v0.12.14 baseline
v0.12.14: concurrent isMatchDFA safety fix
Fixed
- `isMatchDFA` concurrent safety (#137) — the prefilter candidate loop called the shared lazy DFA concurrently from `RunParallel`. On ARM64 without SIMD prefilters: cache corruption, 1.7GB allocs, 1s+ per op on M2 Max. Fix: prefilter for fast rejection only, pooled PikeVM for verification.
Added
- `TestConcurrentCaseInsensitivePrefilter` — 8 goroutines × 100 iterations, match + no-match paths. Catches the race on macOS with the `-race` flag.
Reported by @tjbrains on Apple M2 Max.
Full Changelog: v0.12.13...v0.12.14
v0.12.13: FatTeddy fix, prefilter acceleration, AC v0.2.1
What's Changed
Performance
- FatTeddy VPTEST hot loop — 8 instructions → 1 for candidate detection (24% faster scan)
- FatTeddy batch FindAllPositions — one ASM call per 64KB chunk, eliminates Go→ASM round trips. FindAll 39ms → 22ms (1.8x)
- Prefilter-accelerated isMatch/FindIndices — candidate loop with anchored DFA for large NFAs (>100 states). #137 match case: 176μs → 27μs
- Cascading prefix trim (Rust-style) — >64 literals trimmed to fit Teddy. auth_attempts 34ms → 7ms
Fixed
- FatTeddy AVX2 ANDL→ORL — lane combining missed single-lane patterns. `(?i)get|post|put`: 11456 → 34368 matches
- Non-amd64 build — added `hasAVX2` and batch stubs for macOS ARM64
Dependencies
- ahocorasick v0.1.0 → v0.2.1 — DFA backend + SIMD prefilter, 11-22x throughput
LangArena LogParser total: 757ms → 144ms (5.3x faster). Gap to Rust: 13x → 2.5x.
Full Changelog: v0.12.12...v0.12.13
v0.12.12: prefix trimming for case-fold literals
What's Changed
Performance
- Prefix trimming for case-fold expanded literals — when `(?i)` expansion produces >32 incomplete prefix literals, trims to 4-byte prefixes and deduplicates. Fits the Teddy SIMD prefilter instead of slower Aho-Corasick. LangArena `suspicious`: 117ms → 5.7ms (20x faster, 2.6x from Rust, was 54x).
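A sketch of the trim-and-dedup step (the helper name and string-based literals are hypothetical; the real extractor works on byte literals with Rust-style limits): truncating each expanded literal to a fixed prefix length makes many case variants collide, so the set shrinks enough to fit Teddy.

```go
package main

import "fmt"

// trimToPrefixes truncates each literal to at most n bytes and removes
// duplicates, preserving first-seen order. Case-fold expansions that
// differ only after byte n collapse into one prefix.
func trimToPrefixes(lits []string, n int) []string {
	seen := make(map[string]bool)
	var out []string
	for _, l := range lits {
		p := l
		if len(p) > n {
			p = p[:n]
		}
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Toy subset of case-fold expansions of "select": variants that
	// differ only after the 4th byte share a 4-byte prefix.
	lits := []string{"select", "selecT", "seleCt", "SELECT", "SELECt"}
	fmt.Println(trimToPrefixes(lits, 4))
}
```

Trimmed prefixes are incomplete (they can match where the full literal does not), so candidates found this way still need NFA/DFA verification downstream.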
Full Changelog: v0.12.11...v0.12.12