
Releases: coregx/coregex

v0.12.21: Tagged start states, zero-alloc API, 5 patterns faster than Rust

27 Mar 16:54
87d600b


Performance

DFA Engine

  • Tagged start states (Rust LazyStateID) — prefilter skip-ahead only at start state, fast transition in slow path (no getState/acceleration overhead)
  • DFA multiline $ fix — EndLine look-ahead (Rust determinize mod.rs:131-212)
  • Dead-state prefilter restart in searchEarliestMatch
  • Tiny NFA → UseDFA — patterns with <20 NFA states use bidirectional DFA (was PikeVM, 7x faster)

Allocation

  • 1100x fewer mallocs — flat buffer for FindAllIndex/FindAllSubmatchIndex
  • Local SearchState cache — atomic.Pointer, survives GC
  • Pool round-trip elimination in FindAll/Count

New Public API (zero-alloc)

  • AllIndex(b []byte) iter.Seq[[2]int] — zero-alloc iterator (Go proposal #61902)
  • AllStringIndex(s string) iter.Seq[[2]int] — string version
  • All(b []byte) iter.Seq[[]byte] — match content iterator
  • AllString(s string) iter.Seq[string] — string version
  • AppendAllIndex(dst [][2]int, b []byte, n int) [][2]int — buffer-reuse (strconv.Append* pattern)
  • AppendAllStringIndex(dst [][2]int, s string, n int) [][2]int — string version

Benchmarks (EPYC CI, 6MB input)

5 patterns faster than Rust (was 3 in v0.12.20):

| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 685x faster | 18.8x faster |
| Multiline PHP | 299x faster | 2.2x faster |
| Char class | 11x faster | 1.3x faster |
| Inner literal | 881x faster | 1.2x faster (NEW) |
| Version | 263x faster | 1.2x faster (NEW) |

Zero-Alloc API (new methods vs stdlib-compat)

| Method | `errors` (33K matches) | Allocs |
|---|---|---|
| FindAllStringIndex (stdlib) | 8.2ms / 3890 KB | 19 mallocs |
| AllIndex (iter.Seq) | 5.9ms / 0 KB | 0 mallocs |
| AppendAllIndex | 5.5ms / 0 KB | 0 mallocs |

emails with AppendAllIndex: 2.0ms vs Rust 2.6ms — faster than Rust!

Fixed

  • DFA multiline $ — (?m)hello$ now matches before \n
  • isMatchWithPrefilter pfSkip — zx+ on "zzx" now correct

Full Changelog: v0.12.20...v0.12.21

v0.12.20: Premultiplied StateIDs, break-at-match, Phase 3 elimination

25 Mar 19:19
90d77fd


Performance

DFA Core — Premultiplied + Tagged StateIDs

  • Premultiplied StateIDs — flatTrans[sid+classIdx] eliminates the multiply from the DFA hot loop (Rust LazyStateID)
  • Tagged StateIDs — match/dead/invalid flags in high bits, single IsTagged() branch replaces 3 comparisons
  • 4x loop unrolling in searchFirstAt, searchAt, searchEarliestMatch
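The tagged/premultiplied StateID idea above can be sketched as follows. The bit layout and names are illustrative, not coregex's actual encoding:

```go
package main

import "fmt"

// StateID packs match/dead/quit flags into the high bits. The hot loop
// then needs one IsTagged() branch ("is anything special?") instead of
// three separate comparisons against match, dead, and invalid IDs.
type StateID uint32

const (
	flagMatch StateID = 1 << 31
	flagDead  StateID = 1 << 30
	flagQuit  StateID = 1 << 29
	flagMask          = flagMatch | flagDead | flagQuit
)

func (s StateID) IsTagged() bool { return s&flagMask != 0 }
func (s StateID) IsMatch() bool  { return s&flagMatch != 0 }
func (s StateID) IsDead() bool   { return s&flagDead != 0 }

// Untagged strips the flags to recover the premultiplied index: the ID
// already encodes row*stride, so a next-state lookup is a single
// flatTrans[sid.Untagged()+class] with no multiply in the hot loop.
func (s StateID) Untagged() StateID { return s &^ flagMask }

func main() {
	s := StateID(128) | flagMatch
	fmt.Println(s.IsTagged(), s.IsMatch(), s.IsDead(), s.Untagged())
	// → true true false 128
}
```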

DFA Core — Rust-aligned Determinize

  • 1-byte match delay — Rust determinize approach (mod.rs:254-286)
  • Break-at-match — Rust determinize::next break semantics (mod.rs:284), replaces filterStatesAfterMatch
  • Epsilon closure rewrite — add-on-pop DFS with reverse Split push, matching Rust sparse set insertion order
  • Phase 3 eliminated — bidirectional DFA reduced from 3-pass to 2-pass
  • Verified against Rust regex-automata find_fwd — identical results on all test patterns

Meta Engine

  • DFA direct FindAll path — skip meta prefilter layer
  • Anchored FindAll fast paths — first-byte rejection, pool overhead skip
  • BreakAtMatch config: true for forward DFA, false for reverse DFA

Prefilter

  • Memmem: Memchr(rareByte) + verify (Rust memchr::memmem approach)

Benchmarks

Cross-language comparison on 6MB input, AMD EPYC (regex-bench):

| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 675x faster | 18.5x faster |
| Multiline PHP | 288x faster | 2.0x faster |
| Char class [\w]+ | 11x faster | 1.3x faster |
| Inner literal | 668x faster | ~parity |
| Email | 506x faster | 1.8x slower |
| LangArena total (13 patterns) | 30x faster | 3.9x gap |

No regressions vs v0.12.19 on any pattern.

Full Changelog: v0.12.19...v0.12.20

v0.12.19: zero-alloc captures, 95% less memory

24 Mar 16:59
ab4039b


What's Changed

Major memory optimization release — Rust-aligned architecture for PikeVM capture tracking and DFA cache management.

Performance

  • Zero-alloc FindSubmatch via dual SlotTable — Rust-style flat SlotTable replaces per-thread COW captures. Stack-based epsilon closure with RestoreCapture frames. FindAllSubmatch (50K matches): 554MB → 26MB (-95%), 3.3x faster
  • BoundedBacktracker visited limit — 256KB for UseNFA paths (Rust visited_capacity default). LangArena LogParser: 89MB → 25MB (-72%), RSS 353MB → 41MB (-88%)
  • Byte-based DFA cache limit — CacheCapacityBytes (2MB default) replaces MaxStates. Matches Rust hybrid cache_capacity
  • Removed dual transition storage — State.transitions eliminated, DFACache.flatTrans only

Memory Impact

| Workload | v0.12.18 | v0.12.19 | Improvement |
|---|---|---|---|
| LangArena FindAll (13 patterns, 7MB) | 89 MB alloc | 25 MB | -72% |
| LangArena RSS | 353 MB | 41 MB | -88% |
| FindAllSubmatch (5 patterns, 50K matches) | 554 MB | 26 MB | -95% |

Benchmark (regex-bench, AMD EPYC)

No regressions vs v0.12.18 on any pattern. 3 patterns faster than Rust (char_class 1.1x, ip 5.2x, multiline_php 1.2x).

Full Changelog: v0.12.18...v0.12.19

v0.12.18: Flat DFA + integrated prefilter — 3x from Rust

24 Mar 10:27
921d193


Major DFA architecture upgrade — flat transition table, integrated prefilter skip-ahead in DFA and PikeVM, 4x loop unrolling. Rust-aligned architecture.

35% faster than the v0.12.14 baseline. Within 3x of Rust (was 9x).

Performance

  • Flat DFA transition table (Rust approach) — replaced stateList[id].transitions[class] (2 pointer chases) with flatTrans[sid*stride+class] (1 flat array access). Applied to all 8 DFA search functions.
  • DFA integrated prefilter skip-ahead — when DFA returns to start state, uses prefilter.Find() to skip ahead instead of byte-by-byte scanning. Reference: Rust hybrid/search.rs:232-258.
  • PikeVM integrated prefilter skip-ahead — prefilter inside NFA search loop (pikevm.rs:1293). Safe for partial-coverage prefilters.
  • 4x loop unrolling in searchFirstAt.
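The skip-ahead idea can be sketched with a toy single-literal matcher, using bytes.IndexByte as a stand-in for the memmem/Teddy prefilter (function and variable names are hypothetical, not coregex's):

```go
package main

import (
	"bytes"
	"fmt"
)

// countOccurrences counts matches of needle. Whenever the scan is back
// at the "start state", it asks the prefilter (here: IndexByte on the
// needle's first byte) to jump directly to the next candidate instead
// of stepping byte by byte — the same shape as the DFA returning to
// its start state and calling prefilter.Find().
func countOccurrences(haystack, needle []byte) int {
	count := 0
	pos := 0
	for pos < len(haystack) {
		// At the start state: prefilter skip-ahead.
		skip := bytes.IndexByte(haystack[pos:], needle[0])
		if skip < 0 {
			return count
		}
		pos += skip
		// "DFA" verification from the candidate position.
		if bytes.HasPrefix(haystack[pos:], needle) {
			count++
			pos += len(needle)
		} else {
			pos++
		}
	}
	return count
}

func main() {
	fmt.Println(countOccurrences([]byte("xxfooxfoox"), []byte("foo")))
	// → 2
}
```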

Fixed

  • NFA candidate loop: partialCoverage flag instead of IsComplete() guard.
  • DFA prefilter skip for incomplete prefilters (memmem/Teddy prefix-only).
  • 386 int overflow in flat table indexing (safeOffset with sid >= DeadState guard).

Benchmark (AMD EPYC, 6MB)

  • 4 patterns at Rust parity or faster: ip (5.6x faster), char_class (1.0x faster), inner_literal (~parity), multiline_php (~parity)
  • No regressions vs v0.12.14 on any platform (EPYC, Xeon, M1, 386)
  • Stdlib compat: 38/38 PASS

v0.12.17: Fix LogParser ARM64 regression

23 Mar 09:15
bc78fa7


Fix LogParser 7x ARM64 regression reported by @kostya (#124).

Fixed

  • Remove false DFA downgrade for (?m)^ — lazy DFA already handles multiline
    line anchors correctly via StartByteMap (identical to Rust). 4 LangArena patterns
    restored from NFA byte-by-byte to DFA. LangArena total: 2335ms → 185ms.

  • Restore partial prefilter for (?i) alternation overflow — literal extractor
    returned empty on overflow, killing prefilter for suspicious pattern (549 NFA
    states). Now trims to 3-byte prefixes (Rust approach). Also guards NFA candidate
    loop with IsComplete() to prevent correctness bugs from partial coverage.

  • Restore UseTeddy for (?m)^ patterns — WrapLineAnchor makes Teddy safe
    for multiline anchors. http_methods on macOS ARM64: 89ms → 17ms.

Benchmark (AMD EPYC, 6MB)

  • No regressions vs v0.12.14 baseline on all LangArena patterns
  • multiline_php 1.3x faster than Rust
  • ip 5.6x faster than Rust
  • http_methods on macOS ARM64 restored to v0.12.14 level (17ms)

v0.12.16: Fix (?m)^ regression — WrapLineAnchor

21 Mar 17:40
afd9d8d


Fix (?m)^ multiline pattern regression reported by @kostya — LogParser 2s → 14s on M1.

Performance

  • WrapLineAnchor for (?m)^ patterns — WrapIncomplete (v0.12.15) forced NFA
    verification for every prefilter candidate on multiline line-anchor patterns.
    WrapLineAnchor checks line-start position in O(1) (pos==0 || haystack[pos-1]=='\n'),
    keeping IsComplete()=true so Teddy returns matches directly without NFA.
    • LangArena methods: 755ms → <1ms
    • multiline_php on macOS ARM64: 63ms → 1.55ms (41x faster)
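The O(1) line-start test described above is small enough to show directly; a sketch (the wrapper name follows the release note, the helper name is illustrative):

```go
package main

import "fmt"

// atLineStart is the check a WrapLineAnchor-style wrapper applies to
// each prefilter candidate for (?m)^ patterns: a position starts a
// line iff it is offset 0 or immediately follows a '\n'. Because the
// check is exact, the prefilter stays "complete" and Teddy candidates
// can be accepted as matches without an NFA verification pass.
func atLineStart(haystack []byte, pos int) bool {
	return pos == 0 || haystack[pos-1] == '\n'
}

func main() {
	h := []byte("GET /a\nPOST /b")
	fmt.Println(atLineStart(h, 0), atLineStart(h, 7), atLineStart(h, 5))
	// → true true false
}
```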

Fixed

  • Stdlib compatibility test: 38/38 PASS, 0 SKIP (was 34/38 with 4 skipped in v0.12.15)

v0.12.15: Per-goroutine DFA cache + 7 correctness fixes

21 Mar 07:49
9dca741


Per-goroutine DFA cache (Rust approach) + 7 correctness fixes + stdlib compatibility test.

Performance

  • Per-goroutine DFA cache — immutable DFA + mutable DFACache pooled via sync.Pool. Eliminates all concurrent data races
  • Pre-computed word boundary flags — 30% → 0.3% CPU for \b patterns
  • Integrated prefilter+DFA loop — single scan instead of two-pass
  • Strategy DFA caches in SearchState — eliminates per-call pool overhead in FindAll
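The pre-computed word boundary flags can be sketched as a one-time lookup table; an ASCII-only sketch with illustrative names:

```go
package main

import "fmt"

// isWord is filled once at startup, so the per-position \b test in the
// hot loop is a single indexed load instead of four range comparisons
// per byte.
var isWord [256]bool

func init() {
	for b := 0; b < 256; b++ {
		c := byte(b)
		isWord[c] = c == '_' ||
			('0' <= c && c <= '9') ||
			('a' <= c && c <= 'z') ||
			('A' <= c && c <= 'Z')
	}
}

// boundaryAt reports whether \b holds at pos: exactly one side of the
// position is a word byte.
func boundaryAt(haystack []byte, pos int) bool {
	before := pos > 0 && isWord[haystack[pos-1]]
	after := pos < len(haystack) && isWord[haystack[pos]]
	return before != after
}

func main() {
	h := []byte("foo, bar")
	fmt.Println(boundaryAt(h, 0), boundaryAt(h, 3), boundaryAt(h, 4))
	// → true true false
}
```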

Correctness (7 fixes)

  • .* newline boundary in ReverseSuffix/ReverseSuffixSet
  • Teddy ignoring anchors ((?m)^GET|POST)
  • (?m)^ multiline anchor DFA verification
  • .* FindAll with DFA cache clear (canMatchEmpty → UseNFA)
  • Partial prefilter on (?i) alternation overflow (Rust approach: no prefilter)
  • expandCaseFoldLiteral incomplete foldSets
  • Prefilter IsComplete with anchors (WrapIncomplete)

Testing

  • Stdlib compatibility test: 38 patterns compared against regexp — 38/38 PASS
  • Covers regex-bench + LangArena + edge cases (IsMatch, Find, FindAllIndex, Count)

Benchmark Results (AMD EPYC, 6MB input)

  • 4 patterns at Rust parity or faster (ip 5.5x, multiline_php 1.3x, char_class 1.2x, anchored_php ~same)
  • anchored: within 2.0x of Rust (restored from a 19x regression)
  • http_methods: within 2.2x of Rust (now returns correct results; previously produced false positives)
  • No regressions vs v0.12.14 baseline

v0.12.14: concurrent isMatchDFA safety fix

19 Mar 07:27
ee6b351


Fixed

  • isMatchDFA concurrent safety (#137) — prefilter candidate loop called shared lazy DFA concurrently from RunParallel. On ARM64 without SIMD prefilters: cache corruption, 1.7GB allocs, 1s+ per op on M2 Max. Fix: prefilter for fast rejection only, pooled PikeVM for verification.

Added

  • TestConcurrentCaseInsensitivePrefilter — 8 goroutines × 100 iterations, match + no-match paths. Catches the race on macOS with -race flag.

Reported by @tjbrains on Apple M2 Max.

Full Changelog: v0.12.13...v0.12.14

v0.12.13: FatTeddy fix, prefilter acceleration, AC v0.2.1

18 Mar 12:54
7a29fab


What's Changed

Performance

  • FatTeddy VPTEST hot loop — 8 instructions → 1 for candidate detection (24% faster scan)
  • FatTeddy batch FindAllPositions — one ASM call per 64KB chunk, eliminates Go→ASM round trips. FindAll 39ms → 22ms (1.8x)
  • Prefilter-accelerated isMatch/FindIndices — candidate loop with anchored DFA for large NFAs (>100 states). #137 match case: 176μs → 27μs
  • Cascading prefix trim (Rust-style) — >64 literals trimmed to fit Teddy. auth_attempts 34ms → 7ms

Fixed

  • FatTeddy AVX2 ANDL→ORL — lane combining missed single-lane patterns. (?i)get|post|put: 11456 → 34368 matches
  • Non-amd64 build — added hasAVX2 and batch stubs for macOS ARM64

Dependencies

  • ahocorasick v0.1.0 → v0.2.1 — DFA backend + SIMD prefilter, 11-22x throughput

LangArena LogParser total: 757ms → 144ms (5.3x faster). Gap to Rust: 13x → 2.5x.

Full Changelog: v0.12.12...v0.12.13

v0.12.12: prefix trimming for case-fold literals

17 Mar 20:55
ee2d823


What's Changed

Performance

  • Prefix trimming for case-fold expanded literals — when (?i) expansion produces >32 incomplete prefix literals, trims to 4-byte prefixes and deduplicates. Fits Teddy SIMD prefilter instead of slower Aho-Corasick. LangArena suspicious: 117ms → 5.7ms (20x faster, 2.6x from Rust, was 54x).
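The trim-and-deduplicate step can be sketched as follows. The function name and threshold handling are illustrative; only the 4-byte prefix limit comes from the release note:

```go
package main

import "fmt"

// trimToPrefixes cuts each literal to an n-byte prefix and removes
// duplicates. The resulting set is smaller (and only prefix-accurate,
// so candidates still need verification) but small enough to fit the
// Teddy SIMD prefilter instead of falling back to Aho-Corasick.
func trimToPrefixes(lits []string, n int) []string {
	seen := make(map[string]struct{}, len(lits))
	out := make([]string, 0, len(lits))
	for _, l := range lits {
		if len(l) > n {
			l = l[:n]
		}
		if _, dup := seen[l]; dup {
			continue
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

func main() {
	// Expanded literals sharing 4-byte prefixes collapse to two entries.
	lits := []string{"selection", "selects", "SELECT", "SELECTED"}
	fmt.Println(trimToPrefixes(lits, 4))
	// → [sele SELE]
}
```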

Full Changelog: v0.12.11...v0.12.12