Releases: coregx/coregex
v0.12.21: Tagged start states, zero-alloc API, 5 patterns faster than Rust
Performance
DFA Engine
- Tagged start states (Rust `LazyStateID`) — prefilter skip-ahead only at the start state; fast transitions in the slow path (no `getState`/acceleration overhead)
- DFA multiline `$` fix — EndLine look-ahead (Rust determinize `mod.rs:131-212`)
- Dead-state prefilter restart in `searchEarliestMatch`
- Tiny NFA → UseDFA — patterns with <20 NFA states use the bidirectional DFA (was PikeVM, 7x faster)
Allocation
- 1100x fewer mallocs — flat buffer for `FindAllIndex`/`FindAllSubmatchIndex`
- Local `SearchState` cache — `atomic.Pointer`, survives GC
- Pool round-trip elimination in `FindAll`/`Count`
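The local scratch cache can be pictured with a small sketch. The names and layout below are assumptions, not coregex internals; the point is the `atomic.Pointer` handoff, which (unlike a `sync.Pool`) is not emptied by the garbage collector:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Sketch of a per-Regexp scratch cache held in an atomic.Pointer
// (hypothetical searchState layout, for illustration only).
type searchState struct {
	buf []int // reusable scratch for match positions
}

type cachedRe struct {
	cache atomic.Pointer[searchState]
}

// getState takes the cached scratch if present; a concurrent caller
// that loses the Swap race simply allocates a fresh one.
func (r *cachedRe) getState() *searchState {
	if s := r.cache.Swap(nil); s != nil {
		return s
	}
	return &searchState{buf: make([]int, 0, 64)}
}

// putState returns the scratch for the next call to reuse.
func (r *cachedRe) putState(s *searchState) {
	r.cache.Store(s)
}

func main() {
	r := &cachedRe{}
	s1 := r.getState()
	r.putState(s1)
	s2 := r.getState()
	fmt.Println(s1 == s2) // second call reused the cached scratch
}
```

`Swap(nil)` makes the take race-free without a mutex: two goroutines can call `getState` concurrently and at most one gets the cached value.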
New Public API (zero-alloc)
- `AllIndex(b []byte) iter.Seq[[2]int]` — zero-alloc iterator (Go proposal #61902)
- `AllStringIndex(s string) iter.Seq[[2]int]` — string version
- `All(b []byte) iter.Seq[[]byte]` — match content iterator
- `AllString(s string) iter.Seq[string]` — string version
- `AppendAllIndex(dst [][2]int, b []byte, n int) [][2]int` — buffer reuse (`strconv.Append*` pattern)
- `AppendAllStringIndex(dst [][2]int, s string, n int) [][2]int` — string version
Benchmarks (EPYC CI, 6MB input)
5 patterns faster than Rust (was 3 in v0.12.20):
| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 685x | 18.8x faster |
| Multiline PHP | 299x | 2.2x faster |
| Char class | 11x | 1.3x faster |
| Inner literal | 881x | 1.2x faster (NEW) |
| Version | 263x | 1.2x faster (NEW) |
Zero-Alloc API (new methods vs stdlib-compat)
| Method | Time / bytes (`errors` pattern, 33K matches) | Allocs |
|---|---|---|
| FindAllStringIndex (stdlib) | 8.2ms / 3890 KB | 19 mallocs |
| AllIndex (iter.Seq) | 5.9ms / 0 KB | 0 mallocs |
| AppendAllIndex | 5.5ms / 0 KB | 0 mallocs |
`emails` with `AppendAllIndex`: 2.0ms vs Rust 2.6ms — faster than Rust!
Fixed
- DFA multiline `$` — `(?m)hello$` now matches before `\n`
- `isMatchWithPrefilter` `pfSkip` — `zx+` on "zzx" now correct
Full Changelog
v0.12.20: Premultiplied StateIDs, break-at-match, Phase 3 elimination
Performance
DFA Core — Premultiplied + Tagged StateIDs
- Premultiplied StateIDs — `flatTrans[sid+classIdx]` eliminates the multiply from the DFA hot loop (Rust `LazyStateID`)
- Tagged StateIDs — match/dead/invalid flags in high bits; a single `IsTagged()` branch replaces 3 comparisons
- 4x loop unrolling in `searchFirstAt`, `searchAt`, `searchEarliestMatch`
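The premultiplied-ID idea can be sketched on a toy DFA. Names like `flatTrans` and `stride` are illustrative, not the library's actual internals: with plain IDs the hot loop computes `flatTrans[sid*stride+class]` on every byte, while premultiplied IDs store `sid*stride` up front so the lookup is one add and one index.

```go
package main

import "fmt"

const stride = 4 // number of byte classes in this toy DFA

// Toy DFA over classes {0..3}: state 0 loops on class 0, moves to
// state 1 on class 1; state 1 loops on everything. Transition targets
// are stored already multiplied by stride (premultiplied IDs).
var flatTrans = []uint32{
	// state 0 (premultiplied id 0)
	0 * stride, 1 * stride, 0 * stride, 0 * stride,
	// state 1 (premultiplied id 4)
	1 * stride, 1 * stride, 1 * stride, 1 * stride,
}

// next takes and returns premultiplied IDs: no multiply in the hot path.
func next(sid uint32, class byte) uint32 {
	return flatTrans[sid+uint32(class)]
}

func main() {
	sid := uint32(0) // premultiplied start state
	for _, c := range []byte{0, 0, 1, 2} {
		sid = next(sid, c)
	}
	fmt.Println(sid == 1*stride) // ended in state 1
}
```

The tagged-StateID trick composes with this: flags live in high bits of the same integer, so the loop stays on a single word per step.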
DFA Core — Rust-aligned Determinize
- 1-byte match delay — Rust determinize approach (`mod.rs:254-286`)
- Break-at-match — Rust `determinize::next` break semantics (`mod.rs:284`), replaces `filterStatesAfterMatch`
- Epsilon closure rewrite — add-on-pop DFS with reverse Split push, matching Rust sparse-set insertion order
- Phase 3 eliminated — bidirectional DFA reduced from 3-pass to 2-pass
- Verified against Rust `regex-automata` `find_fwd` — identical results on all test patterns
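The add-on-pop closure can be sketched on a toy NFA (the representation below is an assumption for illustration): a state is added to the closure when it is popped, not when pushed, and successors are pushed in reverse so the leftmost alternative pops first, giving the leftmost-preference insertion order a sparse set would record.

```go
package main

import "fmt"

// Toy NFA state: just ordered epsilon successors (a Split's
// alternatives). Illustrative shape, not the library's NFA.
type nfaState struct {
	next []int
}

// epsilonClosure is an add-on-pop DFS: states enter the closure in pop
// order, and successors are pushed in reverse so next[0] pops first.
func epsilonClosure(states []nfaState, start int) []int {
	var closure []int
	seen := make([]bool, len(states))
	stack := []int{start}
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[id] {
			continue
		}
		seen[id] = true
		closure = append(closure, id) // add on pop
		for i := len(states[id].next) - 1; i >= 0; i-- {
			stack = append(stack, states[id].next[i]) // reverse push
		}
	}
	return closure
}

func main() {
	// 0 = Split(1|2), 1 → 3; 2 and 3 are leaves.
	states := []nfaState{
		{next: []int{1, 2}},
		{next: []int{3}},
		{},
		{},
	}
	fmt.Println(epsilonClosure(states, 0)) // leftmost branch first
}
```

Insertion order matters because determinization keys DFA states on the ordered closure: two engines only produce identical DFAs if they enumerate the same NFA states in the same order.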
Meta Engine
- DFA direct FindAll path — skip meta prefilter layer
- Anchored FindAll fast paths — first-byte rejection, pool overhead skip
- `BreakAtMatch` config: `true` for forward DFA, `false` for reverse DFA
Prefilter
- Memmem: `Memchr(rareByte)` + verify (Rust `memchr::memmem` approach)
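A minimal sketch of the rare-byte idea, using `bytes.IndexByte` as a stand-in for the SIMD memchr (illustrative, not the library's code): scan for the needle's rarest byte, then verify the full needle around each hit.

```go
package main

import (
	"bytes"
	"fmt"
)

// memmemRare finds needle in haystack by scanning for the byte at
// needle[rareIdx] (chosen as the rarest byte) and verifying the whole
// needle at each candidate. Sketch only; memchr::memmem also uses a
// second rare byte and SIMD verification.
func memmemRare(haystack, needle []byte, rareIdx int) int {
	rare := needle[rareIdx]
	pos := 0
	for {
		i := bytes.IndexByte(haystack[pos:], rare)
		if i < 0 {
			return -1
		}
		// Needle start position implied by the rare-byte hit.
		start := pos + i - rareIdx
		if start >= 0 && start+len(needle) <= len(haystack) &&
			bytes.Equal(haystack[start:start+len(needle)], needle) {
			return start
		}
		pos += i + 1 // false positive: resume after the hit
	}
}

func main() {
	h := []byte("the quick brown fox jumps")
	// 'q' is rarer in typical text than the needle's other bytes.
	fmt.Println(memmemRare(h, []byte("quick"), 0))
}
```

Scanning for a rare byte keeps the fast `IndexByte` loop in charge almost all the time, so verification only runs on a handful of candidates.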
Benchmarks
Cross-language comparison on 6MB input, AMD EPYC (regex-bench):
| Pattern | vs stdlib | vs Rust |
|---|---|---|
| IP address | 675x faster | 18.5x faster than Rust |
| Multiline PHP | 288x faster | 2.0x faster than Rust |
| Char class `[\w]+` | 11x faster | 1.3x faster than Rust |
| Inner literal | 668x faster | ~parity with Rust |
| | 506x faster | 1.8x slower |
| LangArena total (13 patterns) | 30x faster | 3.9x gap |
| LangArena total (13 patterns) | 30x faster | 3.9x gap |
No regressions vs v0.12.19 on any pattern.
Full Changelog
v0.12.19: zero-alloc captures, 95% less memory
What's Changed
Major memory optimization release — Rust-aligned architecture for PikeVM capture tracking and DFA cache management.
Performance
- Zero-alloc FindSubmatch via dual SlotTable — Rust-style flat SlotTable replaces per-thread COW captures. Stack-based epsilon closure with RestoreCapture frames. FindAllSubmatch (50K matches): 554MB → 26MB (-95%), 3.3x faster
- BoundedBacktracker visited limit — 256KB for UseNFA paths (Rust `visited_capacity` default). LangArena LogParser: 89MB → 25MB (-72%), RSS 353MB → 41MB (-88%)
- Byte-based DFA cache limit — `CacheCapacityBytes` (2MB default) replaces `MaxStates`. Matches Rust `hybrid_cache_capacity`
- Remove dual transition storage — `State.transitions` eliminated, `DFACache.flatTrans` only
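A flat slot table can be sketched like this (the names `slotTable`, `row`, and `copyRow` are hypothetical, not coregex's actual types): one shared buffer indexed by thread and slot replaces a per-thread capture slice, so forking a VM thread is a `copy` within the buffer instead of an allocation.

```go
package main

import "fmt"

// slotTable holds capture slots for all PikeVM threads in one flat
// buffer indexed by thread*stride+slot (illustrative layout).
type slotTable struct {
	stride int   // slots per thread: 2 * (1 + numCaptureGroups)
	flat   []int // len = maxThreads * stride, allocated once
}

func newSlotTable(maxThreads, stride int) *slotTable {
	f := make([]int, maxThreads*stride)
	for i := range f {
		f[i] = -1 // -1 marks "slot unset"
	}
	return &slotTable{stride: stride, flat: f}
}

// row returns thread t's capture slots as a view into the flat buffer.
func (st *slotTable) row(t int) []int {
	return st.flat[t*st.stride : (t+1)*st.stride]
}

// copyRow forks thread src's slots to dst — a memcpy inside the flat
// buffer, replacing a per-thread copy-on-write allocation.
func (st *slotTable) copyRow(dst, src int) {
	copy(st.row(dst), st.row(src))
}

func main() {
	st := newSlotTable(4, 4) // 4 threads, 1 capture group (4 slots each)
	st.row(0)[0] = 7         // thread 0 records a match start
	st.copyRow(1, 0)         // split: thread 1 inherits thread 0's slots
	fmt.Println(st.row(1)[0], st.row(1)[1])
}
```

Because the buffer is allocated once per search, the number of allocations stops scaling with the number of matches, which is where the -95% figure comes from.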
Memory Impact
| Workload | v0.12.18 | v0.12.19 | Improvement |
|---|---|---|---|
| LangArena FindAll (13 patterns, 7MB) | 89 MB alloc | 25 MB | -72% |
| LangArena RSS | 353 MB | 41 MB | -88% |
| FindAllSubmatch (5 patterns, 50K matches) | 554 MB | 26 MB | -95% |
Documentation
- New `docs/ARCHITECTURE.md` — engine architecture, memory model, thread safety
- Updated `docs/OPTIMIZATIONS.md` — added #10 Dual SlotTable Capture Tracking
Benchmark (regex-bench, AMD EPYC)
No regressions vs v0.12.18 on any pattern. 3 patterns faster than Rust (char_class 1.1x, ip 5.2x, multiline_php 1.2x).
Full Changelog: v0.12.18...v0.12.19
v0.12.18: Flat DFA + integrated prefilter — 3x from Rust
Major DFA architecture upgrade — flat transition table, integrated prefilter skip-ahead in DFA and PikeVM, 4x loop unrolling. Rust-aligned architecture.
35% faster than v0.12.14 baseline. 3x from Rust (was 9x).
Performance
- Flat DFA transition table (Rust approach) — replaced `stateList[id].transitions[class]` (2 pointer chases) with `flatTrans[sid*stride+class]` (1 flat array access). Applied to all 8 DFA search functions.
- DFA integrated prefilter skip-ahead — when the DFA returns to the start state, uses `prefilter.Find()` to skip ahead instead of byte-by-byte scanning. Reference: Rust `hybrid/search.rs:232-258`.
- PikeVM integrated prefilter skip-ahead — prefilter inside the NFA search loop (`pikevm.rs:1293`). Safe for partial-coverage prefilters.
- 4x loop unrolling in `searchFirstAt`.
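The start-state skip-ahead can be sketched with a plain substring search standing in for the prefilter and a closure standing in for the DFA (illustrative only, not the engine's code): whenever the scan is back at the start state, jump straight to the next candidate instead of stepping byte by byte.

```go
package main

import (
	"bytes"
	"fmt"
)

// searchWithSkip scans haystack; at the start state it lets the
// prefilter (here bytes.Index, standing in for memmem/Teddy) skip to
// the next candidate, then runs the matcher from there.
func searchWithSkip(haystack, literal []byte, match func([]byte) int) int {
	pos := 0
	for pos < len(haystack) {
		i := bytes.Index(haystack[pos:], literal)
		if i < 0 {
			return -1 // no more candidates anywhere
		}
		pos += i
		if n := match(haystack[pos:]); n > 0 {
			return pos
		}
		pos++ // candidate failed: resume from the next byte
	}
	return -1
}

func main() {
	// Toy "DFA": literal "err" must be followed by a digit.
	match := func(b []byte) int {
		if len(b) >= 4 && string(b[:3]) == "err" && b[3] >= '0' && b[3] <= '9' {
			return 4
		}
		return 0
	}
	fmt.Println(searchWithSkip([]byte("no error here, err7 found"), []byte("err"), match))
}
```

The win comes from non-matching regions: the SIMD prefilter chews through them at memory bandwidth, and the per-byte DFA loop only runs near candidates.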
Fixed
- NFA candidate loop: `partialCoverage` flag instead of `IsComplete()` guard.
- DFA prefilter skip for incomplete prefilters (memmem/Teddy prefix-only).
- 386 `int` overflow in flat table indexing (`safeOffset` with `sid >= DeadState` guard).
Benchmark (AMD EPYC, 6MB)
- 4 patterns at Rust parity or faster: `ip` (5.6x faster), `char_class` (1.0x, parity), `inner_literal` (~parity), `multiline_php` (~parity)
- No regressions vs v0.12.14 on any platform (EPYC, Xeon, M1, 386)
- Stdlib compat: 38/38 PASS
v0.12.17: Fix LogParser ARM64 regression
Fix LogParser 7x ARM64 regression reported by @kostya (#124).
Fixed
- Remove false DFA downgrade for `(?m)^` — the lazy DFA already handles multiline line anchors correctly via StartByteMap (identical to Rust). 4 LangArena patterns restored from NFA byte-by-byte to DFA. LangArena total: 2335ms → 185ms.
- Restore partial prefilter for `(?i)` alternation overflow — the literal extractor returned empty on overflow, killing the prefilter for the `suspicious` pattern (549 NFA states). Now trims to 3-byte prefixes (Rust approach). Also guards the NFA candidate loop with `IsComplete()` to prevent correctness bugs from partial coverage.
- Restore UseTeddy for `(?m)^` patterns — `WrapLineAnchor` makes Teddy safe for multiline anchors. `http_methods` on macOS ARM64: 89ms → 17ms.
Benchmark (AMD EPYC, 6MB)
- No regressions vs v0.12.14 baseline on all LangArena patterns
- `multiline_php` 1.3x faster than Rust
- `ip` 5.6x faster than Rust
- `http_methods` on macOS ARM64 restored to v0.12.14 level (17ms)
v0.12.16: Fix (?m)^ regression — WrapLineAnchor
Fix `(?m)^` multiline pattern regression reported by @kostya (LogParser regressed from 2s to 14s on M1).
Performance
- `WrapLineAnchor` for `(?m)^` patterns — `WrapIncomplete` (v0.12.15) forced NFA verification for every prefilter candidate on multiline line-anchor patterns. `WrapLineAnchor` checks the line-start position in O(1) (`pos==0 || haystack[pos-1]=='\n'`), keeping `IsComplete()=true` so Teddy returns matches directly without NFA.
- LangArena `methods`: 755ms → <1ms
- `multiline_php` on macOS ARM64: 63ms → 1.55ms (41x faster)
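The O(1) predicate itself is simple enough to show directly. This is just the `pos==0 || haystack[pos-1]=='\n'` check described above; the wrapper plumbing around it is omitted:

```go
package main

import "fmt"

// atLineStart reports whether a prefilter candidate at pos is a valid
// (?m)^ match start: the start of the haystack, or right after '\n'.
func atLineStart(haystack []byte, pos int) bool {
	return pos == 0 || haystack[pos-1] == '\n'
}

func main() {
	h := []byte("GET /a\nPOST /b")
	fmt.Println(atLineStart(h, 0)) // start of haystack
	fmt.Println(atLineStart(h, 7)) // right after '\n' ("POST")
	fmt.Println(atLineStart(h, 4)) // mid-line: rejected
}
```

Because the check is a single byte load, wrapping Teddy candidates with it costs almost nothing, unlike routing every candidate through full NFA verification.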
Fixed
- Stdlib compatibility test: 38/38 PASS, 0 SKIP (was 34/38 with 4 skipped in v0.12.15)
v0.12.15: Per-goroutine DFA cache + 7 correctness fixes
Per-goroutine DFA cache (Rust approach) + 7 correctness fixes + stdlib compatibility test.
Performance
- Per-goroutine DFA cache — immutable DFA + mutable DFACache pooled via `sync.Pool`. Eliminates all concurrent data races
- Pre-computed word boundary flags — 30% → 0.3% CPU for `\b` patterns
- Integrated prefilter+DFA loop — single scan instead of two-pass
- Strategy DFA caches in SearchState — eliminates per-call pool overhead in FindAll
Correctness (7 fixes)
- `.*` newline boundary in ReverseSuffix/ReverseSuffixSet
- Teddy ignoring anchors (`(?m)^GET|POST`)
- `(?m)^` multiline anchor DFA verification
- `.*` FindAll with DFA cache clear (`canMatchEmpty` → UseNFA)
- Partial prefilter on `(?i)` alternation overflow (Rust approach: no prefilter)
- `expandCaseFoldLiteral` incomplete foldSets
- Prefilter `IsComplete` with anchors (`WrapIncomplete`)
Testing
- Stdlib compatibility test: 38 patterns compared against `regexp` — 38/38 PASS
- Covers regex-bench + LangArena + edge cases (IsMatch, Find, FindAllIndex, Count)
Benchmark Results (AMD EPYC, 6MB input)
- 4 patterns faster than Rust (ip 5.5x, multiline_php 1.3x, char_class 1.2x, anchored_php ~same)
- anchored: 2.0x vs Rust (restored from 19x regression)
- http_methods: 2.2x vs Rust (correct results, was returning false positives)
- No regressions vs v0.12.14 baseline
v0.12.14: concurrent isMatchDFA safety fix
Fixed
- `isMatchDFA` concurrent safety (#137) — the prefilter candidate loop called the shared lazy DFA concurrently from `RunParallel`. On ARM64 without SIMD prefilters: cache corruption, 1.7GB allocs, 1s+ per op on M2 Max. Fix: prefilter for fast rejection only, pooled PikeVM for verification.
Added
- `TestConcurrentCaseInsensitivePrefilter` — 8 goroutines × 100 iterations, match + no-match paths. Catches the race on macOS with the `-race` flag.
Reported by @tjbrains on Apple M2 Max.
Full Changelog: v0.12.13...v0.12.14
v0.12.13: FatTeddy fix, prefilter acceleration, AC v0.2.1
What's Changed
Performance
- FatTeddy VPTEST hot loop — 8 instructions → 1 for candidate detection (24% faster scan)
- FatTeddy batch FindAllPositions — one ASM call per 64KB chunk, eliminates Go→ASM round trips. FindAll 39ms → 22ms (1.8x)
- Prefilter-accelerated isMatch/FindIndices — candidate loop with anchored DFA for large NFAs (>100 states). #137 match case: 176μs → 27μs
- Cascading prefix trim (Rust-style) — >64 literals trimmed to fit Teddy. auth_attempts 34ms → 7ms
Fixed
- FatTeddy AVX2 ANDL→ORL — lane combining missed single-lane patterns. `(?i)get|post|put`: 11456 → 34368 matches
- Non-amd64 build — added `hasAVX2` and batch stubs for macOS ARM64
Dependencies
- ahocorasick v0.1.0 → v0.2.1 — DFA backend + SIMD prefilter, 11-22x throughput
LangArena LogParser total: 757ms → 144ms (5.3x faster). Gap to Rust: 13x → 2.5x.
Full Changelog: v0.12.12...v0.12.13
v0.12.12: prefix trimming for case-fold literals
What's Changed
Performance
- Prefix trimming for case-fold expanded literals — when `(?i)` expansion produces >32 incomplete prefix literals, trims to 4-byte prefixes and deduplicates. Fits the Teddy SIMD prefilter instead of slower Aho-Corasick. LangArena `suspicious`: 117ms → 5.7ms (20x faster, 2.6x from Rust, was 54x).
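A sketch of the trim-and-dedup step (the helper name and string-based literals are hypothetical; the real extractor works on byte literals with Rust-style limits): truncating each expanded literal to a fixed prefix length makes many case variants collide, so the set shrinks enough to fit Teddy.

```go
package main

import "fmt"

// trimToPrefixes truncates each literal to at most n bytes and removes
// duplicates, preserving first-seen order. Case-fold expansions that
// differ only after byte n collapse into one prefix.
func trimToPrefixes(lits []string, n int) []string {
	seen := make(map[string]bool)
	var out []string
	for _, l := range lits {
		p := l
		if len(p) > n {
			p = p[:n]
		}
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Toy subset of case-fold expansions of "select": variants that
	// differ only after the 4th byte share a 4-byte prefix.
	lits := []string{"select", "selecT", "seleCt", "SELECT", "SELECt"}
	fmt.Println(trimToPrefixes(lits, 4))
}
```

Trimmed prefixes are incomplete (they can match where the full literal does not), so candidates found this way still need NFA/DFA verification downstream.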
Full Changelog: v0.12.11...v0.12.12