Problem
Teddy SIMD multi-pattern prefilter is 8.6x slower than Rust on identical workloads despite implementing the same algorithm (PSHUFB nibble-based matching).
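For readers unfamiliar with the nibble-based matching mentioned above, here is a minimal scalar sketch of the idea in Go. It is an illustration only, not the coregex or Rust implementation: the names `nibbleMasks`, `buildMasks`, and `findFirst` are hypothetical, it fingerprints only the first byte of each pattern, and it assumes at most 8 non-empty patterns (one bit each). The two table lookups per byte are the step that PSHUFB performs on 16 bytes at once in the SIMD version.

```go
package main

import (
	"bytes"
	"fmt"
	"math/bits"
)

// nibbleMasks holds two 16-entry tables indexed by the low and high nibble
// of a byte. Bit i of an entry is set when pattern i starts with a byte
// that has that nibble. (Hypothetical type, for illustration only.)
type nibbleMasks struct {
	lo, hi [16]uint8
}

// buildMasks fills the tables from the first byte of each pattern.
// Assumption: at most 8 non-empty patterns; others are skipped and will
// never be reported.
func buildMasks(patterns [][]byte) nibbleMasks {
	var m nibbleMasks
	for i, p := range patterns {
		if len(p) == 0 || i >= 8 {
			continue
		}
		bit := uint8(1) << uint(i)
		m.lo[p[0]&0x0F] |= bit
		m.hi[p[0]>>4] |= bit
	}
	return m
}

// findFirst returns the offset and pattern index of the first match,
// or (-1, -1) if there is none.
func findFirst(m nibbleMasks, patterns [][]byte, haystack []byte) (int, int) {
	for i, b := range haystack {
		// Scalar equivalent of two PSHUFB lookups followed by an AND:
		// a set bit marks a pattern whose first byte equals b.
		cand := m.lo[b&0x0F] & m.hi[b>>4]
		for cand != 0 {
			k := bits.TrailingZeros8(cand)
			cand &= cand - 1 // clear the lowest set bit
			// Verify step: confirm the full pattern at this offset.
			if bytes.HasPrefix(haystack[i:], patterns[k]) {
				return i, k
			}
		}
	}
	return -1, -1
}

func main() {
	patterns := [][]byte{[]byte("agggtaaa"), []byte("tttaccct")}
	m := buildMasks(patterns)
	pos, pat := findFirst(m, patterns, []byte("ccagtttaccctgg"))
	fmt.Println(pos, pat) // prints "4 1"
}
```

The candidate mask only filters; the verify step decides whether a match is real. That find+verify loop is the part the rest of this issue is about.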
Benchmark (6 MB DNA input, AMD EPYC, regex-bench CI)
| Engine | dna_4 time | Throughput |
|---|---|---|
| Rust regex (Teddy AVX2, inlined) | 3.7 ms | 1.6 GB/s |
| coregex (Teddy SSSE3, Go asm) | 32 ms | 187 MB/s |
| Go stdlib regexp | 349 ms | 17 MB/s |
All 9 regexdna patterns affected (v0.12.3, UseTeddy strategy).
Root Cause
Go/assembly function call boundary overhead:
- Each findSIMD() call is a Go→asm→Go round trip (~50-65 cycles: function call + VZEROUPPER + register save/restore)
- 375K calls per 6 MB scan (16 bytes/call with SSSE3)
- AVX2 is disabled: it was 4x slower than SSSE3 because the VZEROUPPER cost dominated
- Rust avoids this entirely: #[inline(always)] keeps the whole find+verify loop in one native code block
Profiling (local, 6MB DNA)
FindAllIndicesStreaming: avg 25.1 ms
Raw Teddy.FindMatch: avg 24.0 ms (96%)
FindAll loop overhead: avg 1.1 ms ( 4%)
Nearly all of the scan time (96%) is spent inside Teddy.FindMatch, so the gap comes from the SIMD call path rather than from the FindAll loop.
Breakdown
| Factor | Estimated Impact |
|---|---|
| SSSE3 vs AVX2 (16 vs 32 bytes/iter) | ~2x |
| Go/asm boundary per findSIMD call | ~3-4x |
| Go method dispatch + bounds checks | ~1.2x |
| Combined | ~8-10x (measured: 8.6x) |
Potential Solutions
A. simd/archsimd Intrinsics (Go 1.26, GOEXPERIMENT=simd)
Rewrite the Teddy core using compiler intrinsics, which eliminates the Go/asm boundary entirely.
Needs a POC to verify: does Permute() lower to PSHUFB and ToMask() to PMOVMSKB, and how does the result perform?
B. Batch FindAll in Assembly
Single Go→asm call for entire haystack. Find+verify loop stays in asm.
Results written to pre-allocated buffer.
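As a rough sketch of what option B's interface could look like from the Go side, the block below uses a pure-Go stand-in for the assembly routine. Everything here is an assumption for illustration: the name `findAllBatch`, the single-needle signature, and the `(count, next)` return convention do not come from coregex. The point is the call shape: one call covers the whole haystack, matches go into a caller-owned buffer, and the caller only re-enters when the buffer fills up.

```go
package main

import (
	"bytes"
	"fmt"
)

// findAllBatch is a hypothetical pure-Go stand-in for a batched assembly
// routine. It scans haystack for non-overlapping occurrences of needle,
// writes their start offsets into out, and returns how many were written
// plus the offset at which scanning stopped (so a caller can resume when
// out was too small).
func findAllBatch(haystack, needle []byte, out []int) (count, next int) {
	if len(needle) == 0 {
		return 0, len(haystack)
	}
	i := 0
	for i+len(needle) <= len(haystack) {
		if count == len(out) {
			return count, i // buffer full: report where to resume
		}
		if bytes.Equal(haystack[i:i+len(needle)], needle) {
			out[count] = i
			count++
			i += len(needle)
		} else {
			i++
		}
	}
	return count, len(haystack)
}

func main() {
	haystack := []byte("acgtacgtaacgt")
	out := make([]int, 8) // pre-allocated once, reused across calls
	n, next := findAllBatch(haystack, []byte("acgt"), out)
	fmt.Println(out[:n], next) // prints "[0 4 9] 13"
}
```

With this shape the number of boundary crossings depends on the buffer size and match count rather than on the haystack length.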
C. Monolithic ASM Teddy
Entire FindMatch loop in assembly. Maximum performance, highest maintenance.
Research
Full analysis: docs/dev/research/teddy-10x-gap-analysis.md
Related
prefilter/teddy_ssse3_amd64.go:76-82