SIMD acceleration for rapidxml byte-scan hot paths#81

Draft
NelsonVides wants to merge 4 commits into perf/to_binary from perf/simd_parse

@NelsonVides
Collaborator

This is controversial and ugly: it vendors in SIMDe, it's a super verbose change, and C++26 is going to standardise most of this mess into hopefully something cleaner (though C++ has a history of making a worse mess of great ideas, and compilers won't be ready until like... 2028, if not later). Still, the results range from no change (and no regressions) up to 5x faster. The improvement grows with the length of the run that doesn't need escapes (big bodies, CDATA, etc.).

I'm experimenting with SIMD in general, might close this later :)


Add SIMDe-based 128-bit (SSE2/NEON) fast paths for
rapidxml::xml_document::skip() and the longer delimiter scans (comment,
CDATA, PI, declaration end). Each predicate dispatches to a SIMD
specialization via int/long overload tag, falling back to the original
scalar loop when no specialization exists.
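
As a rough illustration of the int/long overload-tag dispatch described above, here is a minimal standalone sketch. It is not the code from this PR: `fast_skip`, `skip_impl`, `whitespace_pred`, `digit_pred`, `path` and `result` are hypothetical names, and the "fast" body is a plain loop standing in for the vector code. It assumes a NUL-terminated input and C++11.

```cpp
#include <cassert>

// Records which overload handled the call, for illustration only.
enum class path { scalar, fast };
struct result { const char* end; path taken; };

// Hypothetical predicates (not rapidxml's actual predicate types).
struct whitespace_pred {
    static bool test(unsigned char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
};
struct digit_pred {
    static bool test(unsigned char c) { return c >= '0' && c <= '9'; }
};

// Primary template is empty, so fast_skip<Pred>::run does not exist by default.
template <class Pred> struct fast_skip {};

// Specialization: only predicates listed like this get a fast path.
template <> struct fast_skip<whitespace_pred> {
    static const char* run(const char* p) {
        // Stand-in for the vector loop; a real fast path would scan 16 bytes at a time.
        while (whitespace_pred::test(static_cast<unsigned char>(*p))) ++p;
        return p;
    }
};

// Preferred overload: the literal 0 is an int, so this is an exact match.
// The decltype removes it from overload resolution (SFINAE) when run() is missing.
template <class Pred>
auto skip_impl(const char* p, int) -> decltype(fast_skip<Pred>::run(p), result()) {
    return result{fast_skip<Pred>::run(p), path::fast};
}

// Fallback overload: int -> long is a conversion, so it only wins when
// the int overload above has been removed. This is the original scalar loop.
template <class Pred>
result skip_impl(const char* p, long) {
    while (Pred::test(static_cast<unsigned char>(*p))) ++p;
    return result{p, path::scalar};
}

// Single entry point: callers never name the tag themselves.
template <class Pred>
result skip(const char* p) {
    assert(p != nullptr);
    return skip_impl<Pred>(p, 0);
}
```

With this shape, `skip<whitespace_pred>(p)` resolves to the `int` overload and `skip<digit_pred>(p)` silently falls back to the `long` one, so adding a new fast path only requires adding a specialization.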

  • New c_src/simd_skip.hpp with 11 predicate specializations, scalar
    prefix for short runs (< 16 bytes), and a single-movemask OR-reduction
    pattern
  • rapidxml.hpp: forward declarations + skip() dispatch hooks; trim
    trailing whitespace and comment/CDATA/PI/declaration scans wired to
    SIMD helpers
  • exml.cpp: 15-byte zero tail padding so 16-byte SIMD loads stay in
    bounds; parse_next adjusted to use the logical (pre-padding) buffer
    end
  • rebar.config: -I c_src/simde for exml_nif.so (baseline NIF
    exml_nif_base.so stays scalar for A/B benchmarking)
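
To make the tail padding and the single-movemask OR-reduction concrete, here is a portable scalar model of the idea. It deliberately uses no intrinsics and no SIMDe calls, so it only mirrors the shape of the technique; `make_padded`, `whitespace_mask` and `skip_whitespace` are hypothetical names, and the scalar prefix for short runs is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kBlock = 16;          // bytes per 128-bit load
constexpr std::size_t kPad   = kBlock - 1;  // 15 zero bytes of tail padding

// Copy the input and append 15 zero bytes, so a 16-byte read that starts
// at the last logical byte still ends inside the buffer.
std::vector<unsigned char> make_padded(const std::string& s) {
    std::vector<unsigned char> buf(s.size() + kPad, 0);
    std::copy(s.begin(), s.end(), buf.begin());
    return buf;
}

// Scalar model of "compare against each character, OR the results,
// then do one movemask": bit i of the result is set when lane i is whitespace.
std::uint32_t whitespace_mask(const unsigned char* p) {
    unsigned char lanes[kBlock];
    for (std::size_t i = 0; i < kBlock; ++i) {
        const unsigned char c = p[i];
        // Four equality compares, OR-reduced into one all-ones/all-zeros lane.
        const bool eq = (c == ' ') | (c == '\t') | (c == '\n') | (c == '\r');
        lanes[i] = eq ? 0xFF : 0x00;
    }
    // The single "movemask": take the top bit of every lane.
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < kBlock; ++i) {
        mask |= static_cast<std::uint32_t>(lanes[i] >> 7) << i;
    }
    return mask;
}

// Skip whitespace starting at pos, 16 bytes at a time, and clamp the
// result to the logical (pre-padding) end of the buffer.
std::size_t skip_whitespace(const std::vector<unsigned char>& buf,
                            std::size_t logical_len, std::size_t pos) {
    assert(buf.size() >= logical_len + kPad);
    while (pos < logical_len) {
        const std::uint32_t non_ws = ~whitespace_mask(buf.data() + pos) & 0xFFFFu;
        if (non_ws != 0) {
            std::size_t lane = 0;
            while (((non_ws >> lane) & 1u) == 0) ++lane;  // first non-whitespace lane
            pos += lane;
            break;
        }
        pos += kBlock;  // whole block was whitespace
    }
    return std::min(pos, logical_len);
}
```

Because the padding bytes are zero and zero is not whitespace, a scan that runs into the padding stops there, and the final clamp keeps the reported position within the logical buffer.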

Uses 128-bit lane width via SIMDe so the same code maps 1:1 to both SSE2
(x86_64) and NEON (AArch64/Apple Silicon) without platform-specific
ifdefs or compile flags.

git-subtree-dir: c_src/simde
git-subtree-split: 71fd833d9666141edcd1d3c109a80e228303d8d7
@NelsonVides NelsonVides self-assigned this Apr 9, 2026
@NelsonVides NelsonVides added the WIP Don't review WIP-s! label Apr 9, 2026