Skip to content

Perf script parity#1

Open
ixtli wants to merge 39 commits into
mjc:mainfrom
ixtli:perf-script-parity
Open

Perf script parity#1
ixtli wants to merge 39 commits into
mjc:mainfrom
ixtli:perf-script-parity

Conversation

@ixtli

@ixtli ixtli commented Jun 11, 2026

Copy link
Copy Markdown

check docs for the story

Summary by CodeRabbit

  • New Features

    • Added --inline flag to expand callchains into DWARF inline frames for fold and flamegraph commands.
    • Added AArch64 DWARF unwinding support alongside existing x86-64 support.
  • Bug Fixes

    • Fixed build-ID parsing to handle variable-length identifiers from perf records.
    • Improved file-identity matching using device major/minor and inode.
    • Fixed symbol offset reporting for kernel and module symbols.
  • Performance

    • Optimized mapping insertion and overlap detection.
    • Added memoization for unwinding info lookups.
  • Documentation

    • Added oracle testing framework and architecture-specific unwind specifications.

ixtli and others added 30 commits June 11, 2026 12:57
cargo metadata reports manifest paths as given while the crate root is
canonicalized, so target discovery failed through macOS /var symlinks.
Canonicalize before comparing.

scripts/perf-oracle builds a Docker image with perf + inferno, records
dwarf and fp call-graph profiles of a sample workload, exports perf
script text, and runs the pyroclast fold comparison inside the
container, leaving oracle artifacts under target/oracle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PERF-2/PERF-3: mmap-record ingestion was O(n) per record with an
unconditional global index rebuild (itself ~O(m^2) for the largest pid
bucket), plus a separate full-table linear scan in
has_overlapping_user_mapping_for_pid -- O(n^2) over the record stream on
large perf.data files.

Add a shared interval-overlap query (any_indexed_mapping_in_range) over
the existing max_end-augmented per-pid index. insert_mapping now tests
for overlap first and takes a pure incremental insert
(insert_mapping_without_overlap_fix) in the common no-overlap case,
skipping the whole-table mem::take and global rebuild entirely. The
actual-split path still uses the rebuild, preserving perf's exact
retained-mapping ordering and before/after split semantics byte for
byte. has_overlapping_user_mapping_for_pid routes through the same
index.

Complexity per mmap record:
- no-overlap (common): O(n) full scan + unconditional O(m^2) rebuild
  -> O(log n + matches) overlap probe + O(bucket suffix) insert.
- has_overlapping_user_mapping_for_pid: O(N) full-table scan
  -> O(log n + matches) per pid.
- actual split (rare): unchanged.

Behavior is identical: existing perfdata_mappings/records/fold tests
pass unchanged (same pre-existing macOS baseline failures, no new ones).
Adds tests for multi-overlap split ordering, the no-overlap fast path,
and a linear-scan oracle for the indexed overlap query (incl.
zero-length and adjacency edge cases).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
remove_overlapping_mappings_like_perf re-scanned the whole mapping table
and rebuilt every pid index on each mmap record; the overlap probe now
walks the max_end-augmented per-pid index and the no-overlap case
inserts incrementally. Overlap splits keep the original perf-faithful
rebuild path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The procfs walk was cfg-gated to Linux even though it takes an
injectable root, so the thread-lister tests could not run elsewhere. A
plain read_dir over <root>/<pid>/task is behavior-identical on Linux
and portable, and procfs was the crate's only consumer.

Also add docs/parity-findings.md recording the parity gaps, bugs, and
performance issues found while building the perf oracle harness.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Profile kinds without a real backend on the current platform silently
ran FakeBackend and wrote fake artifacts; report a clear unsupported/
not-implemented error instead. Backend-selection tests now pass an
explicit "linux" platform through the existing _on_platform entry
points so the suite no longer assumes a Linux host, and the NixOS
System.map test canonicalizes its expectation through tempdir symlinks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
struct perf_file_header (tools/perf/util/header.h) lays out the
adds_features DECLARE_BITMAP at byte offset 72: magic(8) + size(8) +
attr_size(8) + attrs(16) + data(16) + event_types(16) = 72. pyroclast
read the bitmap at offset 56, which is the event_types perf_file_section,
so it saw no features at all on real perf.data files. That silently
disabled HEADER_BUILD_ID resolution and (once parsed) HEADER_EVENT_DESC
event naming.

The synthetic perf.data builders in the tests wrote the feature bits at
the same wrong offset, so they masked the bug; move them to offset 72 to
match the kernel layout.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tools/perf/util/build-id.c write_buildid() emits the
perf_record_header_build_id records in the HEADER_BUILD_ID feature
section WITHOUT setting header.type (it stays 0), sets
PERF_RECORD_MISC_BUILD_ID_SIZE in misc, and stores the real build-id
length in the size byte at offset 20 of the 24-byte build_id field
(record offset 32).

pyroclast rejected these records because it required
header.type == PERF_RECORD_HEADER_BUILD_ID (67). That check never fired
before because the feature bitmap was read at the wrong offset, so no
build-id feature was ever found; now that features parse, real perf.data
files hit this path. Drop the type gate (record framing already comes
from the size field, matching perf_header__read_build_ids) and honor the
size byte.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… spacing

perf script prints evsel names verbatim from the HEADER_EVENT_DESC
feature (write_event_desc/read_event_desc in tools/perf/util/header.c)
rather than reconstructing them from attr type/config, so e.g. it shows
"task-clock:ppp" where pyroclast reconstructed "task-clock". Parse the
EVENT_DESC feature (nre, attr_sz, then per event: attr bytes, nr_ids, a
do_write_string name, and the id array) and use those names, matching
each attr by shared sample id with an event-index fallback (as
process_event_desc pairs descriptions to evsels). Fall back to the
type/config reconstruction when no description is present.

Also match builtin-script.c's header spacing: it prints the event name
with `fprintf(fp, "%*s: ", name_width, evname)` (trailing space) then
`fputc(cursor ? '\n' : ' ')`. The multi-frame callchain path
(cursor set) therefore ends the header line with "<name>: \n"; pyroclast
was emitting "<name>:\n".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Groundwork for aarch64 DWARF user unwinding (the dwarf call-graph path
is the parity priority): PerfAarch64Regs decodes PERF_REG_ARM64 masks
(fp=29 lr=30 sp=31 pc=32) into framehop UnwindRegsAarch64, the no-CFI
fallback mirrors elfutils backends/aarch64_unwind.c (caller pc from lr,
next lr/fp at fp+8/fp+0, next sp fp+16, accept iff fp == 0 || newSp >
sp), and parse_header_arch reads the HEADER_ARCH feature string used to
pick the register model. Not yet wired into the fold path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Plain `perf script` and `perf | inferno-collapse-perf` print exactly one
frame per callchain entry, named from the ELF symtab
(builtin-script.c sample__fprintf_sym without --inline). pyroclast was
expanding every entry into its DWARF inline frames by default, so a
single quicksort IP rendered as a stack of small_sort/swap_if_less/etc.

Add an `inline` flag to FoldOptions (exposed as `--inline` on fold and
perf-script) and gate inline expansion behind it. The default path now
resolves only the base object symbol via resolve_*_base_* and renders one
line per entry; the prefetcher routes those frames to the base batch.
`--inline` keeps the perf-script-style inline expansion.

Two rendering fixes on the symbolized script path:
- Print the mapping's full DSO name for every mapped frame
  (builtin-script.c map__fprintf_dsoname) instead of the hardcoded
  "([unknown])", including the multi-label inline branch and
  InlineCurrentIp chains.
- Route mapped symbol labels through write_perf_script_mapped_symbol_frame
  rather than the label writer that re-appended "+0x0" to a label that
  already carried its symbol offset (e.g. "quicksort+0x6cc+0x0").

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
scripts/oracle/run-in-container.sh recorded perf.data AND ran the
pyroclast comparison in one shot, so iterating on pyroclast meant
re-recording every time. Split it:

- record-in-container.sh records only the oracle inputs
  (*.perf.data, *.perf.script, *.inferno.folded).
- compare-in-container.sh rebuilds pyroclast incrementally
  (CARGO_TARGET_DIR under /oracle-out/target), regenerates only the
  *.pyroclast.* artifacts and bench reports against the already-recorded
  inputs, and rebuilds /tmp/oracle-workload so symbol resolution by path
  works in a fresh compare container, then prints the fp script/folded
  diffs and the inferno_compare scoreboard.
- run-in-container.sh now just runs record then compare.

scripts/oracle-compare drives compare-in-container.sh from the host,
mounting this worktree at /work and the main checkout's target/oracle at
/oracle-out, without re-recording or overwriting the recorded inputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The base/symtab default changed three behaviors that existing tests
encoded:
- perf script header lines end with the event-name colon plus a space
  ("%*s: " then fputc('\n') in builtin-script.c), so the run_cli
  perf-script expectations gain that trailing space before the newline.
- Symbolized fold/perf-script now resolve the single base ELF symbol by
  default; the external addr2line resolver only runs on the --inline
  path. The benchmark/run_cli tests that verify addr2line invocation now
  request --inline (FoldArgs/PerfScriptArgs/FlamegraphArgs gained the
  flag, and the benchmark helpers gained an inline parameter).

The linux-perf `profile` backend is a human-facing flamegraph command,
not a byte-parity target, so it keeps the DWARF inline expansion it has
always emitted (inline: true).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
perf recorded the C library mapping as "/ (deleted)" (unlinked during
the run) but symbolized it at record time from the live mapping, storing
the DSO build-ids in HEADER_BUILD_ID. pyroclast resolves build-ids out of
~/.debug/.build-id/<aa>/<rest>/elf, so seed that cache from the
container's on-disk libc/ld/libgcc (whose build-ids match the recording,
since record and compare share the image). Without it the deleted libc is
the only residual diff in the fp script comparison.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The benchmark scoreboard (benchmark_fold_options) weights stacks by
sample period, so generate the standalone *.pyroclast.folded artifact the
same way; otherwise the printed folded diff disagrees with inferno's
period-weighted counts purely on the count column.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fixes from oracle byte-diffing: the perf feature bitmap was read at
offset 56 instead of 72 (perf_file_header places adds_features after
the three file sections), silently disabling HEADER_EVENT_DESC and
header build-ids; feature-section build-id records carry type 0 and a
misc size flag; event names come verbatim from HEADER_EVENT_DESC with
attr-id matching; mapped frames print the mapping's full DSO path in
every branch; offset labels are no longer double-suffixed; and inline
expansion moved behind --inline (perf script only expands inline frames
for dwarf callchains; the plain output prints one symtab-named line per
entry). Adds scripts/oracle-compare for fast in-container comparisons
against recorded oracle inputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
git rev-parse --git-common-dir returns plain .git from the main
checkout, which sent oracle-compare looking for inputs under
.git/target/oracle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PerfDwarfNameResolver re-parsed each object's DWARF and re-walked the
DIE trees of every matching unit on every resolve batch, once per
finished round per hot DSO — the dominant fold cost with --inline and
the likely cause of the historical bench timeout. CachedObjectMetadata
now memoizes the per-unit frame indexes: unit ranges are scanned once
and each unit's index is built on the first batch whose addresses land
in it, with an append-only name interner keeping ids stable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
perf.data records the recording machine's arch in HEADER_ARCH, not per
attr, so the whole fold/script path needs one arch to decode REGS_USER
samples and pick the matching framehop unwinder.

Introduce an arch-neutral PerfUserRegs enum over PerfX86_64Regs and
PerfAarch64Regs with ip()/sp()/is_syscall_return_state() accessors, and
make FramehopUnwinder hold a per-arch unwinder+cache (UnwinderX86_64/
CacheX86_64 vs framehop::aarch64::UnwinderAarch64/CacheAarch64). Module
registration (add_module from the shared ExplicitModuleSectionInfo) stays
arch-agnostic; only the seeded registers and iter_frames differ.

Read the arch once per file via parse_header_arch (reading the feature
table+payload from the File the way build_id_events_from_file does),
map through PerfArch::from_header_arch defaulting to x86_64, store it on
FoldAccumulator, and construct each PidUnwindState's unwinder for that
arch. append_perf_user_unwind_frames now decodes regs by arch and the
downstream sites take the neutral type.

Arch-gate the x86-only logic faithfully to elfutils:
- is_syscall_return_state has no arm64 analogue (always false).
- the no-CFI ebl_unwind fallback keeps x86's bp>=sp precondition plus
  unwind_x86_64_frame_pointer_stack_like_elfutils, while aarch64 calls
  unwind_aarch64_frame_pointer_stack_like_elfutils with no bp/sp
  precondition (its accept condition is internal).
- frame_pointer_at_or_above_stack_pointer is computed per arch rather
  than forcing a fake bp/sp onto arm64.

x86_64 behavior is unchanged: 632 pass / 5 pre-existing arch-fixture
failures, same as the 1c242da baseline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cover the arm64 path end to end: a synthetic perf.data with a HEADER_ARCH
"aarch64" feature and a REGS_USER/STACK_USER dwarf sample (sparse fp/lr/
sp/pc mask) folds through the elfutils aarch64 frame-pointer fallback to
the expected leaf + lr-derived caller.

framehop's aarch64 unwinder yields only the seed pc when no CFI covers
it (unlike x86_64, whose own frame-pointer recovery advances the stack),
which is precisely when libdwfl invokes ebl_unwind on the leaf. Fire the
aarch64 fp-chain fallback when framehop produced no caller beyond the
sampled pc (empty or seed-only), keeping the x86_64 trigger on truly
empty framehop output so x86_64 behavior is unchanged. Add unit tests for
both the seed-only fire and the multi-frame keep cases.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The dwarf oracle was recorded on arm64, where `perf script` expands DWARF
inline frames by default, so the recorded dwarf.perf.script carries inline
frames. To line up frame counts in the comparison the dwarf fold must run
with inline on.

Add a --inline flag to BenchArgs/pyroclast-bench and thread it through
run_bench_command -> run_streaming_comparison_with_symbols ->
run_pyroclast_stream so the folded benchmark output uses
benchmark_fold_options(inline). The fp pair stays inline-off (its oracle
was recorded without inline).

In scripts/oracle/compare-in-container.sh, pass --inline to the dwarf
bench plus the dwarf `plumbing fold`/`plumbing perf-script` artifact
regen, and emit per-name script/folded diffs and scoreboards (previously
only fp printed) so the dwarf comparison is visible.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PerfUserRegs dispatches per-arch register decoding (HEADER_ARCH picks
the model, default x86_64), FramehopUnwinder hosts per-arch framehop
unwinders behind shared module registration, and the no-CFI fallback is
arch-gated: x86_64 keeps the bp>=sp rbp chain on empty unwinds, aarch64
applies the elfutils lr/fp walk whenever framehop yields only the seed
pc. The bench and dwarf oracle compare learn --inline since perf script
expands inline frames for DWARF callchains by default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
perf's modern external-addr2line srcline backend names each inline frame
from the mangled symtab/DWARF linkage name returned by `addr2line -f -i`
(tools/perf/util/addr2line.c passes `-a -i -f`, never `-C`) and demangles
it itself with the Rust v0 demangler in alternate form
(tools/perf/util/srcline.c new_inline_sym -> tools/perf/util/symbol.c
dso__demangle_sym -> rust_demangle_display_demangle(.., /*alternate=*/true)).
That yields fully-qualified names with generic args preserved and no
trailing ::h<hash>.

pyroclast's DIE walk was modeled on a libdw-backed perf and emitted the
bare DW_AT_name leaf (`sort<u64, ...>`). Source each frame's name from
DW_AT_linkage_name on the inlined_subroutine's DW_AT_abstract_origin /
subprogram DIE instead, demangled via addr2line::demangle_auto (byte-
identical to perf's alternate Rust demangle for both `_ZN` and `_R`),
falling back to the DW_AT_name spelling when no linkage name exists
(closures and shim DIEs). Verified name-for-name against
target/oracle/dwarf.perf.script (perf 6.17.13, addr2line backend).

The fixture-gated tests/symbols.rs prefer-DIE-names test encoded the
abandoned libdw spelling (asserted pyroclast differs from addr2line); it
now asserts pyroclast matches external-addr2line's qualified frames.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
These tests mapped the host test binary, so their fixtures depended on
the host: a Mach-O/arm64 test binary recovers callers through
__unwind_info at offsets where a Linux x86_64 binary has none, which is
why they failed off-Linux. A handcrafted ELF with one PT_LOAD and an
eh_frame whose only FDE covers an unrelated range pins the encoded
libdw scenario on every host: the module reports with CFI present, the
sampled IP is uncovered, and framehop stops at the seeded IP rather
than taking its unknown-module frame-pointer fallback. The object
imports now gate on the linux-only libc leaf test that still parses a
real binary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
perf's machine.c unwind_entry() runs append_inlines() on EVERY accepted
unwind entry, including the initial sampled IP, so `perf script --inline`
expands the leaf's DWARF inline chain just like a caller frame
(.ace-research-perf-unwind.md §4). pyroclast classified the leaf
(index 0, address == regs.ip) as FoldFrame::InlineCurrentIp and rendered
it from the single base symtab symbol, so addresses like aaaabe3d880c
printed only `quicksort+0xe4` where perf expands 6 inline frames + base.

Route InlineCurrentIp leaves through the full DWARF inline path (prefetch,
script, and folded) whenever --inline is set; keep the base-only rendering
for the no-inline default. Updated the prefetch-batching test to the
unwind_entry() behavior with a perf-source citation.

Also drop the .debug_str generic specialization from the inline-frame
resolution path. Inline names now come from the demangled DWARF linkage
name (matching perf's external-addr2line backend), so they are already
fully qualified; re-specializing rewrote e.g.
`core::slice::<impl [T]>::sort_unstable` to `sort_unstable<u64>`, a spelling
perf never prints (verified against target/oracle/dwarf.perf.script). The
no-inline base path keeps the specialization (fp oracle path, unchanged).

Folded dwarf oracle is now byte-identical to inferno on every line except
the documented PAC-tagged [unknown] frames and kernel kallsyms frames.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
perf's evsel_fprintf.c prints each callchain node as
`sym+0x<off> <dso-or-inlined>`: __symbol__fprintf_symname_offs appends
`+0x<off>` (`al->addr - sym->start`), inline frames substitute ` (inlined)`
for the DSO name (`print_dso && (!sym || !sym->inlined)`), and the trailing
non-inline base frame prints the mapped DSO path. Crucially every frame of
one inline group shares the SAME offset, because new_inline_sym reuses
base_sym->start for the fake inline symbols (tools/perf/util/srcline.c).

pyroclast printed inline frames as `name (dso)` with no offset and the real
DSO path on every line. Thread the shared `+0x<off>` suffix (the symtab base
symbol's offset) through ResolvedSymbolFrames -> the frame cache -> the
script writer, and render the inline chain leaf-first with ` (inlined)` on
all but the trailing base frame. The single non-inline base frame keeps its
already-baked with_offset label (no double offset).

The dwarf perf-script oracle diff is now empty except the documented
PAC-tagged [unknown] frames and kernel kallsyms frames.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…chine

perf script resolves kernel callchain frames through kallsyms; pyroclast
folded them as `[[kernel.kallsyms]]` / printed `[unknown]
([kernel.kallsyms]_stext)`. The recorded perf.data carries a kernel
build-id, and can_use_system_kernel_symbols() refused live /proc/kallsyms
for [kernel.kallsyms] requests to avoid symbolizing against a mismatched
kernel.

Read the running kernel's GNU build-id from /sys/kernel/notes
(NT_GNU_BUILD_ID) and allow live /proc/kallsyms when it matches the
perf.data's recorded kernel build-id — the recording machine, where live
kallsyms describes the same kernel perf symbolized against (the oracle
container records and compares in the same image). Confirmed the recorded
and live build-ids match (cb97c0ad...3644a0).

Also make kernel frames perf-script-faithful:
- resolve them as `name+0x<off>` (tools/perf/util/symbol_fprintf.c
  __symbol__fprintf_symname_offs prints the offset, incl. +0x0); the folded
  path strips it like every other frame.
- print the DSO column as `[kernel.kallsyms]`, not the relocation-reference
  map name `[kernel.kallsyms]_stext` (perf names the kernel dso
  `[kernel.kallsyms]` in machine__create_kernel_maps).

Updated the kernel-resolution tests to assert the +0x<off> perf-script
spelling with a symbol_fprintf.c citation.

The dwarf perf-script and folded oracle diffs are now empty except the
documented PAC-tagged [unknown] frames perf prints from un-stripped lr
values.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Inline frame names come from DW_AT_linkage_name demangled the way perf
demangles symbols itself (perf's external addr2line runs without -C);
sampled-IP leaves expand through the full inline chain like
machine.c unwind_entry() running append_inlines() on every accepted
entry; inline script lines render sym+0xoff (inlined) sharing the base
frame's offset; and [kernel.kallsyms] frames resolve from live
/proc/kallsyms when /sys/kernel/notes matches the recorded kernel
build-id. The dwarf oracle script and folded outputs now match real
perf script byte-for-byte apart from PAC-tagged [unknown] frames perf
prints from un-stripped aarch64 lr values.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ixtli and others added 9 commits June 11, 2026 15:26
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
file_matches_recorded_identity compared only the inode, ignoring the
device major/minor that PERF_RECORD_MMAP2 records alongside it. Inode
numbers are unique only within a single filesystem, so two files sharing
an inode on different mounts would be treated as the same backing store.

perf's __dso_id__cmp (tools/perf/util/dso.c) compares maj/min/ino
together as the dso backing-storage identity, never the inode alone.
Decompose the on-disk st_dev with the glibc MAJOR()/MINOR() encoding the
kernel uses when emitting the mmap2 device fields, and require all three
to match. This keeps the helper faithful to perf's identity comparison
and forecloses cross-filesystem false matches.

Note: perf never uses this device/inode identity to reject an on-disk
file before symbolizing/unwinding (dso__load and do_open trust the path,
validating only build-ids when both are defined); should_load_unwind_object
already ignores file_identity to match that. This helper is solely the
dso-instance identity comparison.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The same on-disk object can reach the fold path two ways: as an inline
PERF_RECORD_MMAP2 build-id record (build_id, no file_identity) or as a
plain MMAP2 carrying file_identity plus a HEADER_BUILD_ID entry that
supplies the same build_id. Both the MmapTable SymbolSourceKey and the
SymbolRequest cache key mixed file_identity in alongside build_id, so the
two forms hashed to different keys and the same DSO resolved through two
symbol-cache entries — re-parsing its DWARF twice and risking split fold
counts.

perf identifies a dso backing store via __dso_id__cmp
(tools/perf/util/dso.c): once both sides carry a defined build_id it is
the decisive comparison, and the mmap2 maj/min/ino are only weighed when
both ids recorded them (mmap2_valid). Match that: when a build_id is
present, drop file_identity from the identity key and rely on
(path, build_id); fall back to file_identity only when no build_id
exists. Distinct build_ids at the same path stay distinct, so a replaced
binary is still a separate symbol source.

Applied consistently to SymbolSourceKey (driving symbol_source_id) and to
SymbolRequest's Eq/Hash/Ord via a shared identity_file_identity() so the
two cache layers agree. Oracle fp/dwarf script and folded diffs are
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Close the last two open unwind-parity gaps (pyroclast-5gr, pyroclast-pkh)
by modeling libdwfl firing the initial-frame callback exactly once.

Gap-5gr (leaf-only emission): perf_accepted_object_unwind_frames now
emits the single sampled-IP leaf when the scenario-D predicate holds —
the initial IP reported into a module, no FDE covers it
(!has_unwind_info_for_ip), and the arch ebl_unwind fallback provably
cannot advance (x86_64 bp < sp; aarch64 lr == 0). This mirrors
libdwfl/dwfl_frame.c calling the callback for the seeded frame before
__libdwfl_frame_unwind, where handle_cfi finds no FDE row (so it cannot
allocate an unwound frame and falls through) and ebl_unwind is guarded
off (backends/x86_64_unwind.c `if (sp >= fp) return false;` /
backends/aarch64_unwind.c `if (lr == 0 ... ) return false;`). When CFI
DOES cover the IP we cannot tell PC_UNDEFINED end-of-stack from a real
PC_SET caller a priori, so that case stays MustUnwind and runs framehop.
The predicate is exactly the narrow §3.6/§4 one from the research doc;
broader rules reintroduced the measured 10.1B/41.2B overcounts.

Gap-pkh (skip gate): the same predicate is classified BEFORE framehop
(SkipUnwind | LeafOnly | MustUnwind), so leaf-only and zero-frame
samples never pay for a framehop unwind. The leaf truncation lives in
the shared acceptance tail, so the gated (framehop-skipped) and ungated
paths are byte-identical. Two caches back the gate: an exact-ip
CFI-presence memo on FramehopUnwinder (PERF-6) and a per-(pid, ip)
leaf-only eligibility memo on PidUnwindState, both invalidated with the
pid's unwind state when its mappings change or it forks.

PERF-4: report_unwind_module_for_ip_like_perf now distinguishes a newly
loaded module from an already-present one, so the module-report retry
loop only re-unwinds when a pass actually loaded something new, dropping
the redundant final re-unwind.

Also remove the KeepDsoLeaf/DropSyntheticCurrentIp policy: perf's
frame_callback fires for the initial frame regardless of whether the
covering module is a shared object or the main executable, so the
.so-vs-exe split had no perf-source basis.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Synthetic TDD over the SyntheticX86_64Object fixture (plus a new
SyntheticAarch64Object) covering the gap-5gr / gap-pkh predicate:

- leaf emitted end-to-end when the predicate holds (x86_64 bp < sp, no
  FDE covering the sampled IP; aarch64 lr == 0)
- NOT routed through the leaf-only path when bp >= sp (elfutils rbp
  fallback territory -> MustUnwind, framehop authoritative)
- NOT truncated to a leaf when an FDE covers the IP (handle_cfi
  PC_UNDEFINED vs PC_SET is indistinguishable a priori, so MustUnwind)
- the skip gate yields the perf-correct single leaf, byte-identical to
  letting the shared acceptance tail truncate after framehop ran

Unit tests pin the predicate edges directly:
arch_fallback_provably_cannot_advance (x86_64 bp<sp / aarch64 lr==0),
classify_object_unwind routing, and perf_accepted_object_unwind_frames
truncating a spurious framehop caller to the leaf only when the
predicate holds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cover the gap-pkh has_unwind_info_for_ip memo: a repeated query is
consistent, and adding a module clears the memo so a previously-cached
"uncovered" ip recomputes to "covered" once a module with CFI over it
loads. Pins the byte-identity guarantee that the memo never diverges
from the linear FDE-range scan.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mark both unwind-parity issues closed: the libdw scenario-D leaf-only
emission (5gr) and the pre-framehop skip gate + caches (pkh) are
implemented in 62e3d62 with tests in c49bb8b/503e141. Oracle dwarf/fp
diffs unchanged (documented PAC and "/ (deleted)" exclusions only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
libdwfl fires the frame callback once for the seeded IP before
unwinding, so perf keeps a single-leaf stack whenever the initial IP
reported into a module and neither CFI nor the arch ebl fallback can
produce a caller (x86_64: no FDE at ip and bp < sp per
backends/x86_64_unwind.c's sp >= fp+16 guard; aarch64: no FDE and
lr == 0). The same predicate classifies samples before framehop runs
(SkipUnwind | LeafOnly | MustUnwind, memoized per (pid, ip) and
invalidated with the pid's unwind state), CFI presence is memoized
per ip, and the module-report retry loop only re-unwinds when a new
module actually loaded. When CFI covers the ip the sample stays
MustUnwind: an FDE row with undefined RA and one with a real caller
are indistinguishable a priori (libdwfl handle_cfi). The unsourced
.so-vs-exe initial-frame policy split is removed. Closes
pyroclast-5gr and pyroclast-pkh.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements architecture-aware unwinding with inline frame expansion for pyroclast's perf.data folding. It adds support for aarch64 alongside x86_64, parses perf header metadata for architecture and event names, introduces symbol offset suffixes to match perf-script output, refactors mappings for device-major/minor identity matching, and includes a Docker-based oracle harness for parity validation against perf/inferno.

Changes

Architecture-aware unwinding with inline frame expansion

Layer / File(s) Summary
Perf header parsing for architecture and event metadata
src/perfdata/header.rs, tests/perfdata_header.rs
Parse HEADER_ARCH feature to extract recording-machine architecture (x86_64/aarch64) and HEADER_EVENT_DESC to resolve event names, enabling architecture-aware sample decoding and event name matching in folding.
Architecture abstractions: PerfArch, PerfUserRegs, AArch64 decoding
src/perfdata/unwind.rs, tests/perfdata_unwind.rs
Introduce PerfArch enum and PerfUserRegs union type to decode perf's register masks into architecture-specific register layouts; add PerfAarch64Regs with elfutils-compatible frame-pointer-chain unwinding and memoized CFI coverage checks.
Folding engine: arch-aware unwinding and inline frame rendering
src/perfdata/fold.rs, tests/perfdata_fold.rs
Refactor fold.rs to use PerfUserRegs in unwinding paths, implement leaf-only optimization with per-(pid,ip) CFI memoization, add inline frame expansion in perf-script and folded output conditionally based on FoldOptions::inline flag.
Symbol resolution with offsets and build-id-aware keying
src/symbols.rs, tests/symbols.rs
Add base_offset field to ResolvedSymbolFrames, implement offset-aware kallsyms and addr2line resolution, change SymbolRequest cache key to use build_id as discriminator, add GNU build-id note parsing for live kernel build-id verification.
Mapping overlaps and file identity matching
src/perfdata/mappings.rs, tests/perfdata_mappings.rs
Implement O(log n + matches) fast-path incremental insert for non-overlapping mappings, update file-identity matching to use device major/minor + inode (perf-compatible), refactor symbol_source_key to drop file_identity when build_id present to unify deduplication.
Build-id event parsing with size-driven length
src/perfdata/build_id.rs, tests/perfdata_build_id.rs
Update parse_build_id_event to derive build-id length from PERF_RECORD_MISC_BUILD_ID_SIZE bit and size-byte location rather than record-type validation, enabling parsing of perf-generated HEADER_BUILD_ID payloads.
CLI wiring and benchmark integration of inline flag
src/cli.rs, src/lib.rs, src/benchmarks.rs, src/backends/linux_perf.rs, tests/benchmarks.rs
Add --inline flag to FoldArgs/PerfScriptArgs/FlamegraphArgs, thread inline setting through benchmark streaming comparisons, enable inline by default in flamegraph and linux_perf backends, implement benchmark_fold_options helper.
Test updates: arch-aware unwinding, inline rendering, offset expectations
tests/perfdata_fold.rs, tests/run_cli.rs, tests/symbols.rs, tests/perfdata_header.rs, tests/perfdata_mappings.rs, tests/perfdata_unwind.rs
Update existing tests to use PerfUserRegs wrapper, add aarch64 register decoding and unwind tests with synthetic ELF fixtures, adjust perf-script output expectations for trailing spaces and offset suffixes, extend symbolization tests for qualified frame names and build-id cache keys.
Oracle harness: Docker, scripts, workload for perf parity validation
scripts/oracle/*, scripts/perf-oracle, scripts/oracle-compare
Implement containerized oracle workflow to record synthetic workload traces and compare pyroclast vs perf/inferno output, including build-id cache seeding, configurable inline modes, and diff-based parity reporting.
Documentation updates and issue closure
.ace-aarch64-unwind-spec.md, .ace-research-perf-unwind.md, .ace-review-findings.md, docs/parity-findings.md, .beads/issues.jsonl
Document aarch64 DWARF unwind specification and implementation plan, capture perf/libdwfl control-flow research, enumerate performance/correctness findings from code review, update parity-findings with oracle results and resolved architecture/inline-name gaps, close related tracking issues.
Miscellaneous: dependencies, platform helpers, gitignore
Cargo.toml, src/platform.rs, .gitignore, src/cargo_cli.rs
Add signal-hook Unix dependency and remove procfs dependency, refactor linux_thread_ids_from_proc to direct task directory enumeration, improve cargo_cli package discovery with manifest path canonicalization, add .claude/ to gitignore.

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 Hops through arches x86 and aarch64,
Inline frames expand, with offsets ornate,
Perf oracle watches, comparing with glee,
Symbol stacks nested, in harmony! 🎪

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch perf-script-parity

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/benchmarks.rs (2)

369-385: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

inline only reaches the Pyroclast side of streaming comparisons.

run_streaming_comparison_with_symbols(..., inline = true) passes the flag into run_pyroclast_stream, but the Inferno producer still serializes perf-script input with benchmark_fold_options(false). That makes inline benchmarks compare different frame models and report false diffs whenever the perf script is generated from perf.data instead of supplied externally.

Suggested fix
 pub fn run_streaming_comparison_with_symbols<R>(
     perf_data: &Path,
     perf_script: Option<&Path>,
     runner: &R,
     symbols: bool,
     inline: bool,
 ) -> Result<StreamingComparisonReport, String>
 where
     R: CommandRunner + Sync,
 {
     thread::scope(|scope| {
         let (pyro_tx, pyro_rx) = sync_channel(64);
         let (inferno_tx, inferno_rx) = sync_channel(64);
         let pyro_thread =
             scope.spawn(move || run_pyroclast_stream(perf_data, runner, symbols, inline, pyro_tx));
         let inferno_thread = scope
-            .spawn(move || run_inferno_stream(perf_data, perf_script, runner, symbols, inferno_tx));
+            .spawn(move || run_inferno_stream(perf_data, perf_script, runner, symbols, inline, inferno_tx));
 fn run_inferno_stream<R>(
     perf_data: &Path,
     perf_script: Option<&Path>,
     runner: &R,
     symbols: bool,
+    inline: bool,
     line_sender: SyncSender<String>,
 ) -> Result<ProducerResult, String>
 where
     R: CommandRunner + Sync,
 {
@@
                     if symbols {
                         let resolver =
                             perf_symbol_resolver_for_current_home(runner, &export_perf_data);
                         write_inferno_perf_script_file_with_symbols(
                             &export_perf_data,
-                            benchmark_fold_options(false),
+                            benchmark_fold_options(inline),
                             &resolver,
                             &mut stdin,
                         )
                     } else {
                         write_inferno_perf_script_file_with_options(
                             &export_perf_data,
-                            benchmark_fold_options(false),
+                            benchmark_fold_options(inline),
                             &mut stdin,
                         )
                     }?;

Also applies to: 753-773

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/benchmarks.rs` around lines 369 - 385, The inline flag is only passed to
run_pyroclast_stream causing Inferno to still serialize perf-script input with
benchmark_fold_options(false); update run_streaming_comparison_with_symbols to
forward the inline boolean into run_inferno_stream (in the scope.spawn call) and
ensure the Inferno producer uses benchmark_fold_options(true) when inline is
true so both producers use the same frame-model; locate the
run_streaming_comparison_with_symbols function and adjust the call to
run_inferno_stream and any benchmark_fold_options usage to accept and respect
the inline parameter (also apply the same change to the other occurrence
mentioned).

187-208: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor inline in the unsymbolized fold benchmark path.

run_fold_benchmark_with_runner(..., symbols = false, inline = true) still falls back to run_fold_benchmark(input), which hardcodes benchmark_fold_options(false). The new parameter is therefore ignored for the non-symbolized branch, so the reported folded byte/line counts are wrong whenever callers request inline expansion.

Suggested fix
 pub fn run_fold_benchmark_with_runner<R>(
     input: &Path,
     runner: &R,
     symbols: bool,
     inline: bool,
 ) -> Result<FoldBenchmarkReport, String>
 where
     R: CommandRunner,
 {
     if symbols {
         let resolver = perf_symbol_resolver_for_current_home(runner, input);
         run_fold_benchmark_with_writer(input, |writer| {
             write_folded_perfdata_file_with_symbols(
                 input,
                 benchmark_fold_options(inline),
                 &resolver,
                 writer,
             )
         })
     } else {
-        run_fold_benchmark(input)
+        run_fold_benchmark_with_writer(input, |writer| {
+            write_folded_perfdata_file_with_options(
+                input,
+                benchmark_fold_options(inline),
+                writer,
+            )
+        })
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/benchmarks.rs` around lines 187 - 208, The non-symbol branch in
run_fold_benchmark_with_runner ignores the inline parameter by calling
run_fold_benchmark(input) which uses a hardcoded benchmark_fold_options(false);
change the else branch to call run_fold_benchmark_with_writer and invoke
write_folded_perfdata_file (or the equivalent non-symbol writer function)
passing benchmark_fold_options(inline) so inline expansion is honored. Update
run_fold_benchmark_with_runner to use run_fold_benchmark_with_writer with a
closure that calls write_folded_perfdata_file(input,
benchmark_fold_options(inline), writer) instead of run_fold_benchmark(input).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.ace-aarch64-unwind-spec.md:
- Around line 37-38: The spec incorrectly documents HEADER_ARCH as feature bit
5; update the documentation to state HEADER_ARCH is feature bit 6 to match the
implementation used by header_arch_from_file and header.rs (leave the perf
string format note intact: u32 len + bytes, see perf util/header.c
write_arch/read). Ensure any mention of "feature bit 5" is replaced with
"feature bit 6" so the doc and the code remain consistent.

In @.ace-research-perf-unwind.md:
- Around line 159-163: The fenced code blocks around the snippet containing
symbol_conf.hide_unresolved, entry->ms.sym, append_inlines and
callchain_cursor_append need blank lines before and after to satisfy MD031; add
an empty line immediately above the opening ``` and immediately below the
closing ``` for the blocks at the cited locations. Also fix the ordered-list
prefix style violations (MD029) in the block covering lines 347–357 by making
the list prefixes consistent (e.g., convert all list markers to numeric "1." or
to the chosen style) so the entire list uses a single, uniform ordered-list
prefix format.

In @.ace-review-findings.md:
- Around line 13-185: The markdown headings in the findings block (e.g., lines
starting with "### PERF-1", "### PERF-2", ..., "### CORR-1", etc.) lack the
required blank line(s) before/after them and trigger MD022; edit
.ace-review-findings.md to ensure each heading has an empty line above and below
(normalize to one blank line surrounding each top-level and subsection heading)
so all "### ..." headings are separated by a blank line from adjacent paragraphs
or other headings and the file passes markdownlint MD022.

In `@docs/parity-findings.md`:
- Line 34: Tighten the inline code span around `__libc_start_main` and the
adjacent `[unknown]` so there are no internal leading/trailing spaces inside the
backticks (remove any extra spaces between the backticks and the text) to
satisfy markdownlint MD038; update the text fragment containing
`__libc_start_main` and `[unknown]` accordingly.

In `@scripts/oracle/record-in-container.sh`:
- Around line 16-17: The find command substitution assigning PERF_BIN can fail
under set -euo pipefail and cause the script to exit before the subsequent guard
runs; change the assignment to make the command safe (e.g., append || true or
wrap the pipeline in a brace with || true) so PERF_BIN="$(find
/usr/lib/linux-tools* -name perf -type f 2>/dev/null | head -n 1 || true)" (or
equivalent) and keep the existing guard [ -n "$PERF_BIN" ] || { echo "no perf
binary found" >&2; exit 1; } to handle the missing binary.

In `@src/cargo_cli.rs`:
- Around line 627-637: When checking whether the package manifest is inside
crate_root, canonicalize the crate_root the same way you canonicalize
manifest_path so symlinked paths compare correctly; compute a
canonical_crate_root from crate_root (e.g.,
crate_root.canonicalize().as_deref().unwrap_or(&crate_root) or equivalent) and
use canonical_crate_root in the starts_with check instead of the raw crate_root
(references: crate_root, package_metadata.manifest_path, manifest_path,
find_crate_root).

In `@src/perfdata/build_id.rs`:
- Around line 145-153: Update the documentation for parse_build_id_event(s) to
reflect that these functions intentionally accept perf-style feature records
with header.type == 0 and do not error on a non-build-id record type; describe
that framing is based on the size field (read via read_u16 at offset 4) and
PERF_RECORD_MISC_BUILD_ID_SIZE is used to obtain build_id_size (offset into the
build_id area), so callers should not expect a type validation error anymore and
should rely on the size-based parsing semantics implemented in
parse_build_id_event and parse_build_id_events.

In `@src/perfdata/fold.rs`:
- Around line 3061-3063: The write call is emitting the raw path into perf
script output, skipping normalization for kernel DSO names; update the places
that write mapped-unknown-symbol frames (the write!(writer,
"{prefix}{address:16x} {label} ({path})") usages around the
mapped-unknown-symbol handling) to call perf_script_dso_name(path) first (as
done earlier with let path = perf_script_dso_name(path)), ensuring every place
that formats "({path})" uses the normalized path returned by
perf_script_dso_name so kernel mappings like `[kernel.kallsyms]_...` are
normalized consistently.

In `@src/perfdata/unwind.rs`:
- Around line 485-508: The match currently swallows architecture/register
mismatches in the unwind loop (the `_ => {}` arm), hiding contract drift between
the PerfUnwinder variant and the PerfUserRegs variant; replace that silent no-op
with an explicit failure signal (e.g., panic!, unreachable!, or a debug_assert!)
that includes the mismatched values so it is diagnosable — update the catch-all
arm in the match over (self, regs) to call something like
unreachable!("arch/register mismatch: unwinder={:?} regs={:?}", self, regs) or
debug_assert!(false, ...) (ensure PerfUserRegs and the unwinder type implement
Debug or format a descriptive message) so mismatches in the code paths (the
branches using unwinder.iter_frames, regs.to_framehop_regs,
push_perf_unwind_address) are surfaced instead of returning empty frames.
- Around line 410-419: The memo in has_unwind_info_for_ip currently inserts
every queried IP into unwind_info_memo which can grow unbounded; change
has_unwind_info_for_ip to bound the cache by checking
unwind_info_memo.borrow().len() (or a tracked count) against a small
MAX_UNWIND_MEMO_ENTRIES constant and, when the limit is reached, either clear
the map (or skip inserting new entries) before inserting the result; keep the
existing lookup logic and still clear the memo when reported_modules changes,
but ensure you reference has_unwind_info_for_ip, unwind_info_memo, and
reported_modules so the cache trimming/reset is applied just prior to the insert
to prevent unbounded growth.

In `@src/platform.rs`:
- Around line 66-74: The current read_dir pipeline for building tids silently
drops errors via filter_map(Result::ok); update it to either log each dropped
error or propagate the first encountered error so callers can distinguish an
empty dir from read failures: modify the iterator that uses read_dir(&task_dir)?
and filter_map(Result::ok) so errors are handled explicitly (e.g., replace
filter_map(Result::ok) with a map that matches Result::Err(e) and either calls
log::warn! with context including task_dir and e before returning None, or
returns Err(e) to propagate using a try/collect pattern), keeping the rest of
the file-name-to-u32 parsing and the tids Vec<u32> collection unchanged.

---

Outside diff comments:
In `@src/benchmarks.rs`:
- Around line 369-385: The inline flag is only passed to run_pyroclast_stream
causing Inferno to still serialize perf-script input with
benchmark_fold_options(false); update run_streaming_comparison_with_symbols to
forward the inline boolean into run_inferno_stream (in the scope.spawn call) and
ensure the Inferno producer uses benchmark_fold_options(true) when inline is
true so both producers use the same frame-model; locate the
run_streaming_comparison_with_symbols function and adjust the call to
run_inferno_stream and any benchmark_fold_options usage to accept and respect
the inline parameter (also apply the same change to the other occurrence
mentioned).
- Around line 187-208: The non-symbol branch in run_fold_benchmark_with_runner
ignores the inline parameter by calling run_fold_benchmark(input) which uses a
hardcoded benchmark_fold_options(false); change the else branch to call
run_fold_benchmark_with_writer and invoke write_folded_perfdata_file (or the
equivalent non-symbol writer function) passing benchmark_fold_options(inline) so
inline expansion is honored. Update run_fold_benchmark_with_runner to use
run_fold_benchmark_with_writer with a closure that calls
write_folded_perfdata_file(input, benchmark_fold_options(inline), writer)
instead of run_fold_benchmark(input).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 60950ccd-1830-4324-b7b2-6f0eaa7f4a0c

📥 Commits

Reviewing files that changed from the base of the PR and between dee5c3b and b940ddc.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .ace-aarch64-unwind-spec.md
  • .ace-research-perf-unwind.md
  • .ace-review-findings.md
  • .beads/issues.jsonl
  • .gitignore
  • Cargo.toml
  • docs/parity-findings.md
  • scripts/oracle-compare
  • scripts/oracle/Dockerfile
  • scripts/oracle/compare-in-container.sh
  • scripts/oracle/record-in-container.sh
  • scripts/oracle/run-in-container.sh
  • scripts/oracle/workload.rs
  • scripts/perf-oracle
  • src/backends/linux_perf.rs
  • src/benchmarks.rs
  • src/cargo_cli.rs
  • src/cli.rs
  • src/lib.rs
  • src/perfdata/build_id.rs
  • src/perfdata/fold.rs
  • src/perfdata/header.rs
  • src/perfdata/mappings.rs
  • src/perfdata/unwind.rs
  • src/platform.rs
  • src/symbols.rs
  • tests/benchmarks.rs
  • tests/perfdata_build_id.rs
  • tests/perfdata_fold.rs
  • tests/perfdata_header.rs
  • tests/perfdata_mappings.rs
  • tests/perfdata_unwind.rs
  • tests/run_cli.rs
  • tests/symbols.rs
💤 Files with no reviewable changes (1)
  • Cargo.toml
👮 Files not reviewed due to content moderation or server errors (8)
  • src/perfdata/header.rs
  • tests/perfdata_header.rs
  • src/perfdata/mappings.rs
  • src/symbols.rs
  • scripts/oracle-compare
  • scripts/oracle/Dockerfile
  • scripts/oracle/compare-in-container.sh
  • scripts/oracle/workload.rs

Comment on lines +37 to +38
`header.rs`). Add `header_arch_from_file` reading HEADER_ARCH (feature bit 5;
perf string format: u32 len + bytes, see perf util/header.c write_arch/read).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct the HEADER_ARCH feature-bit value in the spec.

Line 37 says HEADER_ARCH is feature bit 5, but the implementation uses bit 6; this mismatch will mislead future maintenance/debugging.

📝 Suggested doc fix
-`header.rs`). Add `header_arch_from_file` reading HEADER_ARCH (feature bit 5;
+`header.rs`). Add `header_arch_from_file` reading HEADER_ARCH (feature bit 6;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
`header.rs`). Add `header_arch_from_file` reading HEADER_ARCH (feature bit 5;
perf string format: u32 len + bytes, see perf util/header.c write_arch/read).
`header.rs`). Add `header_arch_from_file` reading HEADER_ARCH (feature bit 6;
perf string format: u32 len + bytes, see perf util/header.c write_arch/read).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.ace-aarch64-unwind-spec.md around lines 37 - 38, The spec incorrectly
documents HEADER_ARCH as feature bit 5; update the documentation to state
HEADER_ARCH is feature bit 6 to match the implementation used by
header_arch_from_file and header.rs (leave the perf string format note intact:
u32 len + bytes, see perf util/header.c write_arch/read). Ensure any mention of
"feature bit 5" is replaced with "feature bit 6" so the doc and the code remain
consistent.

Comment on lines +159 to +163
```c
if (symbol_conf.hide_unresolved && entry->ms.sym == NULL) return 0;
if (append_inlines(cursor, &entry->ms, entry->ip, ...) == 0) return 0;
... callchain_cursor_append(cursor, entry->ip, &entry->ms, ..., srcline);
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve markdownlint violations in this doc block.

Line 159/Line 163 trip MD031 (fenced blocks need surrounding blank lines), and Line 347/Line 357 trip MD029 (ordered-list prefix style). These are quick fixes and remove lint noise.

Also applies to: 347-357

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 159-159: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 163-163: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.ace-research-perf-unwind.md around lines 159 - 163, The fenced code blocks
around the snippet containing symbol_conf.hide_unresolved, entry->ms.sym,
append_inlines and callchain_cursor_append need blank lines before and after to
satisfy MD031; add an empty line immediately above the opening ``` and
immediately below the closing ``` for the blocks at the cited locations. Also
fix the ordered-list prefix style violations (MD029) in the block covering lines
347–357 by making the list prefixes consistent (e.g., convert all list markers
to numeric "1." or to the chosen style) so the entire list uses a single,
uniform ordered-list prefix format.

Source: Linters/SAST tools

Comment thread .ace-review-findings.md
Comment on lines +13 to +185
### PERF-1 (CRITICAL) — DWARF is fully re-parsed for every object on every FINISHED_ROUND and every prefetch batch
- **Where:** `src/symbols.rs:1718` and `src/symbols.rs:1864` (`PerfDwarfNameResolver::from_object_bytes_for_addresses`), defined at `src/symbols.rs:2412-2455`. Driven from `src/perfdata/fold.rs:745` (`drain_fold_counts` per `PERF_RECORD_FINISHED_ROUND`) → `accumulate_fold_counts` (`fold.rs:1921`) → `prefetch_symbols` (`fold.rs:1882`) → `prefetch_mapping_refs` → `resolve_frame_batch_with_metadata`.
- **What's wrong:** `CachedObjectMetadata` (`symbols.rs:265`) caches `object_symbols` and `debug_names`, but **not** the parsed DWARF (`PerfDwarfNameResolver`). Every call to `resolve_frame_batch_with_metadata` rebuilds the entire DWARF index from the object bytes: `object::File::parse` + `gimli::DwarfSections::load` (decompresses every debug section) + iterates **every** compilation unit and builds DIE trees (`perf_dwarf_unit_roots`, `symbols.rs:2445`). The `addresses` filter only skips *building roots* for non-matching units; it still parses all sections and walks all unit headers each time.
- **Why it matters:** `drain_fold_counts` runs once per `FINISHED_ROUND` — perf emits many rounds per capture. Each round re-parses the full DWARF of every DSO touched that round (libc, the main binary, etc.). A profile with R rounds and a hot DSO re-parses that DSO's DWARF ~R times. This is the dominant cost and the direct cause of the rc=124 timeout once synthetic fallbacks (which skipped DWARF) were removed. For a glibc-sized binary a single DWARF parse is tens of ms; multiplied by rounds × objects this is seconds-to-minutes.
- **Fix:** Cache the parsed `PerfDwarfNameResolver` (address-agnostic, i.e. `from_object_bytes_matching_addresses(bytes, None)`) inside `CachedObjectMetadata`, behind the same per-path `Mutex`/`Arc` as the rest of the metadata, and reuse it across all rounds and batches. `frame_names_for_base_symbol` already does its own per-address range filtering, so a once-built full index is correct. Alternatively drain symbols once at the end instead of per round (but the per-object cache is the real fix and also helps the final drain).
- **Confidence:** high.

### PERF-2 (CRITICAL) — Mapping insert is O(n) per record and rebuilds all per-pid indexes unconditionally → O(n²)+ over the record stream
- **Where:** `src/perfdata/mappings.rs:183-217` (`insert_mapping` → `remove_overlapping_mappings_like_perf`), with `rebuild_pid_indexes` at `mappings.rs:261-310`. Called from `fold.rs:1418/1440/1450` for every MMAP/MMAP2/MMAP2-build-id record.
- **What's wrong:** `remove_overlapping_mappings_like_perf` `std::mem::take`s the whole `mappings` vector and iterates **all** of it (line 194) on every insert, then **unconditionally** calls `rebuild_pid_indexes()` (line 215) — which clears and rebuilds every pid's interval index from scratch. `rebuild_pid_indexes` itself uses `bucket.insert(position, …)` in a loop (line 282), which is O(m) per insert → O(m²) for a bucket of m mappings. So each mmap record is at least O(total mappings), and the index rebuild is quadratic in the largest pid's mapping count. There is no early-out for the common no-overlap case.
- **Why it matters:** mmap/mmap2 records occur on every shared-library load and every fork+exec. For traces with tens of thousands of mappings this is the dominant *parse-side* cost and produces large transient reallocations every record. Compounds PERF-3 and PERF-4.
- **Fix:** Test for overlap first using the existing per-pid interval index (`mappings_by_pid`); if nothing overlaps, do a pure incremental `insert_mapping_without_overlap_fix` and skip the take + global rebuild entirely. Only on an actual split should you repair the affected pid bucket (not all pids).
- **Confidence:** high.

### PERF-3 (HIGH) — `has_overlapping_user_mapping_for_pid` linear-scans ALL mappings on every mmap record
- **Where:** `src/perfdata/mappings.rs:360-373`; called from `fold.rs:1413/1430/1445` via `invalidate_pid_unwinder_if_mapping_overlaps_like_perf` (`fold.rs:1485-1500`) for every MMAP/MMAP2 record.
- **What's wrong:** `self.mappings.iter().any(...)` scans the entire global mapping table (all pids) and filters by pid in the closure. This runs once per mmap record, independent of and additional to PERF-2's quadratic insert. O(N) per record × N records = O(N²).
- **Why it matters:** Doubles down on the mmap-record quadratic. The per-pid interval index already exists and answers this in O(log n + matches).
- **Fix:** Route through `mappings_by_pid.get(&pid)` and use the `partition_point` / `max_end` interval search (the same pattern `resolve_mapping_index_for_pid` already uses).
- **Confidence:** high.

### PERF-4 (HIGH) — Per-sample full stack unwind is re-run up to 9× via the module-report retry loop
- **Where:** `src/perfdata/fold.rs:3160-3178` (`unwind_object_frame_addresses_like_perf`), loop bound `MAX_LIBDW_CALLBACK_REPORT_PASSES = 8` (`fold.rs:123`).
- **What's wrong:** For each sample with a user stack, `unwind_user_stack_with_diagnostics` (the framehop unwind over the whole captured stack) is called once, then re-called from scratch on each pass that reports new modules, comparing `next_unwind == object_unwind` (full `Vec<u64>` + diagnostics equality) to decide convergence. Worst case 1 + 8 = 9 complete re-unwinds per sample, each walking the entire stack image again. The good news: the framehop `Unwinder` + `Cache` are correctly reused across samples (held in `PidUnwindState.object_unwinder`), and the stack bytes are borrowed (`perf_effective_user_stack_bytes`, `fold.rs:3381`, correctly truncates to `dynamic_size`). The cost is the repeated full unwinds, not cache rebuilds.
- **Why it matters:** Multiplies the per-sample unwind cost (already the heaviest per-sample work) by up to 9 for samples that touch not-yet-reported modules — typical early in a process's life or for deep stacks crossing many DSOs. Contributes materially to the timeout.
- **Fix:** Report all modules for the IPs *before* unwinding when possible, or incrementally extend the existing unwind from the first unresolved frame rather than restarting. At minimum, break as soon as a pass reports zero *new* modules (the current code already does, but it still re-unwinds once more to detect equality — cache the "did we add a module" boolean from `report_unwind_modules_for_frame_callbacks_like_perf` and skip the final redundant unwind when it returned false).
- **Confidence:** medium-high (the re-unwind loop is real; exact multiplier depends on trace).

### PERF-5 (HIGH) — `normalize_ranges` is O(n²) over FDE ranges at module-add time
- **Where:** `src/perfdata/unwind.rs:476-487` — `ranges.remove(index + 1)` inside a `while` loop.
- **What's wrong:** `Vec::remove` is O(n); doing it in the merge loop makes range normalization O(n²) in the FDE count. Large DSOs (libc, the main binary) have tens of thousands of FDEs.
- **Why it matters:** Per-module startup cost; seconds for large binaries, paid once per loaded unwind module. Not per-sample, but adds to total fold time.
- **Fix:** Single forward pass writing merged ranges into a new `Vec` (or in-place with a write index), instead of `remove` per merge.
- **Confidence:** high.

### PERF-6 (HIGH) — `read_reported_module_u64` / `has_unwind_info_for_ip` are linear scans over all module segments / FDE ranges
- **Where:** `src/perfdata/unwind.rs:508-518` (`read_reported_module_u64`, the module-memory fallback closure at `unwind.rs:281-283`), and `unwind.rs:230-248` (`has_reported_module_for_ip`, `has_unwind_info_for_ip`), plus the `rejected_mapping_ranges` scan at `unwind.rs:273-277`.
- **What's wrong:** Each out-of-stack memory read framehop requests (CFA expression eval, return-address fetch, GOT/PLT/data) does `modules.iter().flat_map(segments).find_map(...)` — O(modules × segments) per read. `has_unwind_info_for_ip` nests `any` over `unwind_ranges` (thousands of FDE ranges per module). framehop issues many reads per frame and there can be hundreds of mapped DSOs.
- **Why it matters:** O(samples × frames × reads × modules×segments). For a process with many DSOs this can dwarf the actual unwind math.
- **Fix:** Build a single sorted `Vec<(addr_range, module/segment idx)>` (or interval tree) once at module-add time and binary-search. The unwind ranges already pass through `normalize_ranges` (sorted) — search them rather than linear-scan.
- **Confidence:** medium-high.

### PERF-7 (MEDIUM) — Per-sample full payload copy of every SAMPLE record via `to_vec()`
- **Where:** `src/perfdata/records.rs:677-682` (`parse_sample_payload_record`), reached from `fold.rs:756` (`parse_record_with_context`). Also `parse_read_record` (`records.rs:661`), `parse_aux_record` (`records.rs:698`), switch/namespaces.
- **What's wrong:** `SamplePayloadRecord.payload` is `Vec<u8>` built with `payload.to_vec()`. The fold path *does* need an owned copy for **queued** (deferred-by-time) records, because the reused `payload` buffer in `write_folded_perfdata_from_file` (`fold.rs:711,738`) is overwritten on the next iteration. But records that are applied immediately (`apply_or_queue` with `time == None`, `fold.rs:1120-1125`) still pay the full heap copy + memcpy of the largest record type.
- **Why it matters:** One heap alloc + full payload memcpy per sample. With millions of samples this is a top-3 allocation cost on the parse side.
- **Fix:** For the immediate-apply path, decode the sample directly from the borrowed `&payload` slice without `to_vec()`. Only copy when actually queuing. (Requires splitting the apply path so the borrowed case never constructs an owned `ParsedRecord::Sample`.)
- **Confidence:** medium-high (copy is real; some copies are genuinely required for the time-ordered queue).

### PERF-8 (MEDIUM) — `OrderedRecordQueue` re-sorts the entire pending buffer on every flush round
- **Where:** `src/perfdata/fold.rs:1191-1192` (`flush_through_with`), called from `flush_round` each `FINISHED_ROUND`.
- **What's wrong:** `pending_records.sort_by_key((time, index))` sorts the whole remaining buffer every round, even the tail that stays queued past the flush limit. Records are already nearly time-ordered, but `sort_by_key` is a full O(k log k) each round.
- **Why it matters:** With many rounds and a large in-flight window this is repeated near-sorted sorting. Moderate.
- **Fix:** Use a min-heap keyed by `(time, index)` and pop up to the limit, or only sort the newly-added suffix and merge. At minimum `sort_unstable_by_key`.
- **Confidence:** medium.

### PERF-9 (MEDIUM) — `ranked_*` in analysis format every distinct key to a String before truncating to `limit`
- **Where:** `src/perfdata/analysis.rs:205-264` (`ranked_threads`/`ranked_ips`/`ranked_edges`), `format_ip` at `analysis.rs:268`.
- **What's wrong:** Builds the full Vec including `format!("tid {tid}")` and `format!("0x{ip:016x}")` for every distinct key, then sorts, then `truncate(limit)`. Edge/IP cardinality can be hundreds of thousands; only `limit` survive.
- **Why it matters:** O(distinct) string allocations discarded. This is the `parse perf summary` path, not the fold path, so lower priority, but wasteful.
- **Fix:** `select_nth_unstable_by` on the numeric weight first, then format only the surviving `limit`.
- **Confidence:** high.

### PERF-10 (MEDIUM) — `analyze_perfdata` allocates a fresh `frames` Vec per sample
- **Where:** `src/perfdata/analysis.rs:107-112` — `.collect::<Vec<_>>()` per sample.
- **What's wrong:** One heap alloc/free per sample in the summary loop.
- **Fix:** Hoist a reusable `Vec<u64>` and `clear()` per iteration.
- **Confidence:** high (summary path only).

### PERF-11 (LOW) — `object_metadata` reads the entire DSO into the heap (`std::fs::read`) rather than mmap
- **Where:** `src/symbols.rs:515` and `src/symbols.rs:579`.
- **What's wrong:** `std::fs::read(path)` loads the whole object file into an owned `Arc<[u8]>`. For large binaries with many DSOs this is significant resident memory (though cached once per path, which is correct).
- **Why it matters:** Memory footprint, not CPU. Bounded by distinct DSO count. Low.
- **Fix:** `memmap2` the object read-only and parse from the mmap (note the SIGBUS-on-truncation tradeoff, ROB-2).
- **Confidence:** low-medium.

### PERF-12 (LOW) — Per-sample `PathBuf`/String/Vec clones in the unwind setup
- **Where:** `src/perfdata/fold.rs:3121` (`unwind_debug_dir.clone()` per sample), `fold.rs:3291-3292` (`mapping.path.to_string()` + `build_id.to_vec()` per module-report attempt), `fold.rs:3117-3131` (`unwind_states.remove` + reinsert per sample instead of `entry()`).
- **What's wrong:** A `PathBuf` clone per unwound sample; path/build-id clones each report attempt; HashMap remove+insert churn per sample.
- **Why it matters:** Minor constant per-sample overhead; the path-report clones only fire on cache-miss module loads. Low.
- **Fix:** Pass `unwind_debug_dir` as `Option<&Path>` (already `Option<&Path>` downstream); use `entry(pid)` instead of remove+reinsert; defer path/build-id owning to the miss path.
- **Confidence:** medium (clearly wasteful, low total impact).

### Verified NON-issues (do not re-investigate)
- The framehop `Unwinder` and `Cache` are correctly created once per pid and reused across samples (`PidUnwindState.object_unwinder`); the standalone `unwind_x86_64_stack` free function (`unwind.rs:84`) that rebuilds a `Cache` per call is a test/utility shim, not on the fold path.
- The captured user stack is borrowed (not copied) per sample and correctly truncated to `dynamic_size` (`fold.rs:3381`).
- `SymbolFrameCache` correctly memoizes per `(symbol_source_id, relative_address)` across rounds and batches misses (`symbols.rs:162-171, 1171-1231`); the remaining cost is the DWARF re-parse (PERF-1), not the cache.
- `MappingResolveCache` is `Copy`/stack-only (`mappings.rs:42`), so per-sample creation is free.
- `FoldCounts` coalescing uses an FxHash-bucketed dedup with arena storage (`fold.rs:283-364`) and sorts once at the end (`fold.rs:1962`) — no per-sample sort.

---

## CORRECTNESS

### CORR-1 (HIGH) — `file_matches_recorded_identity` compares inode only, ignoring device (major/minor) and generation
- **Where:** `src/perfdata/mappings.rs:59-61`.
- **What's wrong:** `metadata.ino() == identity.inode` ignores `identity.major`, `identity.minor`, `inode_generation`, which `FileIdentity` (`mappings.rs:49-55`) carries precisely to disambiguate. Inode numbers are unique only per filesystem; two files on different mounts can share an inode.
- **Why it matters:** A wrong on-disk file can be accepted as matching a recorded mapping → symbols from the wrong binary. Silent mis-symbolization.
- **Fix:** Also compare device (`metadata.dev()` decomposed via `libc::major/minor`) and generation where available.
- **Confidence:** high.

### CORR-2 (MEDIUM) — Same physical file gets two `symbol_source_id`s depending on which mmap record form delivered it
- **Where:** `src/perfdata/mappings.rs:146-158` (`insert_mmap2_build_id` sets `file_identity: None`) vs `mappings.rs:127` (`insert_mmap2_with_build_id` keeps it); `symbol_source_key` at `mappings.rs:599-610`.
- **What's wrong:** The dedup key mixes `(path, build_id, file_identity, …)`. An inline-build-id MMAP2 keys as `(path, Some(build_id), None)` while the same file arriving as plain MMAP2 + HEADER_BUILD_ID keys as `(path, Some(build_id), Some(identity))`. Different keys → two `symbol_source_id`s for one file.
- **Why it matters:** Defeats the per-mapping symbol cache dedup (`SymbolFrameCache` is keyed on `symbol_source_id`), so the same DSO's DWARF gets resolved twice — amplifies PERF-1. Also splits fold counts if frame labels differ.
- **Fix:** When a build_id is present, key the symbol source on `(path, build_id)` and ignore `file_identity`; only fall back to `file_identity` when no build_id exists.
- **Confidence:** medium.

### CORR-3 (MEDIUM) — `analysis.rs` thread keying/labeling mismatch: keyed by `tid.or(pid)`, labeled from a tid→comm map
- **Where:** `src/perfdata/analysis.rs:104` (key = `sample.tid.or(sample.pid)`), `analysis.rs:140` + `200-216` (param named `comms_by_pid` but caller passes `comms_by_tid`, lookup at `analysis.rs:209`).
- **What's wrong:** Bucket key falls back to `pid` when `tid` is absent, but the comm lookup map is keyed by tid. A pid-only sample is counted under its pid value and may pick up a different thread's comm (if some tid equals that pid value) or fall back to `tid {n}`.
- **Why it matters:** Mislabeled/merged thread rows in `parse perf summary`. Report-correctness, not fold-correctness.
- **Fix:** Pick one key (tid) and only bucket samples that have it, or carry both pid+tid and look up comm with the matching key. Rename the misleading `comms_by_pid` param.
- **Confidence:** medium.

### CORR-4 (MEDIUM) — `build_id.rs::filename` hard-fails on non-UTF-8 paths while the rest of the parser uses lossy UTF-8
- **Where:** `src/perfdata/build_id.rs:198-206` (returns `Err` on non-UTF-8) vs `src/perfdata/records.rs:1018-1024` (`from_utf8_lossy`).
- **What's wrong:** Linux paths are bytes, not UTF-8. A legal perf.data with a non-UTF-8 mapped path parses fine as a normal MMAP path but causes `build_id_events_from_perfdata` (and thus kernel build-id lookup) to error out entirely.
- **Why it matters:** Build-id-based symbol/unwind resolution silently disabled for the whole file on an edge-case path. Robustness regression.
- **Fix:** Use `from_utf8_lossy` for the build-id filename too (or store bytes), matching `parse_c_string`.
- **Confidence:** medium.

### CORR-5 (LOW/MEDIUM) — Build-id feature section: trailing padding turns into a hard parse error
- **Where:** `src/perfdata/build_id.rs:36-37` — loop guard `offset < payload.len()` admits a 1–7 byte tail; `read_u16(payload, offset+6)` then errors on a partial trailing record.
- **What's wrong:** perf feature payloads are not guaranteed record-aligned at the tail; benign padding aborts the entire build-id extraction (returns `Err`, not "end of section").
- **Why it matters:** Same downstream effect as CORR-4 — build-id resolution disabled — for files with trailing padding.
- **Fix:** Require `offset + 8 <= payload.len()` to read a header; treat a smaller remainder as end-of-section, not error.
- **Confidence:** medium.

### CORR-6 (LOW) — Endianness is assumed little-endian throughout; the file's recorded byte order is not consulted
- **Where:** `src/perfdata/endian.rs:6-44` (`read_u16/u32/u64` are hard-coded `from_le_bytes`); header magic check at `src/perfdata/header.rs:31` only accepts `b"PERFILE2"` (LE), never the byte-swapped magic.
- **What's wrong:** A big-endian-recorded perf.data (byte-swapped `PERFILE2`) is rejected at the magic check rather than misparsed — so this is *safe* today (clean rejection), but cross-endian capture is simply unsupported and there is no byte-order plumbing. Flagging because the review brief called it out: it is a latent landmine if anyone adds support for the swapped magic without threading endianness through `read_u*`.
- **Why it matters:** No live bug (rejected cleanly). Future-proofing only.
- **Fix:** Either document LE-only explicitly, or detect endianness from the magic and thread it through the `read_u*` helpers.
- **Confidence:** high (that it's currently safe); low (impact).

### CORR-7 (LOW) — `is_syscall_return_state` silently false-negatives when RCX/R11 weren't captured
- **Where:** `src/perfdata/unwind.rs:615`, populated at `unwind.rs:645,661`; consumed in `build_user_unwind_context` (`fold.rs:3032`).
- **What's wrong:** RCX/R11 are only populated if the perf register mask included bits 2 and 19. With a minimal capture mask (ip/sp/bp only) they read 0 and the function returns false, silently disabling syscall-return special-casing.
- **Why it matters:** Unwind accuracy at syscall boundaries for minimally-captured samples; no crash.
- **Fix:** Track register presence (bitmask or `Option`) so "not captured" is distinguishable from "value is 0".
- **Confidence:** medium.

### Verified-correct (checked, no bug)
- x86_64 perf-register → framehop `Reg` mapping is correct, and ignored perf reg bits (segments/flags, bits 9–15) still advance the value iterator so no desync (`unwind.rs:635-668`). The `1 << register` shift is `u64` by inference, no overflow at bit 63.
- `parse_sample_record` field ordering matches the kernel `PERF_RECORD_SAMPLE` serialization order, not flag order (`samples.rs:134-217`).
- Sample/record length math uses `checked_mul`/`checked_add` and validates payload length before `Vec::with_capacity` (e.g. `records.rs:758-771`, `samples.rs:301-306`), so attacker-controlled counts don't OOM or overflow.
- `Mapping::relative_address` subtraction is protected by the resolver invariant `start <= ip < end` (`mappings.rs:565-571`); safe but fragile (see ROB-1).

---

## ROBUSTNESS (panics / hangs reachable from malformed input)

### ROB-1 (LOW) — `Mapping::relative_address` does unchecked `ip - self.start`
- **Where:** `src/perfdata/mappings.rs:565-571`. Currently safe via the resolver invariant, but every other subtraction in the file uses `saturating_*`. A future caller passing an unresolved mapping underflows (debug panic / release wrap).
- **Fix:** `ip.saturating_sub(self.start)` defensively.
- **Confidence:** low.

### ROB-2 (LOW/MEDIUM) — `unsafe { Mmap::map(&file) }` on profiled DSOs can SIGBUS if the file is truncated/replaced
- **Where:** `src/perfdata/unwind.rs:168-171` (and PERF-11 if symbols.rs switches to mmap).
- **What's wrong:** Inherent mmap hazard: a DSO unlinked/replaced/truncated after mapping faults with SIGBUS on read — uncatchable by `Result`. Profiling targets routinely replace binaries.
- **Why it matters:** Low probability, process-crash severity.
- **Fix:** Accept as a documented tradeoff, or read sections into owned buffers for untrusted paths, or install a SIGBUS handler.
- **Confidence:** medium.

### ROB-3 (LOW) — Unwind loop bounded only by caller-supplied `max_frames`
- **Where:** `src/perfdata/unwind.rs:90-95` / `291-296`; fold callers pass `256` (`fold.rs:3161,3173`), so the live path is bounded. A caller passing `usize::MAX` plus pathological CFI could spin.
- **Fix:** Internal hard cap (e.g. `max_frames.min(1024)`).
- **Confidence:** low (current callers are safe).

---

## Recommended fix order (impact-weighted)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize heading spacing to satisfy MD022 across sections.

Multiple headings in this range are missing required surrounding blank lines per markdownlint (MD022). Please normalize heading spacing throughout this block to keep docs lint clean.

🧰 Tools
🪛 LanguageTool

[grammar] ~21-~21: Ensure spelling is correct
Context: ...fold.rs:1418/1440/1450 for every MMAP/MMAP2/MMAP2-build-id record. - What's wrong: `remove_overlapping_mapp...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[uncategorized] ~34-~34: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...* high. ### PERF-4 (HIGH) — Per-sample full stack unwind is re-run up to 9× via the modul...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[style] ~41-~41: It might be better to use ‘time’ with the time-relative pronoun ‘when’. (Alternatively, use ‘in/on which’.)
Context: ... is O(n²) over FDE ranges at module-add time - Where: src/perfdata/unwind.rs:476-487 — `...

(WHEN_WHERE)

🪛 markdownlint-cli2 (0.22.1)

[warning] 13-13: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 20-20: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 27-27: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 34-34: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 41-41: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 48-48: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 55-55: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 62-62: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 69-69: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 76-76: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 82-82: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 89-89: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 96-96: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 107-107: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 114-114: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 121-121: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 128-128: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 135-135: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 142-142: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 149-149: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 156-156: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 166-166: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 171-171: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 178-178: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 185-185: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.ace-review-findings.md around lines 13 - 185, The markdown headings in the
findings block (e.g., lines starting with "### PERF-1", "### PERF-2", ..., "###
CORR-1", etc.) lack the required blank line(s) before/after them and trigger
MD022; edit .ace-review-findings.md to ensure each heading has an empty line
above and below (normalize to one blank line surrounding each top-level and
subsection heading) so all "### ..." headings are separated by a blank line from
adjacent paragraphs or other headings and the file passes markdownlint MD022.

Source: Linters/SAST tools

Comment thread docs/parity-findings.md
now byte-identical to `perf script --force` on the fp oracle, and the folded output
differs only where `inferno-collapse-perf` itself mis-parses the space-containing
DSO path `/ (deleted)` (it keeps the `+0x9c` offset and a trailing space on
`__libc_start_main` and emits `[unknown] `); pyroclast's folding of those frames is

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix MD038 code-span spacing at this line.

Line 34 includes a code span that markdownlint flags for internal spacing (MD038). Tightening that span formatting will clear the warning.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 34-34: Spaces inside code span elements

(MD038, no-space-in-code)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/parity-findings.md` at line 34, Tighten the inline code span around
`__libc_start_main` and the adjacent `[unknown]` so there are no internal
leading/trailing spaces inside the backticks (remove any extra spaces between
the backticks and the text) to satisfy markdownlint MD038; update the text
fragment containing `__libc_start_main` and `[unknown]` accordingly.

Source: Linters/SAST tools

Comment on lines +16 to +17
PERF_BIN="$(find /usr/lib/linux-tools* -name perf -type f 2>/dev/null | head -n 1)"
[ -n "$PERF_BIN" ] || { echo "no perf binary found" >&2; exit 1; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Verifies current shell behavior: command substitution exits early under pipefail.
if bash -lc 'set -euo pipefail; PERF_BIN="$(find /definitely-missing/linux-tools* -name perf -type f 2>/dev/null | head -n 1)"; echo reached'; then
  echo "unexpected: did not fail"
else
  echo "expected: assignment failed before fallback guard"
fi

Repository: mjc/pyroclast

Length of output: 224


Prevent premature exit during perf binary discovery under set -euo pipefail.

With set -euo pipefail, the PERF_BIN="$(find ... | head -n 1)" command substitution can terminate the script when find fails (e.g., /usr/lib/linux-tools* missing), so the intended Line 17 guard ([ -n "$PERF_BIN" ] ...) may never run.

💡 Suggested fix
 if ! perf version >/dev/null 2>&1; then
-    PERF_BIN="$(find /usr/lib/linux-tools* -name perf -type f 2>/dev/null | head -n 1)"
+    PERF_BIN="$(find /usr/lib/linux-tools* -name perf -type f 2>/dev/null | head -n 1 || true)"
     [ -n "$PERF_BIN" ] || { echo "no perf binary found" >&2; exit 1; }
     perf() { "$PERF_BIN" "$@"; }
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/oracle/record-in-container.sh` around lines 16 - 17, The find command
substitution assigning PERF_BIN can fail under set -euo pipefail and cause the
script to exit before the subsequent guard runs; change the assignment to make
the command safe (e.g., append || true or wrap the pipeline in a brace with ||
true) so PERF_BIN="$(find /usr/lib/linux-tools* -name perf -type f 2>/dev/null |
head -n 1 || true)" (or equivalent) and keep the existing guard [ -n "$PERF_BIN"
] || { echo "no perf binary found" >&2; exit 1; } to handle the missing binary.

Comment thread src/perfdata/build_id.rs
Comment on lines 145 to +153
fn parse_build_id_event(record: &[u8]) -> Result<BuildIdEvent, String> {
let record_type = read_u32(record, 0)?;
if record_type != PERF_RECORD_HEADER_BUILD_ID {
return Err(format!(
"expected PERF_RECORD_HEADER_BUILD_ID, got {record_type}"
));
}
// tools/perf/util/build-id.c write_buildid() emits these feature-section
// records WITHOUT setting header.type (it leaves it 0) and sets
// PERF_RECORD_MISC_BUILD_ID_SIZE in misc with the real build-id length in
// the size byte at offset 20 of the 24-byte build_id field (record offset
// 32). Do not gate on the record type; rely on the size field (offset 6)
// for record framing, as perf_header__read_build_ids does.
let misc = read_u16(record, 4)?;
let build_id_size = if misc & PERF_RECORD_MISC_BUILD_ID_SIZE != 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the documented error contract after removing type validation.

parse_build_id_event now intentionally accepts perf-style feature records with header.type == 0, but the parse_build_id_events docs still state it errors on wrong record type. Please align docs with current behavior.

Suggested diff
-/// Returns an error when a record is truncated, has the wrong type, or carries
-/// non-UTF-8 filename data.
+/// Returns an error when a record is truncated or carries non-UTF-8 filename
+/// data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/perfdata/build_id.rs` around lines 145 - 153, Update the documentation
for parse_build_id_event(s) to reflect that these functions intentionally accept
perf-style feature records with header.type == 0 and do not error on a
non-build-id record type; describe that framing is based on the size field (read
via read_u16 at offset 4) and PERF_RECORD_MISC_BUILD_ID_SIZE is used to obtain
build_id_size (offset into the build_id area), so callers should not expect a
type validation error anymore and should rely on the size-based parsing
semantics implemented in parse_build_id_event and parse_build_id_events.

Comment thread src/perfdata/fold.rs
Comment on lines +3061 to 3063
let path = perf_script_dso_name(path);
write!(writer, "{prefix}{address:16x} {label} ({path})")
.map_err(|error| format!("failed to write perf script output: {error}"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize kernel DSO names in mapped-unknown-symbol frames too.

Line 3089 writes ({path}) directly, so kernel mappings such as [kernel.kallsyms]_... bypass perf_script_dso_name and can diverge from perf-script output on unsymbolized paths.

🐛 Proposed fix
 fn write_perf_script_mapped_unknown_symbol_frame_fragment<W>(
     writer: &mut W,
     prefix: &str,
     address: u64,
     path: &str,
 ) -> Result<(), String>
 where
     W: IoWrite + ?Sized,
 {
-    write!(writer, "{prefix}{address:16x} {UNKNOWN_FRAME} ({path})")
+    let path = perf_script_dso_name(path);
+    write!(writer, "{prefix}{address:16x} {UNKNOWN_FRAME} ({path})")
         .map_err(|error| format!("failed to write perf script output: {error}"))
 }

Also applies to: 3080-3090

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/perfdata/fold.rs` around lines 3061 - 3063, The write call is emitting
the raw path into perf script output, skipping normalization for kernel DSO
names; update the places that write mapped-unknown-symbol frames (the
write!(writer, "{prefix}{address:16x} {label} ({path})") usages around the
mapped-unknown-symbol handling) to call perf_script_dso_name(path) first (as
done earlier with let path = perf_script_dso_name(path)), ensuring every place
that formats "({path})" uses the normalized path returned by
perf_script_dso_name so kernel mappings like `[kernel.kallsyms]_...` are
normalized consistently.

Comment thread src/perfdata/unwind.rs
Comment on lines 410 to +419
pub fn has_unwind_info_for_ip(&self, ip: u64) -> bool {
self.reported_modules
if let Some(&cached) = self.unwind_info_memo.borrow().get(&ip) {
return cached;
}
let covered = self
.reported_modules
.iter()
.any(|module| module.unwind_ranges.iter().any(|range| range.contains(&ip)))
.any(|module| module.unwind_ranges.iter().any(|range| range.contains(&ip)));
self.unwind_info_memo.borrow_mut().insert(ip, covered);
covered

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the exact-IP unwind memo to prevent unbounded growth.

Line 418 inserts every queried IP into unwind_info_memo, and the map is only cleared when a new module is added (Lines 383-387). In steady-state sampling with high IP cardinality, this can grow indefinitely and drive avoidable memory pressure.

💡 Suggested guard (bounded cache with simple reset)
+const UNWIND_INFO_MEMO_MAX_ENTRIES: usize = 200_000;
+
 pub fn has_unwind_info_for_ip(&self, ip: u64) -> bool {
     if let Some(&cached) = self.unwind_info_memo.borrow().get(&ip) {
         return cached;
     }
     let covered = self
         .reported_modules
         .iter()
         .any(|module| module.unwind_ranges.iter().any(|range| range.contains(&ip)));
-    self.unwind_info_memo.borrow_mut().insert(ip, covered);
+    let mut memo = self.unwind_info_memo.borrow_mut();
+    if memo.len() >= UNWIND_INFO_MEMO_MAX_ENTRIES {
+        memo.clear();
+    }
+    memo.insert(ip, covered);
     covered
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/perfdata/unwind.rs` around lines 410 - 419, The memo in
has_unwind_info_for_ip currently inserts every queried IP into unwind_info_memo
which can grow unbounded; change has_unwind_info_for_ip to bound the cache by
checking unwind_info_memo.borrow().len() (or a tracked count) against a small
MAX_UNWIND_MEMO_ENTRIES constant and, when the limit is reached, either clear
the map (or skip inserting new entries) before inserting the result; keep the
existing lookup logic and still clear the memo when reported_modules changes,
but ensure you reference has_unwind_info_for_ip, unwind_info_memo, and
reported_modules so the cache trimming/reset is applied just prior to the insert
to prevent unbounded growth.

Comment thread src/perfdata/unwind.rs
Comment on lines +485 to +508
// The seeded register file must match the active arch; a mismatch means
// the file header arch and the regs decode disagreed, which cannot
// happen because both flow from the same PerfArch.
match (self, regs) {
(Self::X86_64 { unwinder, cache }, PerfUserRegs::X86_64(regs)) => {
let mut iter = unwinder.iter_frames(ip, regs.to_framehop_regs(), cache, read_stack);
while frames.len() < max_frames {
let Ok(Some(frame)) = iter.next() else {
break;
};
push_perf_unwind_address(&mut frames, frame.address());
}
}
(Self::Aarch64 { unwinder, cache }, PerfUserRegs::Aarch64(regs)) => {
let mut iter = unwinder.iter_frames(ip, regs.to_framehop_regs(), cache, read_stack);
while frames.len() < max_frames {
let Ok(Some(frame)) = iter.next() else {
break;
};
push_perf_unwind_address(&mut frames, frame.address());
}
}
_ => {}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Make arch/register mismatches explicit instead of silently returning empty frames.

Line 507 currently swallows mismatches (_ => {}), which can look like a legitimate empty unwind and hide upstream state/contract drift. Prefer an explicit assertion/fallback signal so this failure mode is diagnosable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/perfdata/unwind.rs` around lines 485 - 508, The match currently swallows
architecture/register mismatches in the unwind loop (the `_ => {}` arm), hiding
contract drift between the PerfUnwinder variant and the PerfUserRegs variant;
replace that silent no-op with an explicit failure signal (e.g., panic!,
unreachable!, or a debug_assert!) that includes the mismatched values so it is
diagnosable — update the catch-all arm in the match over (self, regs) to call
something like unreachable!("arch/register mismatch: unwinder={:?} regs={:?}",
self, regs) or debug_assert!(false, ...) (ensure PerfUserRegs and the unwinder
type implement Debug or format a descriptive message) so mismatches in the code
paths (the branches using unwinder.iter_frames, regs.to_framehop_regs,
push_perf_unwind_address) are surfaced instead of returning empty frames.

Comment thread src/platform.rs
Comment on lines +66 to +74
let mut tids = std::fs::read_dir(&task_dir)?
.filter_map(Result::ok)
.filter_map(|entry| {
entry
.file_name()
.to_str()
.and_then(|name| name.parse().ok())
})
.collect::<Vec<u32>>();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider logging or propagating read_dir iterator errors.

Line 67's filter_map(Result::ok) silently drops any errors that occur while reading individual directory entries (e.g., permission denied on a specific task entry, I/O errors). While this pattern is common for robust directory traversal, it provides no visibility into failures.

Consider logging dropped errors or at least documenting this behavior, especially since the function later returns a NotFound error when no numeric thread IDs are found—making it unclear whether the directory was truly empty or whether all entries failed to read.

📝 Possible approaches

Option 1: Log dropped errors

     let mut tids = std::fs::read_dir(&task_dir)?
-        .filter_map(Result::ok)
+        .filter_map(|entry| {
+            entry.or_else(|err| {
+                log::debug!("Skipping task entry: {}", err);
+                Err(err)
+            }).ok()
+        })
         .filter_map(|entry| {

Option 2: Document the silent-skip behavior

 /// Reads Linux thread IDs from a procfs root, usually `/proc`.
 ///
+/// Directory entries that cannot be read are silently skipped for robustness.
+///
 /// # Errors
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let mut tids = std::fs::read_dir(&task_dir)?
.filter_map(Result::ok)
.filter_map(|entry| {
entry
.file_name()
.to_str()
.and_then(|name| name.parse().ok())
})
.collect::<Vec<u32>>();
let mut tids = std::fs::read_dir(&task_dir)?
.filter_map(|entry| {
entry.or_else(|err| {
log::debug!("Skipping task entry: {}", err);
Err(err)
}).ok()
})
.filter_map(|entry| {
entry
.file_name()
.to_str()
.and_then(|name| name.parse().ok())
})
.collect::<Vec<u32>>();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/platform.rs` around lines 66 - 74, The current read_dir pipeline for
building tids silently drops errors via filter_map(Result::ok); update it to
either log each dropped error or propagate the first encountered error so
callers can distinguish an empty dir from read failures: modify the iterator
that uses read_dir(&task_dir)? and filter_map(Result::ok) so errors are handled
explicitly (e.g., replace filter_map(Result::ok) with a map that matches
Result::Err(e) and either calls log::warn! with context including task_dir and e
before returning None, or returns Err(e) to propagate using a try/collect
pattern), keeping the rest of the file-name-to-u32 parsing and the tids Vec<u32>
collection unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant