DEBUG(issue831): instrumented dev_issue831 QEMU CI run (do not merge)#836

Closed
lacraig2 wants to merge 13 commits into main from issue831-ci-debug

@lacraig2

Collaborator

Temporary debug PR to characterize rehosting/qemu#9 (the host-side cpu_memory_rw_debug write that corrupts ppc64 guest execution underlying #831), in the only environment where it reproduces — CI.

What this does

  • QEMU_VERSION=dev_issue831 → pulls the instrumented penguin-qemu build (rehosting/qemu tag dev_issue831 → release vdev_issue831).
  • native_mmap forces the original PANDA host-write path (prefer_portal=False) so the ppc64 native_mmap_mtd fault reproduces under the instrumented QEMU.

What to read

The powerpc64 run_tests job log. The instrumented QEMU logs only anomalies relative to the clean local baseline (where the path was mechanically clean — BQL always held, 0 CODE-page hits across 38k writes):

  • [ISSUE831] *** debug-write hits CODE page ... *** — host write invalidating TBs from a live callback (the lead hypothesis), and/or
  • [ISSUE831] dbg_write ... bql=0 ... — callback running without the BQL held,
  • plus a periodic heartbeat line.
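The anomaly classes listed above could be tallied from a saved copy of the job log with a small script along these lines. This is a minimal sketch, not part of the PR: the function names are hypothetical, and because the full line format is elided in the description, it matches only on the substrings quoted there.

```python
from collections import Counter
from typing import Iterable, Optional

# Substrings quoted in the PR description. The elided "..." parts of each
# line are unknown, so nothing beyond these fragments is assumed.
TAG = "[ISSUE831]"
CODE_PAGE = "debug-write hits CODE page"
DBG_WRITE = "dbg_write"
BQL_NOT_HELD = "bql=0"


def classify_line(line: str) -> Optional[str]:
    """Return an anomaly class for one log line, or None if it is untagged."""
    if TAG not in line:
        return None
    if CODE_PAGE in line:
        return "code_page_hit"
    if DBG_WRITE in line and BQL_NOT_HELD in line:
        return "bql_not_held"
    # Anything else carrying the tag, e.g. the periodic heartbeat, whose
    # exact format is not given in the description.
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count tagged lines per anomaly class."""
    counts: Counter = Counter()
    for line in lines:
        kind = classify_line(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Calling `summarize` on an open file handle for the downloaded powerpc64 run_tests log would give a per-class count; a nonzero `code_page_hit` or `bql_not_held` count is what the description treats as a departure from the clean local baseline.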

NOT for merge

Revert both changes before landing. The real fix for #831 is the portal-write workaround on config-reshape-patches (prefer_portal=True).

Luke Craig and others added 13 commits June 9, 2026 20:30
Use the new Jinja templating in auto-generated config patches so
architecture-dependent values are resolved from core.arch at load time instead
of being compiled into the generated YAML.

- Expose two arch-derived template variables, {{ arch_dir }} (generic static
  subdir, e.g. intel64 -> x86_64) and {{ dylib_dir }} (prebuilt-dylib subdir,
  e.g. aarch64 -> arm64), in templating.build_context, derived from the file's
  core.arch (lazy/guarded imports so schema-gen contexts are unaffected).
- Add arch.get_dylib_subdir() as the single source of truth for the dylib
  mapping; BasePatch.set_arch_info now uses it instead of an inline copy.
- BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs
  and /igloo/utils host paths instead of baking the subdir in. core.arch and
  core.kernel stay concrete (they are the source values).
- Tests for the derived vars, mapping parity, in-patch resolution, and that the
  generator emits the placeholders. Docs updated.
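To illustrate the idea of resolving arch-dependent values at load time, here is a minimal standalone sketch. It does not use the project's templating module or Jinja itself: `sketch_context` and `render` are hypothetical names, only the two mappings named in the commit message are included, and falling back to the arch name unchanged for every other arch is an assumption.

```python
import re

# Only the two example mappings from the commit message are listed here.
ARCH_DIR = {"intel64": "x86_64"}   # generic static subdir
DYLIB_DIR = {"aarch64": "arm64"}   # prebuilt-dylib subdir

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def sketch_context(arch: str) -> dict:
    """Derive both template variables from a single arch value.

    Assumption: arches without an explicit mapping use their own name.
    """
    return {
        "arch_dir": ARCH_DIR.get(arch, arch),
        "dylib_dir": DYLIB_DIR.get(arch, arch),
    }


def render(text: str, context: dict) -> str:
    """Substitute {{ name }} placeholders; unknown names raise KeyError."""
    return _PLACEHOLDER.sub(lambda m: context[m.group(1)], text)
```

With this sketch, a generated patch can carry a path such as `dylibs/{{ dylib_dir }}` (a hypothetical path) and have it resolve to `dylibs/arm64` only when the config is loaded with `core.arch` set to aarch64, rather than having the subdir compiled into the YAML.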
Capture which step faults on the ppc64 MTD read-back without aborting:
- enable kernel print-fatal-signals (logs faulting comm + nip to console.log)
- run the read-back as a standalone cat (capture exit code + size + hexdump)
  instead of inside a pipeline, before and after the write, repeated
- only print the PASS markers when the read-back actually contains the new
  data, so working arches stay green while ppc64 records diagnostics

Temporary diagnostic commit; revert once the fault is characterized.

Green run showed the standalone 'cat /dev/mtdN > file' read-back works on
ppc64 -- the segfault is bound to the 'cat | strings | grep' pipeline.
Add stage isolation (strings-on-file, cat|cat, cat|strings, full pipe) and
gate PASS on the original pipeline so ppc64 fails and uploads artifacts with
the per-step exit codes + the kernel fatal-signal line.
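The stage isolation described above comes down to recording an exit status for every stage of a pipeline rather than only the last one. The guest test itself is a shell script; the following is only a host-side Python sketch of the same idea, with a hypothetical helper name, assuming a POSIX system.

```python
import subprocess


def run_pipeline(stages):
    """Run argv lists as a shell-style pipeline.

    Returns (stdout_of_last_stage, [return code of each stage]). On POSIX a
    negative return code -N means the stage was killed by signal N, so a
    segfaulting stage shows up as -11.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    procs = []
    prev_out = subprocess.DEVNULL  # the first stage gets no input
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=prev_out, stdout=subprocess.PIPE)
        if procs:
            # Drop the parent's copy of the pipe so the upstream stage sees
            # a broken pipe if the downstream stage exits early.
            prev_out.close()
        prev_out = proc.stdout
        procs.append(proc)
    output = procs[-1].stdout.read()
    procs[-1].stdout.close()
    return output, [p.wait() for p in procs]
```

Running the same read-back as `cat`, `cat | cat`, `cat | strings`, and the full pipeline through such a helper would show which stage, if any, dies with a signal, which is the information the commit gates PASS on.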
Warm-up reads made the fault vanish (commits 1-2 went green), so the fault
needs the piped read to be the first read after the write. Restore the exact
original single-read/write/single-piped-read order and only arm fault capture
(print-fatal-signals, show_unhandled_signals, core_pattern -> shared, ulimit -c)
so the failing ppc64 run uploads the faulting process+nip and a coredump.

The ppc64 native_mmap MTD read-back segfault is a host-side memory bug, not a
test/data bug: a standalone 'cat /dev/mtdN > file' read-back returns correct
data, but the original 'cat | strings | grep' pipeline deterministically
segfaults a sibling process, and any test-script perturbation hides it.

The MTD read callback delivered data via plugins.mem.write -> the PANDA
virtual_memory_write_external fast path, which translates the guest kernel
read buffer's virtual address host-side. On ppc64 that translation appears
unreliable for some kbuf addresses and corrupts a co-running process.

Add an opt-in write_bytes(prefer_portal=True) that skips the PANDA fast path
and writes via the guest-executed portal path, and use it for the native MTD
read callback. Test script restored to the exact original faulting sequence so
CI verifies the fix without perturbing the heisenbug.
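The opt-in described above amounts to letting the caller bypass a fast path in favour of a slower one. A minimal sketch of that shape follows; it is not the project's `write_bytes`. The function name, the two callables standing in for the PANDA fast path and the portal path, and the fallback to the portal when the fast path reports failure are all assumptions made for illustration.

```python
from typing import Callable

# A writer takes (guest virtual address, data) and reports success.
WriteFn = Callable[[int, bytes], bool]


def write_bytes_sketch(addr: int, data: bytes, *, fast_write: WriteFn,
                       portal_write: WriteFn,
                       prefer_portal: bool = False) -> str:
    """Write data at addr and return the name of the path that was used.

    With prefer_portal=True the fast path is never attempted. Otherwise the
    fast path is tried first and the portal is used only if it fails
    (the fallback is an assumption of this sketch).
    """
    if not prefer_portal and fast_write(addr, data):
        return "fast"
    portal_write(addr, data)
    return "portal"
```

Keeping the default at `prefer_portal=False` leaves existing callers on the fast path, so only the native MTD read callback changes behaviour.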
Restore the original PANDA virtual-memory write in the MTD read callback (the
suspected-faulty path) and, after each write, read the same kernel VA back two
ways: via PANDA (cpu_memory_rw_debug, same host translation) and via the portal
(guest-executed, the guest's real mapping). Log any mismatch + the VA to
results/issue831_probe.txt.

- portal_rb != src (or panda_rb != portal_rb)  => PANDA write hit a different
  physical page than the guest sees -> wrong-PA confirmed (host translation bug)
- everything matches                            => write lands correctly; retract
  the corruption story (mechanism is timing/coherency, fix is perturbation)

Add read_bytes(prefer_portal=) to mirror write_bytes. A never-matching verifier
condition forces artifact upload so the probe file is captured even if the
heisenbug crash does not reproduce. Guest test script left byte-identical.
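The two outcomes of the probe can be written down as a small decision function. This is a sketch of the comparison logic only, using the names from the commit message (`src`, `panda_rb`, `portal_rb`); the function name and the returned labels are hypothetical, and the real probe also logs the VA.

```python
def classify_probe(src: bytes, panda_rb: bytes, portal_rb: bytes) -> str:
    """Compare the written data with both read-backs of the same kernel VA.

    src       -- bytes handed to the PANDA write
    panda_rb  -- read-back via PANDA (same host-side translation)
    portal_rb -- read-back via the portal (the guest's real mapping)
    """
    if portal_rb != src or panda_rb != portal_rb:
        # The guest does not see what was written, or the two views
        # disagree: the write landed on a different physical page.
        return "wrong-PA"
    # panda_rb == portal_rb == src: the write landed where the guest reads.
    return "match"
```

Only a `wrong-PA` result would support the host translation hypothesis; a run of `match` results means the mechanism has to be sought elsewhere.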
Revert the wrong-PA probe and the forced-failure verifier sentinel. Whenever
the PANDA write was observable (PANDA + portal read-back), the probe showed
panda == portal == src on every read, including post-write read-backs at the
same kernel VA class as the faulting run, so the 'wrong physical page'
hypothesis is NOT supported. The per-read portal round-trip also perturbed the
heisenbug away (mmap_test PASSed under the probe).

Conclusion: the fault is in the host-side PANDA memory write (cpu_memory_rw_debug)
used by the MTD read callback on ppc64, is timing/coherency-sensitive (any
observation hides it), and writing via the guest-executed portal path
deterministically avoids it. That portal-write fix is what remains here.

Single CI run, two contemporaneous data points (both use the PANDA write path):
- native_mmap: revert fix -> PANDA write, full native-mmap machinery intact
  (qemu_mem aperture + dev/proc/anon mmap). BASELINE, expect red.
- mtd_only (new minimal plugin+test): one OOP MTD device on the PANDA write
  path, NO qemu_mem aperture, NO dev/proc/anon mmap, NO micropython -- just the
  cat|strings|grep read-back.

If native_mmap red but mtd_only green -> native-mmap/qemu_mem machinery is
required to trigger the fault. If mtd_only also red -> it is the MTD callback's
PANDA write itself, independent of native mmap. Temporary; restore fix after.
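The interpretation of the two data points can be summarized as a small truth table. The sketch below encodes the two cases stated above; the function name and labels are hypothetical. Treating a green baseline with green mtd_only as inconclusive, and applying the mtd_only-red reading even when the baseline is green, are assumptions not spelled out in the commit message.

```python
def interpret(native_mmap_red: bool, mtd_only_red: bool) -> str:
    """Map the pass/fail state of the two tests to a conclusion."""
    if mtd_only_red:
        # The minimal plugin faults without any native-mmap machinery, so
        # the MTD callback's PANDA write is sufficient on its own.
        return "mtd_callback_write"
    if native_mmap_red:
        # Only the full test faults: the native-mmap/qemu_mem machinery is
        # required to trigger the fault.
        return "native_mmap_required"
    # Neither reproduced (assumption: nothing can be concluded from that).
    return "inconclusive"
```

The third branch matters for a timing-sensitive fault: if adding the second test perturbs the run enough that the baseline also goes green, the experiment says nothing about either hypothesis.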
…olding

The mtd_only isolation test was inconclusive: adding it to the suite shifted
timing/layout enough to perturb the heisenbug away (baseline native_mmap with
the PANDA write also passed). Differential testing via suite/script edits is
confounded by the bug's fragility. Revert the experiment and keep the
host-side portal-write fix; correct the code comments to the honest mechanism
(timing-sensitive fault in the PANDA host write; not a confirmed wrong-PA).

…ite path

Temporary CI debug branch for rehosting/qemu#9. QEMU_VERSION=dev_issue831
pulls the instrumented penguin-qemu build; native_mmap forces the original
PANDA host-write path (prefer_portal=False) so the ppc64 fault reproduces in
CI under the instrumented QEMU. NOT for merge: revert both before landing.

…cts)

The branch is based on config-reshape-patches and conflicts with main, so the
pull_request event can't build a merge ref and CI never runs. Add a push
trigger for this branch. Dockerfile comment touched so the push delta counts
as 'code' and run_tests fires. Remove before merge.
@lacraig2 lacraig2 closed this Jun 11, 2026
@lacraig2 lacraig2 deleted the issue831-ci-debug branch June 11, 2026 22:29