DEBUG(issue831): instrumented dev_issue831 QEMU CI run (do not merge) by lacraig2 · Pull Request #836 · rehosting/penguin

lacraig2 · 2026-06-11T16:05:28Z

Temporary debug PR to characterize rehosting/qemu#9 (the host-side cpu_memory_rw_debug write that corrupts ppc64 guest execution underlying #831), in the only environment where it reproduces — CI.

What this does

QEMU_VERSION=dev_issue831 → pulls the instrumented penguin-qemu build (rehosting/qemu tag dev_issue831 → release vdev_issue831).
native_mmap forces the original PANDA host-write path (prefer_portal=False) so the ppc64 native_mmap_mtd fault reproduces under the instrumented QEMU.

What to read

The powerpc64 run_tests job log. The instrumented QEMU logs only anomalies relative to the clean local baseline (where the path was mechanically clean — BQL always held, 0 CODE-page hits across 38k writes):

[ISSUE831] *** debug-write hits CODE page ... *** — host write invalidating TBs from a live callback (the lead hypothesis), and/or
[ISSUE831] dbg_write ... bql=0 ... — callback running without the BQL held,
plus a periodic heartbeat line.
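When reading the job log, a small filter can tally the two anomaly classes. The sketch below is a hedged illustration, not part of the instrumented build: it matches only the fixed fragments quoted above, because the elided parts of each line ("...") are unknown, and it lumps every other tagged line (including the heartbeat, whose format is not given here) into a catch-all bucket.

```python
import re
from collections import Counter
from typing import Optional

TAG = "[ISSUE831]"  # prefix quoted in the PR description


def classify(line: str) -> Optional[str]:
    """Label one log line, or return None if it carries no ISSUE831 tag."""
    if TAG not in line:
        return None
    body = line.split(TAG, 1)[1]
    if "debug-write hits CODE page" in body:
        return "code_page_write"
    if "dbg_write" in body and re.search(r"\bbql=0\b", body):
        return "bql_not_held"
    # Anything else that is tagged, e.g. the periodic heartbeat.
    return "other"


def summarize(log_text: str) -> Counter:
    """Count tagged lines per label across a whole job log."""
    counts = Counter()
    for line in log_text.splitlines():
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts
```

A non-zero `code_page_write` or `bql_not_held` count would correspond to the anomalies listed above; a log with only `other` entries would match the clean local baseline.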

NOT for merge

Revert both changes before landing. The real fix for #831 is the portal-write workaround on config-reshape-patches (prefer_portal=True).

Use the new Jinja templating in auto-generated config patches so architecture-dependent values are resolved from core.arch at load time instead of being compiled into the generated YAML.

- Expose two arch-derived template variables, {{ arch_dir }} (generic static subdir, e.g. intel64 -> x86_64) and {{ dylib_dir }} (prebuilt-dylib subdir, e.g. aarch64 -> arm64), in templating.build_context, derived from the file's core.arch (lazy/guarded imports so schema-gen contexts are unaffected).
- Add arch.get_dylib_subdir() as the single source of truth for the dylib mapping; BasePatch.set_arch_info now uses it instead of an inline copy.
- BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs and /igloo/utils host paths instead of baking the subdir in. core.arch and core.kernel stay concrete (they are the source values).
- Tests for the derived vars, mapping parity, in-patch resolution, and that the generator emits the placeholders. Docs updated.
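To illustrate the load-time resolution, here is a minimal sketch. Several things in it are assumptions: the regex substitution is a stand-in for a real Jinja render, this `build_context` signature is simplified and not the project's, only the two mappings named in the commit message are included (other architectures pass through unchanged), and the host path is hypothetical.

```python
import re

# Only these two mappings appear in the commit message; falling through to
# the unchanged arch name for everything else is an assumption.
ARCH_DIR = {"intel64": "x86_64"}
DYLIB_DIR = {"aarch64": "arm64"}


def build_context(arch: str) -> dict:
    """Derive the two template variables from a core.arch value."""
    return {
        "arch_dir": ARCH_DIR.get(arch, arch),
        "dylib_dir": DYLIB_DIR.get(arch, arch),
    }


_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text: str, context: dict) -> str:
    """Substitute {{ name }} placeholders (stand-in for a Jinja render)."""
    return _PLACEHOLDER.sub(lambda m: context[m.group(1)], text)


# Hypothetical host path; the real generated paths are not shown in the PR.
template = "/hypothetical/static/dylibs/{{ dylib_dir }}"
resolved = render(template, build_context("aarch64"))
```

Because the placeholder survives in the generated YAML, the same patch file would resolve differently if core.arch changed, which is the point of deferring the lookup.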

Capture which step faults on the ppc64 MTD read-back without aborting:

- enable kernel print-fatal-signals (logs faulting comm + nip to console.log)
- run the read-back as a standalone cat (capture exit code + size + hexdump) instead of inside a pipeline, before and after the write, repeated
- only print the PASS markers when the read-back actually contains the new data, so working arches stay green while ppc64 records diagnostics

Temporary diagnostic commit; revert once the fault is characterized.

Green run showed the standalone 'cat /dev/mtdN > file' read-back works on ppc64 -- the segfault is bound to the 'cat | strings | grep' pipeline. Add stage isolation (strings-on-file, cat|cat, cat|strings, full pipe) and gate PASS on the original pipeline so ppc64 fails and uploads artifacts with the per-step exit codes + the kernel fatal-signal line.
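The stage isolation can be pictured with the following sketch. It is an assumption-laden illustration rather than the guest test script (which is a shell script): it runs each pipeline with `subprocess` so that every process reports its own exit code, and the device path, snapshot file, and marker are parameters because the real values are not shown here.

```python
import subprocess


def run_pipeline(commands):
    """Run argv lists as a shell-style pipeline.

    Returns (exit_codes, output_bytes). A negative exit code -N means the
    process was killed by signal N (for example -11 for SIGSEGV on Linux).
    """
    procs = []
    upstream = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()  # drop our copy so EOF/SIGPIPE propagate
        upstream = proc.stdout
        procs.append(proc)
    output = upstream.read()
    upstream.close()
    return [p.wait() for p in procs], output


def isolation_stages(device, snapshot, marker):
    """The four stages named in the commit message, as argv pipelines."""
    return {
        "strings-on-file": [["strings", snapshot]],
        "cat|cat": [["cat", device], ["cat"]],
        "cat|strings": [["cat", device], ["strings"]],
        "full pipe": [["cat", device], ["strings"], ["grep", marker]],
    }
```

Recording one code per process, instead of only the pipeline's final status, is what lets a failing run say which sibling in the pipe died.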

Warm-up reads made the fault vanish (commits 1-2 went green), so the fault needs the piped read to be the first read after the write. Restore the exact original single-read/write/single-piped-read order and only arm fault capture (print-fatal-signals, show_unhandled_signals, core_pattern -> shared, ulimit -c) so the failing ppc64 run uploads the faulting process+nip and a coredump.

The ppc64 native_mmap MTD read-back segfault is a host-side memory bug, not a test/data bug: a standalone 'cat /dev/mtdN > file' read-back returns correct data, but the original 'cat | strings | grep' pipeline deterministically segfaults a sibling process, and any test-script perturbation hides it. The MTD read callback delivered data via plugins.mem.write -> the PANDA virtual_memory_write_external fast path, which translates the guest kernel read buffer's virtual address host-side. On ppc64 that translation appears unreliable for some kbuf addresses and corrupts a co-running process. Add an opt-in write_bytes(prefer_portal=True) that skips the PANDA fast path and writes via the guest-executed portal path, and use it for the native MTD read callback. Test script restored to the exact original faulting sequence so CI verifies the fix without perturbing the heisenbug.
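The opt-in can be sketched as a small dispatcher. This is a hypothetical shape, not the project's code: the two backend callables, the return labels, and the fall-back to the portal when the fast path reports failure are all assumptions made for illustration.

```python
from typing import Callable

# A writer takes (guest virtual address, data) and reports success.
Writer = Callable[[int, bytes], bool]


def make_write_bytes(fast_write: Writer, portal_write: Writer):
    """Build a write_bytes that can skip the host-side fast path."""

    def write_bytes(addr: int, data: bytes, prefer_portal: bool = False) -> str:
        # Default behaviour: try the host-side fast path first.
        if not prefer_portal and fast_write(addr, data):
            return "fast"
        # Opt-in (or fast-path failure): write via the guest-executed path.
        portal_write(addr, data)
        return "portal"

    return write_bytes
```

Keeping the default at `prefer_portal=False` leaves existing callers on the fast path, so only the native MTD read callback changes behaviour.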

… tree change)

Restore the original PANDA virtual-memory write in the MTD read callback (the suspected-faulty path) and, after each write, read the same kernel VA back two ways: via PANDA (cpu_memory_rw_debug, same host translation) and via the portal (guest-executed, the guest's real mapping). Log any mismatch + the VA to results/issue831_probe.txt.

- portal_rb != src (or panda_rb != portal_rb) => PANDA write hit a different physical page than the guest sees -> wrong-PA confirmed (host translation bug)
- everything matches => write lands correctly; retract the corruption story (mechanism is timing/coherency, fix is perturbation)

Add read_bytes(prefer_portal=) to mirror write_bytes. A never-matching verifier condition forces artifact upload so the probe file is captured even if the heisenbug crash does not reproduce. Guest test script left byte-identical.
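The probe's comparison rule can be written down compactly. The sketch below only encodes the mismatch condition stated above; the report line format and the function names are hypothetical, and the two read-backs are passed in as plain bytes rather than fetched from a guest.

```python
from typing import Optional


def check_probe(va: int, src: bytes, panda_rb: bytes, portal_rb: bytes) -> Optional[str]:
    """Return None when all three buffers agree, else a mismatch report."""
    if portal_rb != src or panda_rb != portal_rb:
        # Hypothetical line format; the real probe file layout is not shown.
        return "MISMATCH va=0x%x src=%s panda=%s portal=%s" % (
            va, src.hex(), panda_rb.hex(), portal_rb.hex())
    return None


def log_probe(path: str, va: int, src: bytes, panda_rb: bytes, portal_rb: bytes) -> bool:
    """Append a report to `path` on mismatch; return True if one was logged."""
    report = check_probe(va, src, panda_rb, portal_rb)
    if report is None:
        return False
    with open(path, "a") as fh:
        fh.write(report + "\n")
    return True
```

An empty probe file after a run therefore means every observed write matched, which is the outcome that would retract the wrong-PA story.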

Revert the wrong-PA probe and the forced-failure verifier sentinel. The probe showed that whenever the PANDA write is observable (panda + portal read-back), panda==portal==src on every read incl. post-write read-backs at the same kernel VA class as the faulting run -> the 'wrong physical page' hypothesis is NOT supported; the per-read portal round-trip also perturbed the heisenbug away (mmap_test PASSed under the probe). Conclusion: the fault is in the host-side PANDA memory write (cpu_memory_rw_debug) used by the MTD read callback on ppc64, is timing/coherency-sensitive (any observation hides it), and writing via the guest-executed portal path deterministically avoids it. That portal-write fix is what remains here.

…tifacts

Single CI run, two contemporaneous data points (both use the PANDA write path):

- native_mmap: revert fix -> PANDA write, full native-mmap machinery intact (qemu_mem aperture + dev/proc/anon mmap). BASELINE, expect red.
- mtd_only (new minimal plugin+test): one OOP MTD device on the PANDA write path, NO qemu_mem aperture, NO dev/proc/anon mmap, NO micropython -- just the cat|strings|grep read-back.

If native_mmap red but mtd_only green -> native-mmap/qemu_mem machinery is required to trigger the fault. If mtd_only also red -> it is the MTD callback's PANDA write itself, independent of native mmap. Temporary; restore fix after.
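The two-point experiment amounts to a small decision table, sketched here. The first two branches restate the commit message; the third (baseline green) is an assumption added for completeness, since a differential reading needs the baseline to reproduce.

```python
def interpret(native_mmap_red: bool, mtd_only_red: bool) -> str:
    """Map the two CI outcomes to the conclusion each would support."""
    if mtd_only_red:
        return "MTD callback's PANDA write itself, independent of native mmap"
    if native_mmap_red:
        return "native-mmap/qemu_mem machinery is required to trigger the fault"
    # Assumption: with a green baseline the comparison says nothing.
    return "inconclusive: baseline did not reproduce"
```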

…olding

The mtd_only isolation test was inconclusive: adding it to the suite shifted timing/layout enough to perturb the heisenbug away (baseline native_mmap with the PANDA write also passed). Differential testing via suite/script edits is confounded by the bug's fragility. Revert the experiment and keep the host-side portal-write fix; correct the code comments to the honest mechanism (timing-sensitive fault in the PANDA host write; not a confirmed wrong-PA).

…ite path

Temporary CI debug branch for rehosting/qemu#9. QEMU_VERSION=dev_issue831 pulls the instrumented penguin-qemu build; native_mmap forces the original PANDA host-write path (prefer_portal=False) so the ppc64 fault reproduces in CI under the instrumented QEMU. NOT for merge: revert both before landing.

…cts)

The branch is based on config-reshape-patches and conflicts with main, so the pull_request event can't build a merge ref and CI never runs. Add a push trigger for this branch. Dockerfile comment touched so the push delta counts as 'code' and run_tests fires. Remove before merge.

Luke Craig and others added 13 commits June 9, 2026 20:30

ci(issue831): re-run to confirm portal-write fix is deterministic (no…

51d1d6a

… tree change)

issue831: drop accidentally-committed CLAUDE.md and egg-info build ar…

f6544a8

…tifacts

lacraig2 closed this Jun 11, 2026

lacraig2 deleted the issue831-ci-debug branch June 11, 2026 22:29

lacraig2 wants to merge 13 commits into main from issue831-ci-debug
