DEBUG(issue831): instrumented dev_issue831 QEMU CI run (do not merge) #836
Closed
lacraig2 wants to merge 13 commits into
Use the new Jinja templating in auto-generated config patches so
architecture-dependent values are resolved from core.arch at load time instead
of being compiled into the generated YAML.
- Expose two arch-derived template variables, {{ arch_dir }} (generic static
subdir, e.g. intel64 -> x86_64) and {{ dylib_dir }} (prebuilt-dylib subdir,
e.g. aarch64 -> arm64), in templating.build_context, derived from the file's
core.arch (lazy/guarded imports so schema-gen contexts are unaffected).
- Add arch.get_dylib_subdir() as the single source of truth for the dylib
mapping; BasePatch.set_arch_info now uses it instead of an inline copy.
- BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs
and /igloo/utils host paths instead of baking the subdir in. core.arch and
core.kernel stay concrete (they are the source values).
- Tests for the derived vars, mapping parity, in-patch resolution, and that the
generator emits the placeholders. Docs updated.
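A minimal sketch of the load-time resolution described above, using only the standard library as a stand-in for Jinja. The function names mirror those in the commit message, but their signatures, the fall-through behaviour for unlisted architectures, and the example host path are assumptions; only the two mappings named above (intel64 -> x86_64, aarch64 -> arm64) come from the commit.

```python
import re

# Only these two mappings are stated in the commit message.
# Falling through to the arch name itself for anything else is an assumption.
ARCH_DIR = {"intel64": "x86_64"}
DYLIB_DIR = {"aarch64": "arm64"}


def get_arch_subdir(arch):
    """Generic static subdir for an arch (hypothetical helper)."""
    return ARCH_DIR.get(arch, arch)


def get_dylib_subdir(arch):
    """Prebuilt-dylib subdir for an arch; single source of truth for the mapping."""
    return DYLIB_DIR.get(arch, arch)


def build_context(config):
    """Derive the template variables from the file's core.arch."""
    arch = config["core"]["arch"]
    return {
        "arch_dir": get_arch_subdir(arch),
        "dylib_dir": get_dylib_subdir(arch),
    }


# Stand-in for Jinja: resolves simple {{ name }} placeholders only.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text, context):
    return _PLACEHOLDER.sub(lambda m: context[m.group(1)], text)


# Hypothetical host path; the real generated paths are not shown in the commit.
template = "dylibs/{{ dylib_dir }}/example.so"
resolved = render(template, build_context({"core": {"arch": "aarch64"}}))
```

Because the placeholder is resolved when the config is loaded, the same generated patch text stays valid if core.arch changes.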
Capture which step faults on the ppc64 MTD read-back without aborting:
- enable kernel print-fatal-signals (logs faulting comm + nip to console.log)
- run the read-back as a standalone cat (capture exit code + size + hexdump) instead of inside a pipeline, before and after the write, repeated
- only print the PASS markers when the read-back actually contains the new data, so working arches stay green while ppc64 records diagnostics
Temporary diagnostic commit; revert once the fault is characterized.
Green run showed the standalone 'cat /dev/mtdN > file' read-back works on ppc64 -- the segfault is bound to the 'cat | strings | grep' pipeline. Add stage isolation (strings-on-file, cat|cat, cat|strings, full pipe) and gate PASS on the original pipeline so ppc64 fails and uploads artifacts with the per-step exit codes + the kernel fatal-signal line.
Warm-up reads made the fault vanish (commits 1-2 went green), so the fault needs the piped read to be the first read after the write. Restore the exact original single-read/write/single-piped-read order and only arm fault capture (print-fatal-signals, show_unhandled_signals, core_pattern -> shared, ulimit -c) so the failing ppc64 run uploads the faulting process+nip and a coredump.
The ppc64 native_mmap MTD read-back segfault is a host-side memory bug, not a test/data bug: a standalone 'cat /dev/mtdN > file' read-back returns correct data, but the original 'cat | strings | grep' pipeline deterministically segfaults a sibling process, and any test-script perturbation hides it. The MTD read callback delivered data via plugins.mem.write -> the PANDA virtual_memory_write_external fast path, which translates the guest kernel read buffer's virtual address host-side. On ppc64 that translation appears unreliable for some kbuf addresses and corrupts a co-running process. Add an opt-in write_bytes(prefer_portal=True) that skips the PANDA fast path and writes via the guest-executed portal path, and use it for the native MTD read callback. Test script restored to the exact original faulting sequence so CI verifies the fix without perturbing the heisenbug.
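The opt-in described above could look roughly like the sketch below. The class and the two injected callables are hypothetical stand-ins, not the real plugin API; only the write_bytes name and the prefer_portal keyword come from the commit message, and the default of False is inferred from "opt-in".

```python
class GuestMemory:
    """Hypothetical stand-in for the memory plugin; not the real API."""

    def __init__(self, fast_write, portal_write):
        self._fast_write = fast_write      # host-side translation + write
        self._portal_write = portal_write  # write executed inside the guest

    def write_bytes(self, addr, data, prefer_portal=False):
        """Write data at a guest virtual address and report the path taken."""
        if prefer_portal:
            # Skip the host-side fast path entirely.
            self._portal_write(addr, data)
            return "portal"
        self._fast_write(addr, data)
        return "fast"


# Record which path each call takes instead of touching a real guest.
calls = []
mem = GuestMemory(
    fast_write=lambda addr, data: calls.append(("fast", addr, data)),
    portal_write=lambda addr, data: calls.append(("portal", addr, data)),
)

# The native MTD read callback would opt in explicitly.
mem.write_bytes(0x1000, b"mtd-data", prefer_portal=True)
```

Keeping the flag opt-in leaves every other caller on the existing fast path, so only the MTD read callback changes behaviour.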
Restore the original PANDA virtual-memory write in the MTD read callback (the suspected-faulty path) and, after each write, read the same kernel VA back two ways: via PANDA (cpu_memory_rw_debug, same host translation) and via the portal (guest-executed, the guest's real mapping). Log any mismatch + the VA to results/issue831_probe.txt.
- portal_rb != src (or panda_rb != portal_rb) => PANDA write hit a different physical page than the guest sees -> wrong-PA confirmed (host translation bug)
- everything matches => write lands correctly; retract the corruption story (mechanism is timing/coherency, fix is perturbation)
Add read_bytes(prefer_portal=) to mirror write_bytes. A never-matching verifier condition forces artifact upload so the probe file is captured even if the heisenbug crash does not reproduce. Guest test script left byte-identical.
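The comparison rule in this probe can be sketched as a small pure function. The name probe_line and the log line format are assumptions for illustration; the mismatch condition and the variable names src, panda_rb, and portal_rb follow the commit message.

```python
def probe_line(va, src, panda_rb, portal_rb):
    """Return a log line if the two read-backs disagree with the write, else None.

    va        -- kernel virtual address that was written
    src       -- bytes handed to the PANDA write
    panda_rb  -- bytes read back through the host-side (PANDA) path
    portal_rb -- bytes read back through the guest-executed portal path
    """
    if portal_rb != src or panda_rb != portal_rb:
        # The guest sees something other than what the host wrote.
        return (
            f"MISMATCH va={va:#x} src={src.hex()} "
            f"panda_rb={panda_rb.hex()} portal_rb={portal_rb.hex()}"
        )
    return None


# Matching read-backs produce no log line.
clean = probe_line(0x1000, b"\x01\x02", b"\x01\x02", b"\x01\x02")

# A portal read-back that differs from the source is reported with its VA.
dirty = probe_line(0x1000, b"\x01\x02", b"\x01\x02", b"\x00\x00")
```

A caller would append each non-None line to the probe file; an empty file after a run then corresponds to the "everything matches" outcome.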
Revert the wrong-PA probe and the forced-failure verifier sentinel. The probe showed that whenever the PANDA write is observable (panda + portal read-back), panda==portal==src on every read incl. post-write read-backs at the same kernel VA class as the faulting run -> the 'wrong physical page' hypothesis is NOT supported; the per-read portal round-trip also perturbed the heisenbug away (mmap_test PASSed under the probe). Conclusion: the fault is in the host-side PANDA memory write (cpu_memory_rw_debug) used by the MTD read callback on ppc64, is timing/coherency-sensitive (any observation hides it), and writing via the guest-executed portal path deterministically avoids it. That portal-write fix is what remains here.
Single CI run, two contemporaneous data points (both use the PANDA write path):
- native_mmap: revert fix -> PANDA write, full native-mmap machinery intact (qemu_mem aperture + dev/proc/anon mmap). BASELINE, expect red.
- mtd_only (new minimal plugin+test): one OOP MTD device on the PANDA write path, NO qemu_mem aperture, NO dev/proc/anon mmap, NO micropython -- just the cat|strings|grep read-back.
If native_mmap red but mtd_only green -> native-mmap/qemu_mem machinery is required to trigger the fault. If mtd_only also red -> it is the MTD callback's PANDA write itself, independent of native mmap.
Temporary; restore fix after.
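The decision rule for this experiment, written out as a sketch. The function name is hypothetical, and treating every combination not named in the commit message as "inconclusive" is an assumption.

```python
def interpret(native_mmap_red, mtd_only_red):
    """Map the two CI results to the conclusion stated in the commit message."""
    if native_mmap_red and mtd_only_red:
        return "MTD callback's PANDA write itself, independent of native mmap"
    if native_mmap_red and not mtd_only_red:
        return "native-mmap/qemu_mem machinery is required to trigger the fault"
    # Baseline did not reproduce: not covered by the commit message (assumption).
    return "inconclusive"


verdict = interpret(native_mmap_red=True, mtd_only_red=False)
```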
…olding
The mtd_only isolation test was inconclusive: adding it to the suite shifted timing/layout enough to perturb the heisenbug away (baseline native_mmap with the PANDA write also passed). Differential testing via suite/script edits is confounded by the bug's fragility. Revert the experiment and keep the host-side portal-write fix; correct the code comments to the honest mechanism (timing-sensitive fault in the PANDA host write; not a confirmed wrong-PA).
…ite path
Temporary CI debug branch for rehosting/qemu#9. QEMU_VERSION=dev_issue831 pulls the instrumented penguin-qemu build; native_mmap forces the original PANDA host-write path (prefer_portal=False) so the ppc64 fault reproduces in CI under the instrumented QEMU. NOT for merge: revert both before landing.
…cts)
The branch is based on config-reshape-patches and conflicts with main, so the pull_request event can't build a merge ref and CI never runs. Add a push trigger for this branch. Dockerfile comment touched so the push delta counts as 'code' and run_tests fires. Remove before merge.
Temporary debug PR to characterize rehosting/qemu#9 (the host-side cpu_memory_rw_debug write that corrupts ppc64 guest execution underlying #831), in the only environment where it reproduces: CI.

What this does
- QEMU_VERSION=dev_issue831 → pulls the instrumented penguin-qemu build (rehosting/qemu tag dev_issue831 → release vdev_issue831).
- native_mmap forces the original PANDA host-write path (prefer_portal=False) so the ppc64 native_mmap_mtd fault reproduces under the instrumented QEMU.

What to read
The powerpc64 run_tests job log. The instrumented QEMU logs only anomalies relative to the clean local baseline (where the path was mechanically clean -- BQL always held, 0 CODE-page hits across 38k writes):
- [ISSUE831] *** debug-write hits CODE page ... *** -- host write invalidating TBs from a live callback (the lead hypothesis), and/or
- [ISSUE831] dbg_write ... bql=0 ... -- callback running without the BQL held

NOT for merge
Revert both changes before landing. The real fix for #831 is the portal-write workaround on config-reshape-patches (prefer_portal=True).
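
The two anomaly markers named under "What to read" can be counted in a saved job log with a short filter like the sketch below. It matches only the fragments quoted in this description; the rest of each log line is elided here, so the function name, the category labels, and the sample lines in the usage are assumptions.

```python
import re
from collections import Counter

# Match only the fragments quoted in the PR description.
_BQL_UNHELD = re.compile(r"\bbql=0\b")


def scan_issue831(lines):
    """Count the two [ISSUE831] anomaly kinds in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if "[ISSUE831]" not in line:
            continue
        if "debug-write hits CODE page" in line:
            counts["code_page_hit"] += 1
        elif "dbg_write" in line and _BQL_UNHELD.search(line):
            counts["bql_not_held"] += 1
    return counts


# Synthetic sample lines for illustration only; real log lines carry more fields.
sample = [
    "unrelated boot message",
    "[ISSUE831] *** debug-write hits CODE page ... ***",
    "[ISSUE831] dbg_write ... bql=0 ...",
    "[ISSUE831] dbg_write ... bql=1 ...",
]
counts = scan_issue831(sample)
```

A run with both counters at zero matches the clean local baseline described above.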