
Conversation

@hawkw
Member

@hawkw hawkw commented Jan 5, 2026

Depends on #2313, #2350, #2358
Fixes #2309

It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or a faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than the SP reporting panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.

This commit adds code to packrat for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in this comment. There's a detailed description of how this works in the module-level RustDoc for ereport.rs in Packrat.

The ereports that come out of this thing look like this:

eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 15 10:27:41.370 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 15 10:27:41.372 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 4e54b7f1-e13a-d9bb-709a-c7e863d64a64
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 4

ereports:
0x1: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(378010),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}

0x3: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(382914),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}

0x4: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(1),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(388215),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}
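
For reference, a rough sketch of how an ereport body shaped like the hubris.fault.panic entry above could be encoded as a CBOR map with a minicbor-style encoder. This is not the actual Packrat code; the function name, its arguments, and the exact encoder API used here are assumptions.

// Sketch only: encoding a `hubris.fault.panic` ereport body as a CBOR map,
// with field names taken from the faux-mgs output above.
use minicbor::encode::{Error, Write};
use minicbor::Encoder;

fn encode_panic_ereport<W: Write>(
    e: &mut Encoder<W>,
    task_name: &str,
    task_gen: u8,
    uptime_ms: u64,
    panic_msg: &str,
) -> Result<(), Error<W::Error>> {
    e.map(7)? // seven key/value pairs, matching the output above
        .str("ereport_message_version")?.u32(0)?
        .str("hubris_task_gen")?.u8(task_gen)?
        .str("hubris_task_name")?.str(task_name)?
        .str("hubris_uptime_ms")?.u64(uptime_ms)?
        .str("k")?.str("hubris.fault.panic")?
        .str("msg")?.str(panic_msg)?
        .str("v")?.u32(0)?;
    Ok(())
}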

Base automatically changed from eliza/read-panic-message to master January 5, 2026 23:36
@hawkw hawkw force-pushed the eliza/fault-ereport branch from 0adecf4 to bc6268d Compare January 5, 2026 23:59
@hawkw hawkw added the service processor, psc, gimlet, cosmo, fault-management, and ⚠️ ereport labels Jan 6, 2026
@hawkw
Member Author

hawkw commented Jan 6, 2026

Thinking about things a bit more, there's some more changes I think I want to make here before it's really ready to land. In particular:

  • Currently, we've changed the fault cooldown behavior in Jefe to just always wait 50 ms between restarts, rather than only doing so when a task has not been running for at least that long between two subsequent faults (see https://github.com/oxidecomputer/hubris/blob/b31df4359ffc4f06136474fee952e32f9466b34e/task/jefe/src/main.rs). This means that there's now always 50 ms of latency for all task restarts. This is to give Packrat time to generate an ereport, but it feels a bit not great.

    I think we should be doing a somewhat more complex thing here. We should probably implement the approach that @cbiffle described in ereport: hubris task panicked/faulted #2309 (comment), and add a way for Packrat to let Jefe know it has finished generating an ereport for a fault. That way, we can possibly reduce the latency for restarts a bit by saying "we will always give Packrat up to 50ms to produce a fault report, but if it finishes before then and the task has already been running for a while, we will restart it sooner".

  • Currently, if all faulted tasks have already been restarted by the time Packrat actually processes the "task faulted" notification, Packrat just does nothing.1 We should maybe fix this by having packrat do some kind of "some task probably faulted but I couldn't figure out which one" ereport, so that it's not totally lost.

Personally, I think we should definitely do the second point here (some kind of "task faults may have occurred" ereport) before merging this PR. I'm on the fence about whether the first point (reducing restart latency) is worth doing now or not. It's a bit more complexity in Jefe...
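
For concreteness, a rough sketch of what that restart-delay decision might look like. None of this is implemented; `ereport_done` (a hypothetical "Packrat finished its ereport" signal) and `uptime_before_fault_ms` are made-up names, and the real logic would live in Jefe's fault handling.

// Sketch only: how long to hold a faulted task before restarting it, given a
// hypothetical acknowledgement from Packrat that the ereport has been written.
const FAULT_COOLDOWN_MS: u64 = 50;

fn restart_delay_ms(ereport_done: bool, uptime_before_fault_ms: u64) -> u64 {
    if ereport_done && uptime_before_fault_ms >= FAULT_COOLDOWN_MS {
        // Packrat has already captured the fault and the task wasn't
        // crash-looping, so restart it immediately.
        0
    } else {
        // Otherwise, hold the task for the full cooldown so Packrat has a
        // chance to read the panic message out of the dead task.
        FAULT_COOLDOWN_MS
    }
}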

@cbiffle, any thoughts?

Footnotes

  1. Well, it ringbufs about it, but in production, that's equivalent to "doing nothing".

hawkw added a commit that referenced this pull request Jan 6, 2026
Comment on lines 449 to 462
if faulted_tasks == 0 {
    // Okay, this is a bit weird. We got a notification saying tasks
    // have faulted, but by the time we scanned for faulted tasks, we
    // couldn't find any. This means one of two things:
    //
    // 1. The fault notification was spurious (in practice, this
    //    probably means someone is dorking around with Hiffy and sent
    //    a fake notification just to mess with us...)
    // 2. Tasks faulted, but we were not scheduled for at least 50ms
    //    after the faults occurred, and Jefe has already restarted
    //    them by the time we were permitted to run.
    //
    // We should probably record some kind of ereport about this.
}
Member Author

i still wanna figure out what i need to put in the ereport in this case --- what class should it be, etc. hubris.fault.maybe_faults or something weird like that.

It's also a bit strange because the function for recording an ereport in the ereport ringbuffer requires a task ID as part of the insert function. For all the other ereports, I've used the ID of the task that faulted for that field, rather than the ID of Packrat (who is actually generating the ereport) or Jefe (who is spiritually sort of responsible for reporting it in some vibes-based way); this felt like the right thing in general. However, when the ereport just says "some task may have faulted", I'm not totally sure what ID I want to put in here, since I don't want to incorrectly suggest that Jefe or Packrat has faulted...hmm...

Collaborator

I think taskID of Packrat and a distinguishing class would be fine.

@cbiffle
Collaborator

cbiffle commented Jan 6, 2026

So I think the "max of 50ms" interplay between Jefe and Packrat sounds promising, and doesn't seem like too much additional supervisor complexity -- particularly while crashdumps are still in Jefe. If we want to reduce complexity, I'd start there.

I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

That said, packrat is by nature typically just one priority level under Jefe, so it should be able to respond in a timely fashion in most cases. The thing most likely to starve it is ... crash dumps.

@hawkw
Member Author

hawkw commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

@hawkw
Member Author

hawkw commented Jan 6, 2026

> I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful. As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.
>
> Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of hubris_num_tasks::NUM_TASKS counters or some such.

Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

Here's my attempt at doing that, which is both conceptually quite elegant and implementationally somewhat disgusting: eliza/fault-ereport...eliza/fault-counts#diff-48cf874f5ac8432941e2ba390792b33a94f9aea18dd933bbdb105cd23b93c9ee
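
For illustration only (not the linked diff), the shape of that scheme might look something like the following. The `read_gen` callback stands in for however Packrat would read task i's current generation (e.g. via a sys_refresh_task_id-style call); that part is an assumption here.

// Sketch only: Packrat remembers each task's last-seen generation and, on a
// fault notification, re-reads every task's generation and diffs.
use hubris_num_tasks::NUM_TASKS;

struct GenWatcher {
    last_gen: [u8; NUM_TASKS],
}

impl GenWatcher {
    /// Re-read every task's generation and report the indices whose
    /// generation has changed since the previous scan.
    fn scan(
        &mut self,
        read_gen: impl Fn(usize) -> u8,
        mut report: impl FnMut(usize),
    ) {
        for i in 0..NUM_TASKS {
            let gen = read_gen(i);
            if gen != self.last_gen[i] {
                self.last_gen[i] = gen;
                report(i);
            }
        }
    }
}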

@cbiffle
Collaborator

cbiffle commented Jan 6, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.

I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.

But for now it's basically equivalent I think?

@hawkw
Member Author

hawkw commented Jan 7, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.
>
> I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.
>
> But for now it's basically equivalent I think?

After a bit more thinking, I'm thinking about going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than NUM_TASKS syscalls), and it lets us use a bigger counter than the u8 generation number, which reduces the likelihood that the counter will wrap around and end up at the same value it was last time Packrat checked, missing the fault.

On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...

@hawkw
Member Author

hawkw commented Jan 12, 2026

> Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted.
>
> I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent.
> But for now it's basically equivalent I think?
>
> After a bit more thinking, I'm thinking about going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than NUM_TASKS syscalls), and it lets us use a bigger counter than the u8 generation number, which reduces the likelihood that the counter will wrap around and end up at the same value it was last time Packrat checked, missing the fault.
>
> On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...

OKAY NEVERMIND IT TURNS OUT I WAS ACTUALLY SUPER WRONG ABOUT THIS AND WE ACTUALLY DO NEED A DEDICATED NOTION OF A "FAULT COUNTER". Using a "restart count" (whether generations from refresh_task_id or the full-size 32-bit generation from the kernel) won't work, because the restart count/generation is incremented when the task is restarted. If Packrat responds to a Jefe notification in a timely manner and tries to handle the fault before the task has been fully restarted, it will actually appear to have not faulted, because the generation count hasn't been incremented yet...and that's exactly the time window during which we can read the panic message and fault details. So that wrecks it. Never mind. It also doesn't work in situations where the task is held by Hiffy and doesn't get restarted.

I'm going to go back to the Jefe fault counter IPC, since that would actually do the right thing, and I don't feel like making Packrat a client of the supervisor is the end of the world.

hawkw added a commit that referenced this pull request Jan 12, 2026
This commit adds a new fault-counting capability to Jefe to track the
total number of times a given task has faulted. This is requisite for
#2341 for reasons I discussed in detail in [this comment][1] and
friends. To wit, we cannot easily use the task's generation from its
task ID (or the corresponding 32-bit generation count from the kernel)
for detecting faults, as those count the number of times the task has
*restarted*, and we hope that in the common case, Packrat will generate
ereports for faults *before* Jefe has actually restarted the faulted
task, so that the panic message can be read out of the dead task's
corpse before it's clobbered and so forth. Also, a task may fault and
*not* be restarted, such as if someone is "influencing Jefe externally"
via the use of `humility jefe -H`. And, finally, tasks may explicitly
ask to be restarted without faulting. Thus, we must ask Jefe for an
actual fault counter, rather than attempting to use the generation
number as an imitation fault counter.

The new IPC returns an array of `u32` counters that's
`hubris_num_tasks::NUM_TASKS` long, since the intended use case is to
read _all_ fault counts in Packrat and perform a scan for tasks whose
counts have changed. This felt better than doing a separate IPC for each
task. I've feature flagged this thing so that we can save several bytes
of Jefe on the little boards.

[1]:
#2341 (comment)
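
For concreteness, a rough sketch of the Jefe-side bookkeeping this describes; the type and method names are assumptions, not the actual code. The key property is that the counter is bumped when the fault is observed, not when the task is restarted, which is exactly what the generation number cannot provide.

// Sketch only: per-task fault counters maintained by Jefe.
use hubris_num_tasks::NUM_TASKS;

struct FaultCounts {
    counts: [u32; NUM_TASKS],
}

impl FaultCounts {
    fn new() -> Self {
        Self { counts: [0; NUM_TASKS] }
    }

    /// Called when Jefe observes that task `index` has faulted, before any
    /// decision about whether or when to restart it.
    fn record_fault(&mut self, index: usize) {
        self.counts[index] = self.counts[index].wrapping_add(1);
    }

    /// Serve the fault-count IPC: copy every counter into the caller's buffer.
    fn copy_to(&self, out: &mut [u32; NUM_TASKS]) {
        *out = self.counts;
    }
}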
hawkw added a commit that referenced this pull request Jan 13, 2026
@hawkw hawkw force-pushed the eliza/fault-ereport branch from 75538b7 to e23d05c Compare January 13, 2026 19:37
hawkw added a commit that referenced this pull request Jan 14, 2026
@hawkw hawkw force-pushed the eliza/fault-ereport branch from e23d05c to bc192b9 Compare January 14, 2026 00:03
@hawkw
Member Author

hawkw commented Jan 14, 2026

current thing works:

eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 14 10:00:29.534 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 14 10:00:29.535 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 83369fba-36bc-4759-7fb9-0d58d92d014a
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 1

ereports:
0x1: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}


eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- hiffy -c Ereportulator.panicme
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.11s
     Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- hiffy -c Ereportulator.panicme`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
Ereportulator.panicme() => Err(<server died; its new ID is 1>)

eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f user_leds
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.11s
     Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f user_leds`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
humility: successfully changed disposition for user_leds

eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f ereportulator
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.12s
     Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f ereportulator`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
humility: successfully changed disposition for ereportulator

eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 14 10:00:55.075 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 14 10:00:55.076 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 83369fba-36bc-4759-7fb9-0d58d92d014a
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 4

ereports:
0x1: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(46439),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}

0x3: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(52888),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}

0x4: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(1),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(62089),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}

hawkw added a commit that referenced this pull request Jan 14, 2026
This commit adds a new fault-counting capability to Jefe to track the
total number of times a given task has faulted. This is requisite for
#2341 for reasons I discussed in detail in [this comment][1] and
friends. To wit, we cannot easily use the task's generation from its
task ID (or the corresponding 32-bit generation count from the kernel)
for detecting faults, as those count the number of times the task has
*restarted*, and we hope that in the common case, Packrat will generate
ereports for faults *before* Jefe has actually restarted the faulted
task, so that the panic message can be read out of the dead task's
corpse before it's clobbered and so forth. Also, a task may fault and
*not* be restarted, such as if someone is "influencing Jefe externally"
via the use of `humility jefe -H`. And, finally, tasks may explicitly
ask to be restarted without faulting. Thus, we must ask Jefe for an
actual fault counter, rather than attempting to use the generation
number as an imitation fault counter.

The new IPC takes a leased array of `u32` counters that's
`hubris_num_tasks::NUM_TASKS` long, since the intended use case is to
read _all_ fault counts in Packrat and perform a scan for tasks whose
counts have changed. This felt better than doing a separate IPC for each
task. To use a leased fixed-size array effectively here required an
`idol-runtime` change (oxidecomputer/idolatry#71) in order to allow
`jefe` to write to the array piece-by-piece, without having to construct
an array on its stack and write the whole thing into the lease. This
branch also updates the `idol-runtime` dependency to include this
change.

I've feature flagged this thing so that we can save several bytes of
Jefe on the little boards.

[1]:
    #2341 (comment)
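
On the consuming side, Packrat's use of that IPC might look roughly like the following sketch; the `fetch_counts` callback stands in for the generated Idol client call, whose exact shape is an assumption here.

// Sketch only: diff a fresh snapshot of fault counters against the last one
// Packrat saw, and report how many new faults each changed task has had.
use hubris_num_tasks::NUM_TASKS;

struct FaultWatcher {
    last_counts: [u32; NUM_TASKS],
}

impl FaultWatcher {
    fn scan(
        &mut self,
        fetch_counts: impl FnOnce(&mut [u32; NUM_TASKS]),
        mut report: impl FnMut(usize, u32),
    ) {
        let mut counts = [0u32; NUM_TASKS];
        // One IPC: Jefe fills the leased array with per-task fault counts.
        fetch_counts(&mut counts);
        for i in 0..NUM_TASKS {
            let nfaults = counts[i].wrapping_sub(self.last_counts[i]);
            if nfaults != 0 {
                // `nfaults` new faults since the last scan; this is where an
                // ereport would be generated (with an `nfaults` field when >1).
                report(i, nfaults);
                self.last_counts[i] = counts[i];
            }
        }
    }
}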
hawkw added a commit that referenced this pull request Jan 14, 2026
@hawkw hawkw force-pushed the eliza/fault-ereport branch from 11fe610 to bcd7a04 Compare January 14, 2026 22:30
Collaborator

@cbiffle cbiffle left a comment

Gonna need to spend more time reading this, but, two notes from the initial pass...

@hawkw hawkw force-pushed the eliza/fault-ereport branch from 9b4d1f0 to 682de7b Compare January 15, 2026 18:20
Comment on lines +803 to +807
// If the task has faulted multiple times since the last ereport we
// generated for it, record the count.
if nfaults > 1 {
    encoder.str("nfaults")?.u32(nfaults as u32)?;
}
Member Author

perhaps we really should produce a separate ereport for held faults and the most recent fault, so that we're not implying that they're multiple instances of the same fault kind when we're able to read the cause of the most recent one...

hawkw and others added 2 commits January 15, 2026 13:49
@hawkw hawkw enabled auto-merge (squash) January 15, 2026 22:00
@hawkw hawkw merged commit d37a381 into master Jan 15, 2026
174 checks passed
@hawkw hawkw deleted the eliza/fault-ereport branch January 15, 2026 22:07