ereport: Task faulted/panicked #2341
Conversation
Force-pushed from 0adecf4 to bc6268d
Thinking about things a bit more, there's some more changes I think I want to make here before it's really ready to land. In particular:
Personally, I think we should definitely do the second point here (some kind of "task faults may have occurred" ereport) before merging this PR. I'm on the fence about whether the first point (reducing restart latency) is worth doing now or not. It's a bit more complexity in Jefe... @cbiffle, any thoughts?
task/packrat/src/ereport.rs (Outdated)
```rust
if faulted_tasks == 0 {
    // Okay, this is a bit weird. We got a notification saying tasks
    // have faulted, but by the time we scanned for faulted tasks, we
    // couldn't find any. This means one of two things:
    //
    // 1. The fault notification was spurious (in practice, this
    //    probably means someone is dorking around with Hiffy and sent
    //    a fake notification just to mess with us...)
    // 2. Tasks faulted, but we were not scheduled for at least 50ms
    //    after the faults occurred, and Jefe has already restarted
    //    them by the time we were permitted to run.
    //
    // We should probably record some kind of ereport about this.
}
```
i still wanna figure out what i need to put in the ereport in this case --- what class should it be, etc. hubris.fault.maybe_faults or something weird like that.
It's also a bit strange because the function for recording an ereport in the ereport ring buffer takes a task ID as one of its arguments. For all the other ereports, I've used the ID of the task that faulted for that field, rather than the ID of Packrat (who is actually generating the ereport) or Jefe (who is spiritually sort of responsible for reporting it in some vibes-based way); this felt like the right thing in general. However, when the ereport just says "some task may have faulted", I'm not totally sure what ID I want to put in here, since I don't want to incorrectly suggest that Jefe or Packrat has faulted...hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think taskID of Packrat and a distinguishing class would be fine.
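For concreteness, a minimal sketch of that shape, assuming a CBOR-ish encoder with the chained `str`/`u32` style that appears later in this review; `Encoder`, `EncodeError`, and the `hubris.fault.maybe_faults` class are the names floated above, not a settled API:

```rust
// Sketch only: Packrat's own task ID plus a distinguishing class, as
// agreed above. `Encoder` imitates the chained `str(..)?` / `u32(..)?`
// style seen elsewhere in this review; the real recording API (and the
// insert function that takes the task ID) is not shown here.
struct Encoder;
struct EncodeError;

impl Encoder {
    fn str(&mut self, _s: &str) -> Result<&mut Self, EncodeError> {
        Ok(self)
    }
    fn u32(&mut self, _v: u32) -> Result<&mut Self, EncodeError> {
        Ok(self)
    }
}

/// Encode a "some task may have faulted" ereport body. The class name
/// is the one floated above and may well change.
fn encode_maybe_faults(encoder: &mut Encoder) -> Result<(), EncodeError> {
    // A distinguishing class, so nobody reads this as Packrat itself
    // having crashed, even though the ereport carries Packrat's ID.
    encoder.str("k")?.str("hubris.fault.maybe_faults")?;
    encoder.str("v")?.u32(0)?;
    Ok(())
}
```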
So I think the "max of 50ms" interplay between Jefe and Packrat sounds promising, and doesn't seem like too much additional supervisor complexity -- particularly while crashdumps are still in Jefe. If we want to reduce complexity, I'd start there. I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful.

As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

That said, packrat is by nature typically just one priority level under Jefe, so it should be able to respond in a timely fashion in most cases. The thing most likely to starve it is ... crash dumps.
Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of …
Actually, upon thinking about this a bit more, there is a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted. Here's my attempt at doing that, which is both conceptually quite elegant and implementationally somewhat disgusting: eliza/fault-ereport...eliza/fault-counts#diff-48cf874f5ac8432941e2ba390792b33a94f9aea18dd933bbdb105cd23b93c9ee
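A minimal sketch of that scheme, under assumed names (the linked branch is the real attempt; `GenScanner`, the value of `NUM_TASKS`, and the closures here are stand-ins for `hubris_num_tasks::NUM_TASKS` and real kernel queries):

```rust
// Sketch of the generation-scan scheme, not the linked diff itself:
// Packrat remembers the last generation it saw for each task index and
// diffs against current generations on every fault notification.
const NUM_TASKS: usize = 8; // stand-in for hubris_num_tasks::NUM_TASKS

struct GenScanner {
    // Last generation we observed for each task index.
    last_gen: [u8; NUM_TASKS],
}

impl GenScanner {
    /// Call when Jefe's fault notification arrives. `current_gen(i)`
    /// would read task i's generation from the kernel; `on_restart(i)`
    /// handles a task whose generation advanced since the last scan.
    fn scan(
        &mut self,
        current_gen: impl Fn(usize) -> u8,
        mut on_restart: impl FnMut(usize),
    ) {
        for i in 0..NUM_TASKS {
            let now = current_gen(i);
            if now != self.last_gen[i] {
                // The task restarted since we last looked. As the next
                // comment points out, this can't distinguish a fault
                // from a voluntary restart.
                on_restart(i);
                self.last_gen[i] = now;
            }
        }
    }
}
```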
I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent. But for now it's basically equivalent I think?
After a bit more thinking, I'm thinking about going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than …)

On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...
OKAY NEVERMIND IT TURNS OUT I WAS ACTUALLY SUPER WRONG ABOUT THIS AND WE ACTUALLY DO NEED A DEDICATED NOTION OF A "FAULT COUNTER". Using a "restart count" (whether generations from refresh_task_id or the full-size 32-bit generation from the kernel) won't work, because the restart count/generation is incremented when the task is restarted. If Packrat responds to a Jefe notification in a timely manner and tries to handle the fault before the task has been fully restarted, it actually will appear to have not faulted, because the generation count hasn't been incremented yet...and that's the entire time window for which we can read an entire panic message and fault details. So that wrecks it. Never mind. It also doesn't work in situations such as one where the task is held by Hiffy and doesn't get restarted. I'm going to go back to the Jefe fault counter IPC, since that would actually do the right thing, and I don't feel like making Packrat a client of the supervisor is the end of the world.
This commit adds a new fault-counting capability to Jefe to track the total number of times a given task has faulted. This is requisite for #2341 for reasons I discussed in detail in [this comment][1] and friends.

To wit, we cannot easily use the task's generation from its task ID (or the corresponding 32-bit generation count from the kernel) for detecting faults, as those count the number of times the task has *restarted*, and we hope that in the common case, Packrat will generate ereports for faults *before* Jefe has actually restarted the faulted task, so that the panic message can be read out of the dead task's corpse before it's clobbered and so forth. Also, a task may fault and *not* be restarted, such as if someone is "influencing Jefe externally" via the use of `humility jefe -H`. And, finally, tasks may explicitly ask to be restarted without faulting. Thus, we must ask Jefe for an actual fault counter, rather than attempting to use the generation number as an imitation fault counter.

The new IPC returns an array of `u32` counters that's `hubris_num_tasks::NUM_TASKS` long, since the intended use case is to read _all_ fault counts in Packrat and perform a scan for tasks whose counts have changed. This felt better than doing a separate IPC for each task.

I've feature flagged this thing so that we can save several bytes of Jefe on the little boards.

[1]: #2341 (comment)
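A sketch of how Packrat might scan those counters once it has them; the IPC call itself is elided and `FaultTracker` is an assumed name, but the count-diffing is the intended use described above:

```rust
// Sketch of Packrat consuming the fault-counter array from Jefe. The
// IPC call itself is elided; NUM_TASKS stands in for
// hubris_num_tasks::NUM_TASKS.
const NUM_TASKS: usize = 8;

struct FaultTracker {
    // Snapshot of each task's fault counter as of our last scan.
    last_counts: [u32; NUM_TASKS],
}

impl FaultTracker {
    /// Diff fresh counters against the snapshot, reporting each task
    /// index that has faulted (and how many times) since the last scan.
    fn scan(
        &mut self,
        counts: &[u32; NUM_TASKS],
        mut report: impl FnMut(usize, u32),
    ) {
        for (i, (&now, last)) in
            counts.iter().zip(self.last_counts.iter_mut()).enumerate()
        {
            // Counters only increment, so wrapping subtraction yields
            // the number of new faults even across u32 wraparound.
            let nfaults = now.wrapping_sub(*last);
            if nfaults > 0 {
                report(i, nfaults);
            }
            *last = now;
        }
    }
}
```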
Force-pushed from 75538b7 to e23d05c
Force-pushed from e23d05c to bc192b9
current thing works:

```
eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 14 10:00:29.534 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 14 10:00:29.535 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 83369fba-36bc-4759-7fb9-0d58d92d014a
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 1
ereports:
0x1: {
"ereport_message_version": Number(0),
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- hiffy -c Ereportulator.panicme
Finished `dev` profile [optimized + debuginfo] target(s) in 0.11s
Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- hiffy -c Ereportulator.panicme`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
Ereportulator.panicme() => Err(<server died; its new ID is 1>)
eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f user_leds
Finished `dev` profile [optimized + debuginfo] target(s) in 0.11s
Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f user_leds`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
humility: successfully changed disposition for user_leds
eliza@hekate ~/Code/oxide/hubris $ HUMILITY_TARGET=gimletlet cargo xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f ereportulator
Finished `dev` profile [optimized + debuginfo] target(s) in 0.12s
Running `target/debug/xtask humility app/gimletlet/app-ereportlet.toml -- jefe -f ereportulator`
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:3754:000B00154D46501520383832 via ST-Link V3
humility: successfully changed disposition for ereportulator
eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 14 10:00:55.075 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 14 10:00:55.076 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 83369fba-36bc-4759-7fb9-0d58d92d014a
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 4
ereports:
0x1: {
"ereport_message_version": Number(0),
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
0x2: {
"ereport_message_version": Number(0),
"hubris_task_gen": Number(0),
"hubris_task_name": String("ereportulator"),
"hubris_uptime_ms": Number(46439),
"k": String("hubris.fault.panic"),
"msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
"v": Number(0),
}
0x3: {
"by": Object {
"gen": Number(0),
"task": String("jefe"),
},
"ereport_message_version": Number(0),
"hubris_task_gen": Number(0),
"hubris_task_name": String("user_leds"),
"hubris_uptime_ms": Number(52888),
"k": String("hubris.fault.injected"),
"v": Number(0),
}
0x4: {
"by": Object {
"gen": Number(0),
"task": String("jefe"),
},
"ereport_message_version": Number(0),
"hubris_task_gen": Number(1),
"hubris_task_name": String("ereportulator"),
"hubris_uptime_ms": Number(62089),
"k": String("hubris.fault.injected"),
"v": Number(0),
}
```
This commit adds a new fault-counting capability to Jefe to track the total number of times a given task has faulted. This is requisite for #2341 for reasons I discussed in detail in [this comment][1] and friends.

To wit, we cannot easily use the task's generation from its task ID (or the corresponding 32-bit generation count from the kernel) for detecting faults, as those count the number of times the task has *restarted*, and we hope that in the common case, Packrat will generate ereports for faults *before* Jefe has actually restarted the faulted task, so that the panic message can be read out of the dead task's corpse before it's clobbered and so forth. Also, a task may fault and *not* be restarted, such as if someone is "influencing Jefe externally" via the use of `humility jefe -H`. And, finally, tasks may explicitly ask to be restarted without faulting. Thus, we must ask Jefe for an actual fault counter, rather than attempting to use the generation number as an imitation fault counter.

The new IPC takes a leased array of `u32` counters that's `hubris_num_tasks::NUM_TASKS` long, since the intended use case is to read _all_ fault counts in Packrat and perform a scan for tasks whose counts have changed. This felt better than doing a separate IPC for each task.

To use a leased fixed-size array effectively here required an `idol-runtime` change (oxidecomputer/idolatry#71) in order to allow `jefe` to write to the array piece-by-piece, without having to construct an array on its stack and write the whole thing into the lease. This branch also updates the `idol-runtime` dependency to include this change.

I've feature flagged this thing so that we can save several bytes of Jefe on the little boards.

[1]: #2341 (comment)
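A sketch of the piece-by-piece lease write on Jefe's side; `WriteAt`/`write_at` stand in for whatever oxidecomputer/idolatry#71 actually exposes, and may not match its real signature:

```rust
// Sketch of Jefe filling the leased counter array one element at a
// time. `WriteAt` is a stand-in for the idol-runtime capability added
// in idolatry#71; the real trait and method names may differ.
struct ClientGone;

trait WriteAt {
    fn write_at(&mut self, index: usize, value: u32) -> Result<(), ClientGone>;
}

fn fill_fault_counts<L: WriteAt>(
    lease: &mut L,
    fault_count_for: impl Fn(usize) -> u32,
    num_tasks: usize,
) -> Result<(), ClientGone> {
    for i in 0..num_tasks {
        // Writing element-by-element keeps Jefe's peak stack usage
        // constant, rather than requiring a NUM_TASKS-sized buffer.
        lease.write_at(i, fault_count_for(i))?;
    }
    Ok(())
}
```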
Force-pushed from 11fe610 to bcd7a04
cbiffle left a comment
Gonna need to spend more time reading this, but, two notes from the initial pass...
Force-pushed from 9b4d1f0 to 682de7b
```rust
// If the task has faulted multiple times since the last ereport we
// generated for it, record the count.
if nfaults > 1 {
    encoder.str("nfaults")?.u32(nfaults as u32)?;
}
```
perhaps we really should produce a separate ereport for held faults and the most recent fault, so that we're not giving the implication that they're multiple instances of the same fault kind when we are able to read the cause of the most recent one...
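Roughly the split being suggested, with closures standing in for the real ereport-recording calls (names here are illustrative only):

```rust
// Sketch of the suggested split: one summary ereport for the N-1 held
// faults whose details were lost, plus a detailed ereport for the most
// recent fault, whose cause we can still read.
fn record_faults(
    nfaults: u32,
    mut emit_lost_summary: impl FnMut(u32),
    mut emit_detailed: impl FnMut(),
) {
    if nfaults > 1 {
        // The earlier faults share nothing but a count; reporting them
        // separately avoids implying they were the same fault kind as
        // the one we can actually inspect.
        emit_lost_summary(nfaults - 1);
    }
    emit_detailed();
}
```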
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Depends on #2313, #2350, #2358
Fixes #2309
It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.
This commit adds code to `packrat` for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP.

This works using an approach similar to the one described by @cbiffle in this comment. There's a detailed description of how this works in the module-level RustDoc for `ereport.rs` in Packrat.

The ereports that come out of this thing look like this: