
sled-agent should handle Propolis zones being deleted before the service has come up #9688

Description

@hawkw

Thanks to @jmpesp chaos-monkeying the control plane in the Canada region, we have discovered that sled-agent does not seem to properly handle the abrupt disappearance of a Propolis zone (e.g. due to a zoneadm halt and zonecfg delete -F) before the Propolis service in that zone has actually started.

Previously, we merged #7794 in an attempt to fix sled-agent's confusion in the event of forceful deletion of Propolis zones (see #7563). However, that change only works by making the InstanceStateMonitor task check whether the zone still exists when it sees an HTTP communication error while trying to check on the instance's state. That covers the case where a Propolis zone started successfully, sled-agent observed it starting and began monitoring the instance, and the zone was then deleted. In the case where the zone is deleted before the Propolis service has started, however, monitoring never begins, so that check never runs.

For instance, James posted the following sled-agent log fragment:

19:22:18.438Z WARN SledAgent (InstanceManager): wait for service svc:/milestone/single-user:default in zone Some("oxz_propolis-server_17398ae4-34dc-4e5b-89e7-18bf70d4526d") failed: Property not found. retry in 745.248441ms
    file = illumos-utils/src/svc.rs:36
    instance_id = f5a2b1fe-c737-4334-be88-0d7eb8d3451c
    propolis_id = 17398ae4-34dc-4e5b-89e7-18bf70d4526d
    zone = oxz_propolis-server_17398ae4-34dc-4e5b-89e7-18bf70d4526d
19:22:18.602Z WARN SledAgent (InstanceManager): wait for service svc:/milestone/single-user:default in zone Some("oxz_propolis-server_fce209bb-e6b7-4935-85f8-b5c2c041f44c") failed: Property not found. retry in 689.749231ms
    file = illumos-utils/src/svc.rs:36
    instance_id = b8519513-c81d-4e01-ab44-f0da6ab17876
    propolis_id = fce209bb-e6b7-4935-85f8-b5c2c041f44c
    zone = oxz_propolis-server_fce209bb-e6b7-4935-85f8-b5c2c041f44c

This is logged by the wait_for_service function in illumos-utils.

I'm pretty sure that means we were in RunningZone::boot, which calls wait_for_service with the FMRI svc:/milestone/single-user:default:

let fmri = "svc:/milestone/single-user:default";
zone.zones_api
    .wait_for_service(Some(&zone.name), fmri, zone.log.clone())
    .await
    .map_err(|_| BootError::Timeout {
        service: fmri.to_string(),
        zone: zone.name.to_string(),
    })?;

That, in turn, was called by InstanceManager here:

let running_zone = RunningZone::boot(installed_zone).await?;
info!(self.log, "Started propolis in zone: {}", zname);

I think we can fix this by making wait_for_service also check whether the zone exists at all when we see a "property not found" error, and return a non-retryable error in that case. However, we should be careful to make sure there isn't a possible race there --- can we also see that error in a case where a zone hasn't yet been created? I'm not sure.
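To make that concrete, here is a rough sketch of the classification, written directly against the backoff crate rather than our retry_notify_ext wrapper. zone_exists and check_service_online are hypothetical stand-ins for whatever zone-lookup and SMF-property calls illumos-utils actually makes, and the error type is simplified; the point is just that the "zone is gone" case becomes a permanent (non-retryable) error while everything else stays transient:

use std::time::Duration;

use backoff::ExponentialBackoffBuilder;

#[derive(Debug, thiserror::Error)]
enum WaitError {
    #[error("SMF property not found")]
    PropertyNotFound,
    #[error("zone {0} no longer exists")]
    ZoneGone(String),
}

// Hypothetical stand-in for whatever zone-lookup helper illumos-utils exposes.
async fn zone_exists(_zone: &str) -> bool {
    unimplemented!("query zoneadm / the zones API here")
}

// Hypothetical stand-in for the SMF property check wait_for_service performs.
async fn check_service_online(_zone: &str, _fmri: &str) -> Result<(), WaitError> {
    unimplemented!()
}

// Sketch of wait_for_service using backoff::future::retry (the crate's "tokio"
// feature), with the same intervals retry_policy_local configures.
async fn wait_for_service(zone: &str, fmri: &str) -> Result<(), WaitError> {
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        .build();
    backoff::future::retry(policy, || async move {
        match check_service_online(zone, fmri).await {
            Ok(()) => Ok(()),
            Err(WaitError::PropertyNotFound) => {
                // The property may be missing either because the zone is still
                // coming up or because the zone was deleted out from under us;
                // only the latter should stop the retry loop.
                if zone_exists(zone).await {
                    Err(backoff::Error::transient(WaitError::PropertyNotFound))
                } else {
                    Err(backoff::Error::permanent(WaitError::ZoneGone(
                        zone.to_string(),
                    )))
                }
            }
            // Any other failure is assumed transient and retried as before.
            Err(e) => Err(backoff::Error::transient(e)),
        }
    })
    .await
}

That way a zone that is merely slow to boot keeps being retried, and only the deleted-zone case bails out (modulo the possible race mentioned above).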

As an aside, I'll also note that the higher-level code that calls wait_for_service seems to expect there to be an eventual timeout around the retry loop, which should also implicitly handle this case (see the error types returned by RunningZone::boot and the code in instance_manager that waits for the Propolis service). However, as far as I can tell, there is no actual timeout here: wait_for_service just calls backoff::retry_notify_ext with retry_policy_local, which configures the backoff policy as follows:

pub fn retry_policy_local() -> ::backoff::ExponentialBackoff {
    backoff_builder()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        .build()
}

This never calls the ExponentialBackoffBuilder::with_max_elapsed_time method, which is how one sets a timeout in the backoff crate's API. So, as far as I can tell, there will never be a timeout, and this will retry indefinitely. We might want to fix that as well, or at least change the error returned in that case.
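For reference, the builder takes the overall deadline as an Option<Duration>, so a capped variant of the policy would be a one-line change. A sketch (the function name and the 60-second figure are purely illustrative, not a proposed value):

use std::time::Duration;

use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

// Like retry_policy_local, but with an overall deadline: once retries have gone
// on for longer than max_elapsed_time in total, the retry loop gives up and the
// last error is returned to the caller instead of looping forever.
pub fn retry_policy_local_capped() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        // Illustrative bound only; picking the real value is a design decision.
        .with_max_elapsed_time(Some(Duration::from_secs(60)))
        .build()
}

That would at least make the Timeout error path that RunningZone::boot already defines reachable, though the zone-existence check above seems like the more direct fix for this particular failure mode.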

    Labels

    Sled Agent (Related to the Per-Sled Configuration and Management), bug (Something that isn't working.)
