
sled-agent should handle Propolis zones being deleted before the service has come up #9688

Description

@hawkw

Thanks to @jmpesp chaos-monkeying the control plane in the Canada region, we have discovered that sled-agent does not seem to properly handle the abrupt disappearance of a Propolis zone (e.g. due to a zoneadm halt and zonecfg delete -F) before the Propolis service in that zone has actually started.

Previously, we merged #7794 in an attempt to fix sled-agent's confusion in the event of forceful deletion of Propolis zones (see #7563). However, that change only works by making the InstanceStateMonitor task check whether the zone still exists when it sees an HTTP communication error while trying to check on the instance's state. That covers the case where a Propolis zone started successfully, sled-agent observed it starting and began monitoring the instance, and the zone was then deleted. In the case where the zone is deleted before the Propolis service has started, however, monitoring never begins, so that check never runs.

For instance, James posted the following sled-agent log fragment:

19:22:18.438Z WARN SledAgent (InstanceManager): wait for service svc:/milestone/single-user:default in zone Some("oxz_propolis-server_17398ae4-34dc-4e5b-89e7-18bf70d4526d") failed: Property not found. retry in 745.248441ms
    file = illumos-utils/src/svc.rs:36
    instance_id = f5a2b1fe-c737-4334-be88-0d7eb8d3451c
    propolis_id = 17398ae4-34dc-4e5b-89e7-18bf70d4526d
    zone = oxz_propolis-server_17398ae4-34dc-4e5b-89e7-18bf70d4526d
19:22:18.602Z WARN SledAgent (InstanceManager): wait for service svc:/milestone/single-user:default in zone Some("oxz_propolis-server_fce209bb-e6b7-4935-85f8-b5c2c041f44c") failed: Property not found. retry in 689.749231ms
    file = illumos-utils/src/svc.rs:36
    instance_id = b8519513-c81d-4e01-ab44-f0da6ab17876
    propolis_id = fce209bb-e6b7-4935-85f8-b5c2c041f44c
    zone = oxz_propolis-server_fce209bb-e6b7-4935-85f8-b5c2c041f44c

This is logged by the wait_for_service function in illumos-utils.

I'm pretty sure that means we were in RunningZone::boot, which calls wait_for_service with the FMRI svc:/milestone/single-user:default:

let fmri = "svc:/milestone/single-user:default";
zone.zones_api
    .wait_for_service(Some(&zone.name), fmri, zone.log.clone())
    .await
    .map_err(|_| BootError::Timeout {
        service: fmri.to_string(),
        zone: zone.name.to_string(),
    })?;

That, in turn, was called by InstanceManager here:

let running_zone = RunningZone::boot(installed_zone).await?;
info!(self.log, "Started propolis in zone: {}", zname);

I think we can fix this by making wait_for_service also check whether the zone exists at all when we see a "property not found" error, and return a non-retryable error in that case. However, we should be careful to make sure there isn't a possible race there --- can we also see that error in a case where a zone hasn't yet been created? I'm not sure.
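To make that concrete, here is a rough sketch of the classification, written directly against the backoff crate rather than our retry_notify_ext wrapper. zone_exists and check_service_online are hypothetical stand-ins for whatever zone-lookup and SMF-property calls illumos-utils actually makes, and the error type is simplified; the point is just that the "zone is gone" case becomes a permanent (non-retryable) error while everything else stays transient:

use std::time::Duration;

use backoff::ExponentialBackoffBuilder;

#[derive(Debug, thiserror::Error)]
enum WaitError {
    #[error("SMF property not found")]
    PropertyNotFound,
    #[error("zone {0} no longer exists")]
    ZoneGone(String),
}

// Hypothetical stand-in for whatever zone-lookup helper illumos-utils exposes.
async fn zone_exists(_zone: &str) -> bool {
    unimplemented!("query zoneadm / the zones API here")
}

// Hypothetical stand-in for the SMF property check wait_for_service performs.
async fn check_service_online(_zone: &str, _fmri: &str) -> Result<(), WaitError> {
    unimplemented!()
}

// Sketch of wait_for_service using backoff::future::retry (the crate's "tokio"
// feature), with the same intervals retry_policy_local configures.
async fn wait_for_service(zone: &str, fmri: &str) -> Result<(), WaitError> {
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        .build();
    backoff::future::retry(policy, || async move {
        match check_service_online(zone, fmri).await {
            Ok(()) => Ok(()),
            Err(WaitError::PropertyNotFound) => {
                // The property may be missing either because the zone is still
                // coming up or because the zone was deleted out from under us;
                // only the latter should stop the retry loop.
                if zone_exists(zone).await {
                    Err(backoff::Error::transient(WaitError::PropertyNotFound))
                } else {
                    Err(backoff::Error::permanent(WaitError::ZoneGone(
                        zone.to_string(),
                    )))
                }
            }
            // Any other failure is assumed transient and retried as before.
            Err(e) => Err(backoff::Error::transient(e)),
        }
    })
    .await
}

That way a zone that is merely slow to boot keeps being retried, and only the deleted-zone case bails out (modulo the possible race mentioned above).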

As an aside, I'll also note that the higher-level code that calls wait_for_service seems to expect there to be an eventual timeout around the retry loop, which should also implicitly handle this case (see the error types returned by RunningZone::boot and the code in instance_manager that waits for the Propolis service). However, as far as I can tell, there is no actual timeout here: wait_for_service just calls backoff::retry_notify_ext with retry_policy_local, which configures the backoff policy as follows:

pub fn retry_policy_local() -> ::backoff::ExponentialBackoff {
    backoff_builder()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        .build()
}

This never calls the ExponentialBackoffBuilder::with_max_elapsed_time method, which is how one sets a timeout in the backoff crate's API. So, as far as I can tell, there will never be a timeout, and this will retry indefinitely. We might want to fix that as well, or at least change the error returned in that case.
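For reference, the builder takes the overall deadline as an Option<Duration>, so a capped variant of the policy would be a one-line change. A sketch (the function name and the 60-second figure are purely illustrative, not a proposed value):

use std::time::Duration;

use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

// Like retry_policy_local, but with an overall deadline: once retries have gone
// on for longer than max_elapsed_time in total, the retry loop gives up and the
// last error is returned to the caller instead of looping forever.
pub fn retry_policy_local_capped() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(50))
        .with_max_interval(Duration::from_secs(1))
        // Illustrative bound only; picking the real value is a design decision.
        .with_max_elapsed_time(Some(Duration::from_secs(60)))
        .build()
}

That would at least make the Timeout error path that RunningZone::boot already defines reachable, though the zone-existence check above seems like the more direct fix for this particular failure mode.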

    Labels

    Sled Agent (Related to the Per-Sled Configuration and Management), bug (Something that isn't working.)
