We had an interesting failure mode where our Galera cluster became inaccessible for about 8-9 seconds. That was long enough for every Cavalcade worker to crash and then fail to start enough times that systemd gave up on restarting them. We're now using RestartSec=300, which is probably too conservative, but I figure five minutes is long enough to ride out transient failures without having a negative impact on site performance.
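
For anyone hitting the same thing, here is a rough sketch of the settings involved, written as a systemd drop-in. The unit name `cavalcade.service`, the drop-in path, and `Restart=on-failure` are assumptions for illustration, not our actual config; only `RestartSec=300` comes from the setup described above. The commented-out start-limit options are the knobs that decide when systemd "gives up", shown only as an alternative to consider.

```ini
# Hypothetical drop-in: /etc/systemd/system/cavalcade.service.d/restart.conf
# (unit name and path are assumptions; adjust to your own service)

[Unit]
# systemd stops restarting a unit once it has been started more than
# StartLimitBurst times within StartLimitIntervalSec. Setting the
# interval to 0 disables that rate limiting entirely.
# Left commented out here because it is only an alternative, not what we run.
#StartLimitIntervalSec=0
#StartLimitBurst=5

[Service]
# Assumed restart policy; Restart=always would also work with RestartSec.
Restart=on-failure
# Wait five minutes between restart attempts (the value mentioned above).
RestartSec=300
```

After adding or changing a drop-in, run `systemctl daemon-reload` so systemd picks it up. With the stock defaults (five starts within ten seconds, as far as I know), restarts spaced five minutes apart should never trip the start limit, which is why a long `RestartSec` avoids the "gave up" state even without touching the start-limit options.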