At present, an unhappy job finishes only when the hanging job timer expires.
The rationale is to allow sufficient time for a critical event to arrive, which triggers an alarm condition. This makes sense, but there are a few problems with it:
- It reuses a timer that is intended for other purposes.
- It is not configurable separately from the hanging job timer.
- It is a long timer, which means that unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second and a 30 second hanging job timer, this could expand to 1500 jobs waiting to end in the worst case. This impacts our max jobs per worker setting.
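The memory estimate in the last point is just arrival rate multiplied by hold time. The sketch below reproduces that arithmetic; the function name and the `unhappy_fraction` parameter are illustrative assumptions and not part of the existing system.

```python
def jobs_waiting_to_end(jobs_per_second: float,
                        timer_seconds: float,
                        unhappy_fraction: float = 1.0) -> float:
    """Estimate how many unhappy jobs are held in memory at steady state.

    Arrival rate of unhappy jobs multiplied by how long each one is held.
    unhappy_fraction is a hypothetical knob: 1.0 models the worst case
    in which every job is unhappy.
    """
    return jobs_per_second * unhappy_fraction * timer_seconds


# Worst case from the text: 50 jobs/s held for a 30 s hanging job timer.
print(jobs_waiting_to_end(50, 30))  # 1500.0
```

Shortening the hold time reduces the estimate in direct proportion, which is why a shorter or separately configured timer helps.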
It might be good to add a configuration value for this timer.
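As a rough sketch of what such a configuration value could look like, the example below adds a dedicated unhappy job timeout that falls back to the hanging job timer when unset, so existing deployments keep today's behaviour. The class and field names (`TimerConfig`, `unhappy_job_timeout_s`, `hanging_job_timeout_s`) are hypothetical and do not come from the existing configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimerConfig:
    # Existing timer; 30 s is the figure used in the example above.
    hanging_job_timeout_s: float = 30.0
    # Hypothetical new setting; None means "not configured".
    unhappy_job_timeout_s: Optional[float] = None

    def effective_unhappy_timeout_s(self) -> float:
        """Return the timeout to apply to an unhappy job."""
        if self.unhappy_job_timeout_s is None:
            # Fall back to the current behaviour.
            return self.hanging_job_timeout_s
        return self.unhappy_job_timeout_s


print(TimerConfig().effective_unhappy_timeout_s())                           # 30.0
print(TimerConfig(unhappy_job_timeout_s=2.0).effective_unhappy_timeout_s())  # 2.0
```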
Another option is to use the intra-event timer, which can be very short.
A further thought is to allow the unhappy job to finish quickly but still detect critical events in the Job Gone Horribly Wrong state, which is entered if a "stray event" from a previous job arrives.
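One way to picture this last idea is shown below. It is a minimal sketch under stated assumptions: that a finished unhappy job leaves behind a lightweight record, that a stray event can be matched to that record, and that events carry a flag saying whether they are critical. None of the names here (`FinishedJob`, `on_stray_event`, `is_critical`) are taken from the existing code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class JobState(Enum):
    UNHAPPY_FINISHED = auto()      # job ended quickly, no long timer held
    GONE_HORRIBLY_WRONG = auto()   # a stray event arrived after the job ended


@dataclass
class FinishedJob:
    """Hypothetical lightweight record kept after an unhappy job finishes."""
    job_id: str
    state: JobState = JobState.UNHAPPY_FINISHED
    alarms: List[str] = field(default_factory=list)

    def on_stray_event(self, event_type: str, is_critical: bool) -> None:
        # Any stray event moves the job into Job Gone Horribly Wrong.
        self.state = JobState.GONE_HORRIBLY_WRONG
        # Only critical events raise an alarm condition.
        if is_critical:
            self.alarms.append(event_type)


job = FinishedJob(job_id="job-1")
job.on_stray_event("late_critical_event", is_critical=True)
print(job.state.name, job.alarms)
```

The trade-off is that the full job no longer waits in memory for the critical event, at the cost of retaining enough information to recognise a stray event later.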