quicker unhappy jobs #233

@cortlandstarrett

At present, an unhappy job finishes only when the hanging job timer expires.
The rationale is to allow enough time for a critical event to arrive and trigger an alarm condition. This makes sense, but it has a few costs:

  1. It reuses a timer that is intended for other purposes.
  2. It is not configurable separately from the hanging job.
  3. It is a long timer, so unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second with a 30-second hanging job timer, as many as 1500 jobs could be waiting to end. This affects our max jobs per worker setting.
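
The 1500 figure in item 3 is rate multiplied by hold time (Little's law). A minimal sketch of that arithmetic, assuming the worst case in which every arriving job is unhappy; the function name is made up for illustration, and the 2-second value is only a hypothetical shorter timer, not a proposed setting:

```python
def held_unhappy_jobs(jobs_per_second: float, hold_time_s: float) -> float:
    """Steady-state number of jobs waiting to end.

    Little's law: jobs in the system = arrival rate x time each job is held.
    Assumes every arriving job is unhappy (worst case).
    """
    return jobs_per_second * hold_time_s


# Figures from the issue: 50 jobs/s held for the 30 s hanging job timer.
current = held_unhappy_jobs(50, 30)

# Hypothetical: the same load with a 2 s unhappy-job timer.
shorter = held_unhappy_jobs(50, 2)
```

With the issue's numbers `current` is 1500; with the hypothetical 2-second timer `shorter` is 100, which is the kind of reduction a separate, shorter timer would buy.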

It might be good to add a configuration value for this timer.
Another option is to use the intra-event timer, which can be very short.
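
A rough sketch of what a separate configuration value might look like. All names here (`TimerConfig`, `unhappy_job_timeout_s`, and so on) are hypothetical and not taken from the existing configuration. The idea shown is a new optional setting that falls back to the hanging job timer when unset, so existing deployments keep their current behaviour:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimerConfig:
    # Existing timer; 30 s is the example value used in this issue.
    hanging_job_timeout_s: float = 30.0
    # Hypothetical new setting; None means "not configured".
    unhappy_job_timeout_s: Optional[float] = None

    def effective_unhappy_timeout_s(self) -> float:
        """Timeout applied to unhappy jobs.

        Falls back to the hanging job timer when the new value is unset,
        which matches today's behaviour.
        """
        if self.unhappy_job_timeout_s is None:
            return self.hanging_job_timeout_s
        return self.unhappy_job_timeout_s


default_cfg = TimerConfig()
tuned_cfg = TimerConfig(unhappy_job_timeout_s=2.0)
```

Here `default_cfg.effective_unhappy_timeout_s()` returns the hanging job value and `tuned_cfg.effective_unhappy_timeout_s()` returns the shorter one. Reusing the intra-event timer instead would avoid a new setting, but would again tie two purposes to one value.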

Another thought: allow the unhappy job to finish quickly, but detect critical events in the Job Gone Horribly Wrong state, which is entered when a "stray event" from a previous job arrives.
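
To make the idea concrete, here is a minimal sketch under stated assumptions. It is not how the current state machine is implemented: the class, the method names, the return labels, and the event type strings are all hypothetical. It only illustrates finishing the unhappy job immediately, remembering its ID, and raising the alarm if a critical event later arrives as a stray event:

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StrayEventDetector:
    # Event types that should trigger an alarm (hypothetical names).
    critical_event_types: Set[str]
    # IDs of unhappy jobs that were allowed to finish quickly.
    finished_unhappy: Set[str] = field(default_factory=set)
    alarms: List[str] = field(default_factory=list)

    def finish_unhappy(self, job_id: str) -> None:
        """End the job now; keep only its ID rather than the whole job."""
        self.finished_unhappy.add(job_id)

    def on_event(self, job_id: str, event_type: str) -> str:
        """Classify an incoming event.

        Returns "active" for a job that has not finished, "stray" for a
        late non-critical event, and "alarm" for a late critical event.
        """
        if job_id not in self.finished_unhappy:
            return "active"
        # Stray event from a previous job: the Job Gone Horribly Wrong case.
        if event_type in self.critical_event_types:
            self.alarms.append(f"{job_id}:{event_type}")
            return "alarm"
        return "stray"


detector = StrayEventDetector(critical_event_types={"CriticalFailure"})
detector.finish_unhappy("job-1")
result = detector.on_event("job-1", "CriticalFailure")
```

In this sketch the memory held per finished job shrinks to one ID, but the set of IDs still grows without bound, so a real version would need some expiry for old entries.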
