This repository was archived by the owner on May 19, 2026. It is now read-only.

Possible Memory Estimation Issue Leading to OOMs and Restarts #72

@VibhuJawa

Description:

We're experiencing frequent OOMs and restarts during batch processing. The system repeatedly downsizes batches due to insufficient memory, but it's unclear if memory estimation is the root cause or if something else is contributing to the problem.

Stack Trace Highlights:

The following warnings and errors were observed, indicating repeated attempts to allocate memory for batch sizes that the system cannot handle:

Not enough memory for a batch size of 1024. Retrying with a new batch size of 512.
Not enough memory for a batch size of 512. Retrying with a new batch size of 256.
Not enough memory for a batch size of 256. Retrying with a new batch size of 128.
Not enough memory for a batch size of 128. Retrying with a new batch size of 64
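
For reference, the log lines above are consistent with a halve-and-retry loop along the following lines. This is a minimal sketch, not the project's actual code: `run_with_backoff`, `process_batch`, and `min_batch_size` are hypothetical names, and Python's built-in `MemoryError` stands in for whatever out-of-memory error the real backend raises.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_backoff(
    items: Sequence[T],
    process_batch: Callable[[Sequence[T]], List[R]],  # hypothetical worker
    batch_size: int = 1024,
    min_batch_size: int = 1,
) -> List[R]:
    """Process items in batches, halving the batch size after each OOM."""
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:  # assumption: stand-in for the backend's OOM error
            new_size = batch_size // 2
            if new_size < min_batch_size:
                # Halving further is not possible; surface the original error.
                raise
            print(
                f"Not enough memory for a batch size of {batch_size}. "
                f"Retrying with a new batch size of {new_size}."
            )
            batch_size = new_size
            continue  # retry the same slice with the smaller batch size
        start += len(batch)
    return results
```

If a loop of this shape keeps halving and still fails at small batch sizes, batch size alone may not be the cause, which is why other contributors are worth ruling out.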

Action Needed:

We need to investigate the memory estimation logic and other potential causes to address the instability.
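
One possible starting point for that investigation is to record the estimated and the observed peak memory for each batch and flag the batches where the estimate was too low. The sketch below assumes both numbers can be collected somehow; `MemorySample`, `find_underestimates`, and the 10% `tolerance` default are hypothetical and not part of this repository.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class MemorySample:
    """One measurement; all fields are assumed to be collected by the caller."""
    batch_size: int
    estimated_bytes: int
    observed_peak_bytes: int


def find_underestimates(
    samples: Iterable[MemorySample],
    tolerance: float = 0.10,  # hypothetical slack allowed above the estimate
) -> List[MemorySample]:
    """Return samples whose observed peak exceeds the estimate by more than tolerance."""
    return [
        s for s in samples
        if s.observed_peak_bytes > s.estimated_bytes * (1 + tolerance)
    ]
```

If no samples are flagged while OOMs still occur, the estimate is probably not the root cause and attention can shift to other contributors.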
