Gradient vanishing of group attention during pre-training #487

gwyanCN · 2026-04-10T10:37:03Z

Hello, we are implementing our own multivariate Chronos model. While pre-training on the synthetic and real-world data described in the paper, we found that the ratio between the gradients of time attention and group attention becomes increasingly imbalanced as training progresses (fig1), and group attention gradually suffers from vanishing gradients. At inference time, we also observed that group attention has no effect (fig2). Have you encountered this issue before, and if so, how did you resolve it?

fig1:

fig2:
Attention weights of group attention in the first layer and the tenth layer.
Top: our model. Bottom: the open-source model.
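
For anyone who wants to reproduce the gradient-ratio measurement, here is a minimal sketch of one way to track it. It is not the exact code behind fig1. The name substrings `time_attn` and `group_attn` are hypothetical and need to match however the two attention blocks are named in your model. The helper takes plain `(name, gradient values)` pairs so it runs without any framework. With a PyTorch model, one could presumably feed it `((n, p.grad.flatten().tolist()) for n, p in model.named_parameters() if p.grad is not None)` after `loss.backward()`.

```python
import math
from typing import Dict, Iterable, Tuple


def grad_norms_by_module(
    named_grads: Iterable[Tuple[str, Iterable[float]]],
    keys: Tuple[str, ...] = ("time_attn", "group_attn"),  # hypothetical name substrings
) -> Dict[str, float]:
    """L2 norm of all gradients whose parameter name contains each key."""
    sq_sums = {k: 0.0 for k in keys}
    for name, grad in named_grads:
        values = list(grad)  # materialize once so generators can be reused per key
        for k in keys:
            if k in name:
                sq_sums[k] += sum(g * g for g in values)
    return {k: math.sqrt(v) for k, v in sq_sums.items()}


def grad_ratio(
    norms: Dict[str, float],
    num: str = "group_attn",
    den: str = "time_attn",
    eps: float = 1e-12,
) -> float:
    """Ratio of group-attention to time-attention gradient norm (eps avoids 0/0)."""
    return norms[num] / (norms[den] + eps)


# Toy gradients standing in for one training step (made-up numbers).
toy_grads = [
    ("layers.0.time_attn.q.weight", [3.0, 4.0]),
    ("layers.0.group_attn.q.weight", [0.003, 0.004]),
    ("layers.0.mlp.weight", [1.0, 1.0]),  # ignored: matches neither key
]
norms = grad_norms_by_module(toy_grads)
ratio = grad_ratio(norms)
print(norms, ratio)
```

Logging this ratio every few hundred steps, ideally per layer, shows whether the imbalance appears from the start or only develops later in training.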
