Hi @abdulfatir and the Chronos team,
First, I sincerely apologize for the disruption caused by my previous PRs (#454, #456). I understand that opening significant architectural changes without prior discussion creates unnecessary noise, especially when they deviate from the project's core roadmap.
I am currently deploying Chronos in a high-throughput production environment and have identified two specific bottlenecks. I wanted to share my findings and ask if architectural support for these use cases aligns with your long-term goals.
1. High-Throughput Inference (Removing the CPU-GPU Sync)
I profiled the `predict()` loop and found that moving tensors between CPU and GPU at every generation step is a significant bottleneck for low-latency applications.
- Experiment: I implemented a generation loop that keeps the context and predictions entirely on VRAM until completion.
- Result: On local benchmarks (MPS/CUDA), this yielded a ~5x improvement in throughput for batch inference.
- Proposal: Instead of modifying the core `ChronosModel`, would you be open to an optional `ChronosFastPipeline` (or similar utility) designed specifically for production inference where latency is critical?
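To make the idea concrete, here is a minimal, framework-agnostic sketch of the device-resident loop. `step_fn` is a hypothetical stand-in for the model's single-step forward pass (it is not a Chronos API); the point is simply that the context grows on the device and only one device-to-host transfer happens, after the loop:

```python
import torch

def generate_on_device(step_fn, context: torch.Tensor, horizon: int) -> torch.Tensor:
    """Autoregressive generation that keeps all tensors on one device.

    step_fn: maps a (batch, length) tensor to a (batch, 1) next-step
             prediction; a stand-in for the model's forward pass.
    """
    preds = []
    for _ in range(horizon):
        next_step = step_fn(context)                  # stays on context.device
        preds.append(next_step)
        context = torch.cat([context, next_step], dim=-1)
    # One device-to-host transfer at the end, instead of one per step.
    return torch.cat(preds, dim=-1).cpu()

# Toy step function: predict the running mean of the context.
def mean_step(ctx: torch.Tensor) -> torch.Tensor:
    return ctx.mean(dim=-1, keepdim=True)

out = generate_on_device(mean_step, torch.ones(2, 8), horizon=4)
```

In the real pipeline, `step_fn` would wrap the tokenizer and model call, but the transfer pattern is the same.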
2. Static Covariates for Fine-Tuning
I reviewed the discussion in #352 and understand that pretrained checkpoints do not support static covariates. However, for users fine-tuning on retail datasets (where item metadata is constant), repeating static features across the temporal dimension significantly increases memory usage.
- Proposal: Would you consider accepting a `static_embedding` module in the architecture that is disabled by default?
- Benefit: This would allow advanced users to fine-tune custom models with metadata efficiently, without breaking compatibility for users of the pretrained checkpoints.
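A rough sketch of what I have in mind, assuming the module sits after the input projection (the class name, constructor signature, and insertion point are all illustrative, not existing Chronos code). When disabled, it is an identity, so pretrained checkpoints load and behave unchanged:

```python
from typing import Optional

import torch
import torch.nn as nn

class StaticEmbedding(nn.Module):
    """Optional static-covariate embedding, disabled by default.

    With enabled=False the module passes hidden states through untouched,
    preserving checkpoint compatibility. With enabled=True, static features
    are embedded once per series and broadcast over time, avoiding the
    memory cost of repeating them along the temporal dimension.
    """

    def __init__(self, d_model: int, n_static: int = 0, enabled: bool = False):
        super().__init__()
        self.enabled = enabled and n_static > 0
        if self.enabled:
            self.proj = nn.Linear(n_static, d_model)

    def forward(
        self,
        hidden: torch.Tensor,                 # (batch, time, d_model)
        static: Optional[torch.Tensor] = None,  # (batch, n_static)
    ) -> torch.Tensor:
        if not self.enabled or static is None:
            return hidden
        # Embed once, then broadcast across the time axis.
        return hidden + self.proj(static).unsqueeze(1)
```

Fine-tuning code would pass the metadata tensor explicitly; default construction leaves the forward pass byte-identical to today's behavior.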
I am happy to keep these optimizations in my own fork if they are out of scope, but I wanted to offer them properly in case they benefit the community.
Thanks for your work on this state-of-the-art model.