Consider using vLLM for LLM inference

Instead of using just Hugging Face Transformers, consider using vLLM:

https://github.com/vllm-project/vllm

vLLM does continuous batching, meaning new requests join the running batch as soon as earlier ones finish, and it is written mostly in Python.
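
To make the continuous batching point concrete, here is a toy sketch that counts decode steps under two scheduling policies. It is not vLLM's scheduler: the function names, the one-token-per-step model, and the fixed slot count are all assumptions made for illustration, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes.

    `lengths` holds the number of tokens each request will generate
    (hypothetical input, assumed to be positive integers).
    """
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 7]
    print(static_batching_steps(lengths, batch_size=2))
    print(continuous_batching_steps(lengths, batch_size=2))
```

With request lengths `[2, 8, 3, 7]` and two slots, the static policy takes 15 steps and the continuous one takes 12, because short requests no longer hold a slot idle while a long one finishes.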