This script runs inference for a language model using Hugging Face Transformers, supporting chat-style message input and optional system message prepending.
Requirements:
- Python 3.8+
- torch
- transformers
Install dependencies:
pip install torch transformers
Run inference with a model and a JSON file containing messages:
python inference.py --model_name <model_path_or_name> --message_file messages.json
You can chat interactively with the model. After each response, you can enter a new message (including an empty message) and continue the conversation. Type exit or quit to end the session.
Start interactive mode with no initial message:
python inference.py --model_name <model_path_or_name>
Alternatively, after running with a message or message file, you will be prompted to continue interactively.
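The interactive behavior described above can be sketched as a small loop. This is an illustration, not the script's actual code: `chat_loop`, `generate`, and `read_input` are hypothetical names, and treating exit/quit case-insensitively and stopping on end-of-input are assumptions.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_loop(
    messages: List[Message],
    generate: Callable[[List[Message]], str],
    read_input: Callable[[str], str] = input,
) -> List[Message]:
    """Read user input, call the model, and repeat until exit or quit."""
    while True:
        try:
            user_text = read_input("You: ")
        except EOFError:
            # Assumption: end-of-input (e.g. Ctrl-D) ends the session.
            break
        # Assumption: the exit keywords are matched case-insensitively.
        if user_text.strip().lower() in {"exit", "quit"}:
            break
        # Empty input is still appended and sent to the model.
        messages.append({"role": "user", "content": user_text})
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        print(f"Assistant: {reply}")
    return messages
```

Passing `generate` as a callable keeps the loop independent of any particular model, so it can be exercised with a stub before a real model is loaded.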
To prepend a system message from a text file:
python inference.py --model_name <model_path_or_name> --message_file messages.json --system_message_file system.txt
You can also provide a single message directly:
python inference.py --model_name <model_path_or_name> --message "Hello, how are you?"
Additional options:
- --torch_dtype: Set torch dtype (e.g., auto, float16, bfloat16, float32)
- --max_new_tokens: Max new tokens to generate (default: 1000)
- --device: Device to run the model on (e.g., cpu, cuda:0, mps)
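For orientation, the command-line flags listed in this README could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: the flag names and the 1000-token default come from this README, while `build_parser`, the `required=True` on --model_name, and the other defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags described in this README (defaults partly assumed)."""
    parser = argparse.ArgumentParser(description="Run chat-style inference.")
    # Assumption: the model name is mandatory, since every example passes it.
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--message_file", default=None)
    parser.add_argument("--message", default=None)
    parser.add_argument("--system_message_file", default=None)
    # Assumption: "auto" is the default dtype; the README lists it only as an example.
    parser.add_argument("--torch_dtype", default="auto")
    # Default of 1000 is stated in the README.
    parser.add_argument("--max_new_tokens", type=int, default=1000)
    # Assumption: no device is forced unless one is given.
    parser.add_argument("--device", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```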
The --message_file should be a JSON file containing a list of messages, e.g.:
[
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi! How can I help you?"}
]
Note: You can send empty messages (just press Enter) in interactive mode; these will be appended and sent to the model.
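A message file in this format can be checked before it is handed to the model. The sketch below is one possible validation, not what inference.py necessarily does: `parse_messages` and `load_messages` are hypothetical helpers, and rejecting entries without both keys is an assumption.

```python
import json
from typing import Dict, List


def parse_messages(text: str) -> List[Dict[str, str]]:
    """Parse JSON text and check that it is a list of role/content objects."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("message file must contain a JSON list")
    for i, msg in enumerate(data):
        # Assumption: every entry needs both a 'role' and a 'content' key.
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError(f"message {i} must be an object with 'role' and 'content'")
    return data


def load_messages(path: str) -> List[Dict[str, str]]:
    """Read a message file from disk and validate its contents."""
    with open(path, encoding="utf-8") as f:
        return parse_messages(f.read())
```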
The --system_message_file should be a plain text file. Its content will be prepended as a system message, e.g.:
You are a helpful assistant.
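Prepending the system message amounts to placing one extra entry at the front of the message list. A minimal sketch, assuming the file content is stripped of surrounding whitespace and an empty file adds nothing; `prepend_system_message` is a hypothetical name and the script may handle these cases differently.

```python
from typing import Dict, List


def prepend_system_message(
    messages: List[Dict[str, str]], system_text: str
) -> List[Dict[str, str]]:
    """Return a new list with the system message placed first."""
    # Assumption: trailing newlines from the text file are not meaningful.
    system_text = system_text.strip()
    if not system_text:
        # Assumption: an empty file adds no system message.
        return list(messages)
    return [{"role": "system", "content": system_text}] + list(messages)
```

Returning a new list leaves the caller's original messages unchanged, which makes it easy to compare runs with and without a system message.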
Examples:
python inference.py --model_name Qwen/Qwen3-1.7B --message_file messages.json --system_message_file system.txt --device cpu
python inference.py --model_name Qwen/Qwen3-1.7B
License: MIT