If I add the --thinking flag to the command below, the client doesn't receive the opening <think> tag from the model.
For tool calls, the client gets the right output at first, but the model then doesn't see in its history that it sent the first line or so. It then starts intentionally omitting the first line to match what it sees in its history and assumes worked.
Without thinking, tool calls succeed and the output looks correct in the client.
./SwiftLM \
--model ~/.lmstudio/models/mlx-community/Qwen3.6-27B-OptiQ-4bit/ \
--host 0.0.0.0 \
--port 11234 \
--api-key "$(head -n 1 ~/.config/agents/tokens.txt | tr -d ' \t\r\n')" \
--gpu-layers auto \
--prefill-size 64 \
--ctx-size 130000
I doubt it's relevant, but I'm on a Mac Studio M2 Max with 64 GB (also, --prefill-size 128 was too high and I was getting GPU timeouts).
Side note: vllm-mlx has the same problem, so I wonder if it's a config or usage issue with MLX. llama w/ Metal and LM Studio (GGUF or MLX) don't have this problem.
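To make the symptom concrete, here is a rough client-side sketch of how a missing opening tag could be detected and patched before the text is stored in the conversation history. This is not part of SwiftLM or any client; `restore_think_tag` is a hypothetical helper, and it assumes the closing </think> tag still arrives even when the opening <think> tag is dropped, which I have not confirmed.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def restore_think_tag(content: str) -> str:
    """Prepend <think> when a reply has </think> with no opening tag before it.

    Hypothetical workaround: assumes the server drops only the opening tag
    and still sends the closing one.
    """
    close_at = content.find(CLOSE_TAG)
    if close_at == -1:
        # No closing tag at all: nothing to repair.
        return content
    open_at = content.find(OPEN_TAG)
    if open_at != -1 and open_at < close_at:
        # The opening tag is already present before the closing tag.
        return content
    # Closing tag without a preceding opening tag: restore it at the start.
    return OPEN_TAG + content


if __name__ == "__main__":
    print(restore_think_tag("some reasoning</think>final answer"))
```

If something like this were applied to each assistant message before it goes back into the history, the model would at least see a consistent record of what it sent. It would only mask the problem on the client side, not fix whatever strips the tag on the server.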