Description
For some months I have been pondering which LLM to run (when I have time). I have watched new models, new ways of running them (MTP), and new distilled versions get announced, and I have read a lot of Reddit posts from people wanting to do similar things with similar hardware. So I have a reasonable idea of what might be best for my hardware.
The output is below...
On the first run it failed to download some data and printed errors (the fetch failures at the top of the output below). When I ran it again it gave no errors, but the results were exactly the same.
A few things in the results that stood out:
- The specific use case(s) matter, but there is no way for me to state that I want e.g. agentic coding.
- It recommends Qwen3.6 27B dense rather than Qwen3.6 35B A3B MoE, which would run much better with hybrid inference.
- No TPS (tokens per second) estimates, which are absolutely essential for evaluating LLMs: 35 TPS vs. 1 TPS is a huge difference.
- Q8 rather than Q5 or Q6? Really?
- No MTP evaluations
- No evaluations of distilled versions
- No data on which runner should be used with which parameters
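To illustrate why the TPS and quantization points matter, here is a rough back-of-the-envelope sketch, not anything whichllm does. It assumes decode speed is bound by memory bandwidth, so each generated token reads every active weight once. The bits-per-weight values are the approximate figures for llama.cpp GGUF quantizations, the 40 GB/s RAM bandwidth is a made-up placeholder rather than a measurement of my machine, and the 3.0B active parameter count is only inferred from the "A3B" name. The function names are hypothetical. It ignores KV cache, prompt processing, and any layers kept on the GPU.

```python
# Rough, memory-bandwidth-bound estimate of decode speed (tokens per second).
# All bandwidth numbers are illustrative assumptions, not measurements.

# Approximate bits per weight for common llama.cpp GGUF quantizations.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.5625, "Q5_K": 5.5, "Q4_K": 4.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size in GB of the weights read to generate one token."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def estimate_tps(active_params_billion: float, quant: str, bandwidth_gbs: float) -> float:
    """Optimistic bound: bandwidth divided by bytes read per token."""
    return bandwidth_gbs / weights_gb(active_params_billion, quant)


if __name__ == "__main__":
    ram_bw = 40.0  # GB/s, assumed system RAM bandwidth (hypothetical)
    dense = estimate_tps(27.8, "Q8_0", ram_bw)  # dense: all 27.8B weights active
    moe = estimate_tps(3.0, "Q5_K", ram_bw)     # MoE: ~3B active (assumed from "A3B")
    print(f"27.8B dense @ Q8_0: {weights_gb(27.8, 'Q8_0'):.1f} GB of weights, ~{dense:.1f} tok/s")
    print(f"~3B-active MoE @ Q5_K: ~{moe:.1f} tok/s")
```

Under these assumptions the dense Q8_0 pick needs roughly 29.5 GB of weights against 6 GB of VRAM, which is consistent with the ~81% offload warning in the output, and the estimate comes out more than ten times slower than the MoE case. Even a crude figure like this in the results table would make the ranking far easier to evaluate.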
Steps to Reproduce
Run whichllm.
Hardware Info
Leaderboard fetch failed: Client error '429 Too Many Requests' for url 'https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents&config=default&split=train&offset=3800&length=100'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
AA Index fetch failed, will use fallback: __NEXT_DATA__ payload not found
╭──────────────────────────────────────────────────────────────────────────────────────────────── Hardware Info ────────────────────────────────────────────────────────────────────────────────────────────────╮
│ GPU 0: NVIDIA RTX A3000 Laptop GPU — 6.0 GB (CUDA 13.2) — BW: N/A │
│ GPU 1: Intel(R) UHD Graphics — shared memory — BW: N/A │
│ CPU: Unknown CPU — 8 cores (AVX2) │
│ RAM: 31.3 GB │
│ Disk free: 224.8 GB │
│ OS: windows │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Recommended Models
┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ # ┃ Model ┃ Params ┃ Quant ┃ Published ┃ Downloads ┃ Score ┃ License ┃
┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ 1 │ Qwen/Qwen3.6-27B │ 27.8B │ Q8_0 │ 2026-04-21 │ 5.2M │ 56.7 │ apache-… │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 2 │ google/gemma-4-31B-it │ 32.7B │ Q6_K │ 2026-03-11 │ 11.3M │ 54.3 │ apache-… │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 3 │ google/gemma-4-26B-A4B-it │ 26.5B │ Q8_0 │ 2026-03-11 │ 11.5M │ 47.5 │ apache-… │
│ │ │ (3.8B… │ │ │ │ │ │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 4 │ Qwen/Qwen3-30B-A3B │ 30.5B │ Q6_K │ 2025-04-27 │ 2.1M │ 47.5 │ apache-… │
│ │ │ (3.0B… │ │ │ │ │ │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 5 │ zai-org/GLM-4.7-Flash │ 31.2B │ Q6_K │ 2026-01-19 │ 1.1M │ 45.8 │ mit │
│ │ │ (12.0… │ │ │ │ │ │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 6 │ Qwen/QwQ-32B │ 32.8B │ Q6_K │ 2025-03-05 │ 62.5K │ 45.4 │ apache-… │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 7 │ openai/gpt-oss-20b │ 21.5B │ Q8_0 │ 2025-08-04 │ 7.9M │ 45.0 │ apache-… │
│ │ │ (3.6B… │ │ │ │ │ │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 8 │ deepseek-ai/DeepSeek-R1-Distill-Qwen-32B │ 32.8B │ Q6_K │ 2025-01-20 │ 608.3K │ 44.6 │ mit │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 9 │ mistralai/Mistral-Small-3.2-24B-Instruct-2506 │ 24.0B │ Q8_0 │ 2025-06-19 │ 632.7K │ 43.9 │ apache-… │
├─────┼───────────────────────────────────────────────┼────────┼────────┼────────────┼───────────┼───────┼──────────┤
│ 10 │ Qwen/Qwen3-14B │ 14.8B │ Q8_0 │ 2025-04-27 │ 1.7M │ 43.3 │ apache-… │
└─────┴───────────────────────────────────────────────┴────────┴────────┴────────────┴───────────┴───────┴──────────┘
Top pick confidence: Low (direct benchmark, gap +2.3, partial offload)
Benchmark reference: 2026-05 curated snapshot; live AA / LiveBench / Aider merged when reachable.
Speed caution: Low-confidence speed estimates in top ranks: #1, #2, #3
Warning #1 Qwen3.6-27B: ~81% of layers will be offloaded to CPU RAM
Warning #2 gemma-4-31B-it: ~79% of layers will be offloaded to CPU RAM
Warning #3 gemma-4-26B-A4B-it: ~78% of layers will be offloaded to CPU RAM
Python Version
3.14
Operating System
Windows 11
whichllm Version
0.5.7