Problem
The codebase works, but several core modules have accumulated too much responsibility in single files. This is intentionally blunt: parts of the implementation now read like AI slop, with CLI orchestration, external I/O, domain rules, scoring heuristics, rendering, and curated data all sitting in the same files.
That makes future ranking changes riskier than they need to be. Small edits to benchmark logic, model fetching, output formatting, or CLI behavior can have a wide blast radius because the boundaries are not clean.
Current hotspots
Measured with wc -l:
1087 src/whichllm/cli.py
880 src/whichllm/models/fetcher.py
835 src/whichllm/output/display.py
803 src/whichllm/engine/ranker.py
751 src/whichllm/models/benchmark.py
404 src/whichllm/constants.py
Tests with similar size pressure:
574 tests/test_cli.py
553 tests/test_ranker.py
500 tests/test_p1_p3_regressions.py
459 tests/test_r3_regressions.py
Suggested direction
Refactor toward clearer responsibility boundaries:
- Keep cli.py as a thin Typer layer: option parsing, validation, and command dispatch only.
- Add application/use-case modules for command flows, e.g. recommend, plan, upgrade, run, snippet, and hardware.
- Split ranker.py into candidate generation, filtering, scoring, family deduplication, and ranking orchestration.
- Split benchmark.py into benchmark fetching, source normalization, score lookup, and evidence resolution.
- Split fetcher.py into HuggingFace client calls, model parsing, GGUF extraction, MoE/param overrides, and serialization.
- Split display.py by output surface: ranking, plan, upgrade, JSON, and shared formatting helpers.
- Move large curated registries out of constants.py where practical, or at least separate GPU, quantization, lineage, and model override data.
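To make the intended shape concrete, here is a minimal sketch of how the ranking stages and a use-case function could be separated. It is not taken from the current code: Candidate, its fields, filter_by_vram, score, dedupe_by_family, rank, and recommend are all hypothetical names, and the scoring rule is a placeholder rather than the project's real heuristic. The point is only that each stage is a small function and that I/O and rendering are passed in instead of being called inline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; the real model objects in whichllm will differ.
    name: str
    family: str
    vram_gb: float
    benchmark_score: float


def filter_by_vram(candidates: Iterable[Candidate], available_gb: float) -> list[Candidate]:
    """Filtering stage: keep only candidates that fit in the available memory."""
    return [c for c in candidates if c.vram_gb <= available_gb]


def score(candidate: Candidate) -> float:
    """Scoring stage: placeholder heuristic, not the project's real formula."""
    return candidate.benchmark_score


def dedupe_by_family(
    candidates: Iterable[Candidate],
    scorer: Callable[[Candidate], float],
) -> list[Candidate]:
    """Family deduplication stage: keep the best-scoring candidate per family."""
    best: dict[str, Candidate] = {}
    for c in candidates:
        current = best.get(c.family)
        if current is None or scorer(c) > scorer(current):
            best[c.family] = c
    return list(best.values())


def rank(
    candidates: Iterable[Candidate],
    available_gb: float,
    scorer: Callable[[Candidate], float] = score,
) -> list[Candidate]:
    """Orchestration: compose the stages, then sort best-first."""
    fitting = filter_by_vram(candidates, available_gb)
    unique = dedupe_by_family(fitting, scorer)
    return sorted(unique, key=scorer, reverse=True)


def recommend(
    load_models: Callable[[], list[Candidate]],
    available_gb: float,
    render: Callable[[list[Candidate]], str],
) -> str:
    """Use-case function a thin CLI command could call; I/O and rendering are injected."""
    return render(rank(load_models(), available_gb))
```

With this layout a Typer command would only parse options and call recommend(...), and a test can exercise the whole flow with an in-memory list and a trivial renderer, without network access or Rich.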
Acceptance criteria
- Core runtime modules are below 400 lines where practical, excluding intentionally data-only files.
- CLI command functions do not contain model loading, benchmark fetching, ranking, or rendering logic inline.
- Ranking behavior is preserved by existing tests.
- The current regression suite still passes.
- New module boundaries make it possible to modify benchmark evidence, scoring, or Rich output without touching unrelated command code.
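The size criterion could be checked mechanically rather than by eye. The following is a rough sketch, assuming the 400-line budget from the list above and an explicit exemption set for data-only files; count_lines, over_budget, LINE_BUDGET, and the exempt parameter are hypothetical names, not existing project tooling.

```python
from pathlib import Path

LINE_BUDGET = 400  # threshold taken from the acceptance criteria


def count_lines(path: Path) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    with path.open("rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(65536), b""))


def over_budget(
    root: Path,
    budget: int = LINE_BUDGET,
    exempt: frozenset[str] = frozenset(),
) -> list[tuple[str, int]]:
    """Return (relative path, line count) for .py files at or above the budget, largest first."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root).as_posix()
        if relative in exempt:
            continue  # intentionally data-only files are excluded from the check
        lines = count_lines(path)
        if lines >= budget:
            offenders.append((relative, lines))
    return sorted(offenders, key=lambda item: item[1], reverse=True)
```

Calling over_budget(Path("src/whichllm")) from a test or CI step would list the remaining hotspots. Which files count as data-only is a judgment call that the exempt set makes explicit.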
Non-goals
- Do not change ranking behavior as part of the first cleanup unless required by extraction.
- Do not reformat the entire codebase mechanically.
- Do not remove existing regression tests; split them only if it improves maintainability.