Unslop the codebase: pay down ai slop in oversized core modules #41

@Andyyyy64

Problem

The codebase works, but several core modules have accumulated too much responsibility in single files. This is intentionally blunt: parts of the implementation now read like AI slop, with CLI orchestration, external I/O, domain rules, scoring heuristics, rendering, and curated data all sitting in the same files.

That makes future ranking changes riskier than they need to be. Small edits to benchmark logic, model fetching, output formatting, or CLI behavior can have a wide blast radius because the boundaries are not clean.

Current hotspots

Measured with wc -l:

1087  src/whichllm/cli.py
 880  src/whichllm/models/fetcher.py
 835  src/whichllm/output/display.py
 803  src/whichllm/engine/ranker.py
 751  src/whichllm/models/benchmark.py
 404  src/whichllm/constants.py

Tests with similar size pressure:

574  tests/test_cli.py
553  tests/test_ranker.py
500  tests/test_p1_p3_regressions.py
459  tests/test_r3_regressions.py

Suggested direction

Refactor toward clearer responsibility boundaries:

  • Keep cli.py as a thin Typer layer: option parsing, validation, and command dispatch only.
  • Add application/use-case modules for command flows, e.g. recommend, plan, upgrade, run, snippet, and hardware.
  • Split ranker.py into candidate generation, filtering, scoring, family deduplication, and ranking orchestration.
  • Split benchmark.py into benchmark fetching, source normalization, score lookup, and evidence resolution.
  • Split fetcher.py into Hugging Face client calls, model parsing, GGUF extraction, MoE/param overrides, and serialization.
  • Split display.py by output surface: ranking, plan, upgrade, JSON, and shared formatting helpers.
  • Move large curated registries out of constants.py where practical, or at least separate GPU, quantization, lineage, and model override data.
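
As a rough illustration of the boundaries described above, here is a minimal sketch of a ranking pipeline split into stages, plus a use-case function that a thin CLI command could call. Every name in it (Candidate, filter_candidates, score, dedupe_by_family, rank, recommend) is hypothetical and does not mirror the current whichllm code; the scoring heuristic is a placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable

# All names here are hypothetical; they only illustrate the proposed split.


@dataclass(frozen=True)
class Candidate:
    model_id: str
    family: str
    vram_gb: float
    benchmark: float


def filter_candidates(candidates: Iterable[Candidate], vram_budget_gb: float) -> list[Candidate]:
    """Filtering stage: drop models that do not fit the memory budget."""
    return [c for c in candidates if c.vram_gb <= vram_budget_gb]


def score(candidate: Candidate) -> float:
    """Scoring stage: placeholder heuristic that just returns the benchmark value."""
    return candidate.benchmark


def dedupe_by_family(
    candidates: Iterable[Candidate], score_fn: Callable[[Candidate], float]
) -> list[Candidate]:
    """Family deduplication stage: keep the best-scoring candidate per family."""
    best: dict[str, Candidate] = {}
    for candidate in candidates:
        current = best.get(candidate.family)
        if current is None or score_fn(candidate) > score_fn(current):
            best[candidate.family] = candidate
    return list(best.values())


def rank(
    candidates: Iterable[Candidate],
    vram_budget_gb: float,
    score_fn: Callable[[Candidate], float] = score,
) -> list[Candidate]:
    """Orchestration stage: filter, deduplicate, then sort by descending score."""
    fitting = filter_candidates(candidates, vram_budget_gb)
    unique = dedupe_by_family(fitting, score_fn)
    return sorted(unique, key=score_fn, reverse=True)


def recommend(
    load_candidates: Callable[[], Iterable[Candidate]],
    render: Callable[[list[Candidate]], str],
    vram_budget_gb: float,
) -> str:
    """Use-case layer: the CLI command would only parse options and call this."""
    return render(rank(load_candidates(), vram_budget_gb))
```

Because loading and rendering are passed in as callables, a test can exercise recommend with an in-memory list and a trivial renderer, without network access or Rich output.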

Acceptance criteria

  • Core runtime modules are below 400 lines where practical, excluding intentionally data-only files.
  • CLI command functions do not contain model loading, benchmark fetching, ranking, or rendering logic inline.
  • Ranking behavior is preserved by existing tests.
  • The current regression suite still passes.
  • New module boundaries make it possible to modify benchmark evidence, scoring, or Rich output without touching unrelated command code.
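
The line-count criterion could be checked automatically. The following is a minimal sketch, assuming a 400-line budget and a hypothetical allowlist of data-only files; the names MAX_LINES, DATA_ONLY, and oversized_modules are illustrative, and which files count as data-only is a decision for the maintainers.

```python
from __future__ import annotations

from pathlib import Path

MAX_LINES = 400  # budget taken from the acceptance criteria
DATA_ONLY = {"constants.py"}  # hypothetical allowlist of data-only files


def count_lines(path: Path) -> int:
    """Count lines in a file; matches wc -l for files that end with a newline."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def oversized_modules(
    root: Path,
    max_lines: int = MAX_LINES,
    exempt: frozenset[str] | set[str] = DATA_ONLY,
) -> list[tuple[str, int]]:
    """Return (relative path, line count) for modules at or above the budget."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        if path.name in exempt:
            continue
        lines = count_lines(path)
        if lines >= max_lines:
            offenders.append((path.relative_to(root).as_posix(), lines))
    return offenders
```

A test could call oversized_modules(Path("src/whichllm")) and assert that the result is empty once the refactor lands; until then, the returned list doubles as a to-do list.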

Non-goals

  • Do not change ranking behavior as part of the first cleanup unless required by extraction.
  • Do not reformat the entire codebase mechanically.
  • Do not remove existing regression tests; split them only if it improves maintainability.
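
If extra protection against accidental ranking changes is wanted during extraction, one option is a snapshot check that sits alongside the existing regression tests. This is only a sketch: snapshot_matches is a hypothetical helper, and the list of ranked model IDs would come from whatever the ranking entry point returns for a fixed input.

```python
from __future__ import annotations

import json
from pathlib import Path


def snapshot_matches(ranked_ids: list[str], snapshot_path: Path) -> bool:
    """Compare a ranked ID list with a stored snapshot.

    On the first run the snapshot file does not exist, so it is written
    and the check passes. Later runs compare against the stored order.
    """
    if not snapshot_path.exists():
        snapshot_path.write_text(json.dumps(ranked_ids, indent=2), encoding="utf-8")
        return True
    expected = json.loads(snapshot_path.read_text(encoding="utf-8"))
    return ranked_ids == expected
```

Recording the snapshot before the first extraction and asserting against it after each step would flag any reordering, including ones the existing tests do not cover.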
