
AICoderMap

License: MIT · Live demo

Decide which coding model to use, in seconds. Ranked by your priorities, source contradictions surfaced, local options labeled with the exact quant that fits your GPU. Available in English and Turkish in the live UI.


Why AICoderMap?

A new coding LLM ships every two weeks — Opus 4.7, Kimi K2.6, Qwen3.6-27B, DeepSeek V4 in just the last month. When you actually need to pick one, today's trackers leave you stuck:

  • artificialanalysis.ai / llm-stats / BenchLM force you into hours of review — no opinion, conflicting scores shown side by side, and the interpretation is left to you.
  • aider.chat has not updated since November 2025 (5 months stale). You are deciding from rotten data.
  • No Turkish coverage anywhere. Translating each global benchmark page is a separate time tax.

AICoderMap answers the questions that actually shape the decision:

Question AICoderMap's answer
"Which model fits my workflow?" Slide the weights to your priorities (SWE-focused, agentic-focused, balanced, or custom) — ranking updates instantly. Four built-in presets plus your own custom mix.
"SWE-Verified 87 vs SWE-Pro 64 — which is real?" Every score carries a ⚠ / 🚨 flag when sources disagree. The tooltip lists each source (Anthropic / Scale SEAL / community) with its tier (S = self-reported, I = independent, C = community). Decide on raw evidence, not inflated headlines.
"What runs on my RTX 3070?" WebGPU auto-detects your hardware on page load. Every local model gets a label like "Fits (10 GB · UD-IQ2_XXS)" — exact quant name plus GB. For models that overflow, you also see "+3 GB RAM" offload recommendations.
"How fresh is the data?" Each row shows its last-updated date. Refreshed every 14 days at most — none of Aider's 5-month staleness.
"I want to share this on Twitter/Discord." One click → PNG export for a card, the comparison table, or the full page. No screenshot fiddling.
"I want to read it in Turkish." Every UI label and every benchmark description is written in TR — no translation tax to use the tool.

🚀 Status

Pre-launch — implementation in progress. 5-week solo part-time plan (M1 Foundation → M5 Launch).

Document What it covers
PRD Product Requirements (users, features, metrics)
TechSpec Technical Specification (architecture, data, API, security)
ImplGuide ⭐ Coding-ready implementation guide
Tasks 23-task / 5-milestone breakdown
Workflow Update workflow (14 happy-path + 5 exception steps)
Pitch Short pitch — for sharing

🛠️ Stack

Single external service: GitHub Pages. Ongoing cost: $0.

  • Vanilla HTML/CSS/JS (no build step, no framework)
  • Static JSON data files (the skill regenerates them)
  • WebGPU API (browser-native GPU detect)
  • html2canvas (vendored, PNG export)
  • Local Claude Code skill + research agent (manual update workflow)

🧰 Development setup

A fresh clone needs only Python 3.10+, Node 18+ (for the linter), and Git Bash on Windows (or any POSIX shell). No package manager, no build step.

Prerequisites:

  • Python 3.10+ (uses re.fullmatch, walrus operator, modern type hints) — standard library only, no pip install needed.
  • Node 18+ — only for scripts/regex-lint.js (validates the regex corpus). Skip if you don't touch the regex library.
  • Git Bash / WSL / Linux / macOS for auto/bench.sh. On native Windows cmd.exe, use auto\bench.bat instead.
  • Claude Code (any recent version) for /aicodermap skill triggers — the skill + agent are project-scoped and load automatically when you open this directory.

Local commands (no install required, all stdlib / vendored):

# Run the ds-tune evaluation (slug correctness + coverage proxy):
bash auto/bench.sh        # POSIX
auto\bench.bat            # Windows native

# Update verification map from the latest agent artifact:
python scripts/verification-map.py update

# Bootstrap verification map from accumulated sources.json (one-shot):
python scripts/verification-map.py bootstrap

# Merge an agent run into data/* (called by the skill, but also runnable standalone):
python scripts/merge.py

# Lint the regex corpus (Node):
node scripts/regex-lint.js

# Migrate schema (rare, one-shot):
python scripts/migrate-schema.py

# OCR images embedded in vendor blog posts:
python scripts/extract-images.py <url>

# Regenerate sources-whitelist.json `format` keys from the schema:
node scripts/whitelist-format-migration.js

Live preview: open index.html in any modern browser, no server needed. The site reads data/*.json over file://. For deploy verification, use https://sungurerdim.github.io/aicodermap/.

What is NOT in the repo (gitignored, regenerated on demand):

  • .aicodermap-agent-out.json — last research-agent return (overwritten every cycle)
  • .aicodermap-verification-map.json — cross-cycle confirmed-cell cache (run python scripts/verification-map.py bootstrap to rebuild from data/sources.json)
  • .aicodermap-images/ — temporary PNG downloads for OCR
  • *.bak, *.bak2, *.bak3 — rotated backups created by merge.py
  • auto/run.log — eval output (regenerated by bench.sh/bench.bat)
  • .ruff_cache/, __pycache__/ — linter / Python bytecode caches

Everything else (skill, agent, scripts, data, vendor JS, docs, i18n, auto/ folder including fixtures + program.md + results.tsv) is tracked. A fresh clone is reproducible end-to-end.


📋 Roadmap (5 weeks)

  • M1 Foundation (Week 1) — Repo + 4 JSON schemas + research agent
  • M2 Core (Week 2) — Live tracker static render, TR/EN toggle
  • M3 Integration (Week 3) — 13 must-have features (weights editor + GPU VRAM + contradiction flags + PNG export). Cross-source benchmark coverage climbs with every refresh — the floor entering M4 is 30 %, target ≥ 95 %.
  • M4 Polish (Week 4) — SEO (sitemap, robots, JSON-LD, hreflang, OG/Twitter), a11y skip-link + focus rings, mobile card-stack, doc drift sweep, smoke harness.
  • M5 Launch (Week 5) — Simultaneous TR + Global soft launch + 2-week validation

🔗 Shareable deep-links

Every page state is reflected in the URL — copy the address bar (or click Copy share link in the Export section) and you've shared the exact ranking the other person will see. The state schema is human-readable and stable; CLI consumers can construct URLs by hand.

Param Values
lang tr | en
theme dark | light
preset balanced | swe-focused | agentic-focused | reasoning-focused | benchmark-only | custom
w comma-separated benchKey:weight pairs; honoured only when preset=custom
tier frontier | open-flagship | coder-specialized | gemma | ollama-local | all
deployment all | cloud | local
provider vendor name (URL-encoded) | all
vram integer GB (1..256)
gpu webgpu vendor key (e.g. nvidia.rtx-4090) | auto
open 1 | 0 (open-license-only filter)
search substring (URL-encoded)
sort <columnKey>-<asc|desc> (e.g. swePro-desc, composite-asc)

Example deep-link — Turkish UI, SWE-focused preset, only models that fit a 16 GB RTX 4080, sorted by SWE-bench Pro descending:

https://sungurerdim.github.io/aicodermap/?lang=tr&preset=swe-focused&deployment=local&vram=16&sort=swePro-desc
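
For scripted use, the same kind of link can be assembled with the Python standard library. A minimal sketch using only the parameters documented above; the weight pairs are illustrative:

# Build a shareable deep-link from the documented query parameters.
from urllib.parse import urlencode

BASE = "https://sungurerdim.github.io/aicodermap/"

params = {
    "lang": "tr",
    "preset": "custom",
    "w": "swePro:0.5,lcb:0.3,tb2:0.2",   # honoured only when preset=custom
    "deployment": "local",
    "vram": 16,
    "open": 1,
    "sort": "swePro-desc",
}
# keep ':' and ',' unescaped so the w pairs stay human-readable
print(BASE + "?" + urlencode(params, safe=":,"))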

🧪 Programmatic Access (CLI / agent friendly)

The site is a static GitHub Pages deploy backed by stable JSON files. Anything that can run curl and jq can consume it; no auth, no rate limit, no WAF. The schemas are documented in docs/TECHSPEC.md §3.

BASE=https://sungurerdim.github.io/aicodermap

# All models, just id + provider + tier + composite-relevant scores:
curl -s "$BASE/data/models.json" | jq '
  [.[] | {id, name, provider, tier, open,
          swePro: .bench.swePro, sweV: .bench.sweV,
          lcb:    .bench.lcb,    tb2:  .bench.tb2,
          priceIn: .pricing.api[0].in, priceOut: .pricing.api[0].out}]
'

# Top-10 frontier models by SWE-bench Pro:
curl -s "$BASE/data/models.json" | jq '
  [.[] | select(.tier=="frontier" and .bench.swePro != null)
       | {id, swePro: .bench.swePro, priceIn: .pricing.api[0].in}]
  | sort_by(-.swePro) | .[:10]
'

# Models that fit a 16 GB GPU (vramRequirement <= 16; cloud-only models with a null requirement are excluded):
curl -s "$BASE/data/models.json" | jq '
  [.[] | select(.vramRequirement != null and .vramRequirement <= 16)
       | {id, vram: .vramRequirement, license, swePro: .bench.swePro}]
'

# Cross-source contradictions for a model (≥3pp delta):
curl -s "$BASE/data/sources.json" | jq '
  to_entries | map(select(
    (.key | startswith("opus-4-7."))
    and ([.value[].value] | (max - min)) >= 3
  ))
'

# Pull provenance for a single (model, bench) cell:
curl -s "$BASE/data/sources.json" | jq '."opus-4-7.swePro"'

A consumer that wants the same view a shareable URL produces can fetch that URL directly and read the <script type="application/ld+json"> block — it contains the Dataset schema, with distribution[] pointing at the three canonical JSON files.

# Discover the dataset distribution from JSON-LD:
curl -s "$BASE/" | grep -oP 'application/ld\+json[^<]*<[^>]*>([\s\S]+?)</script>' | head -200

📚 Data Sources

Every value the tracker shows comes from one of the sources below. Each value carries a trustScore based on its tier (I > S > C > U), the number of confirming sources, and recency. Independent sources outweigh vendor self-reports; community sources are used only when no independent or official source exists; forum/social signals are never written into the data.

I-tier — Independent benchmarks & leaderboards

Source URL Authority for
Scale SEAL labs.scale.com/leaderboard SWE-bench Pro (1865 tasks), HLE
SWE-bench (canonical) swebench.com · github.com/SWE-bench/experiments SWE-bench Verified, full SWE-bench
LiveCodeBench livecodebench.github.io · livecodebench.com LCB v6 (contamination-free)
Terminal-Bench tbench.ai · terminal-bench.io TB2 agentic execution
tau-bench tau-bench.dev tau2 agentic API-use
Aider Polyglot aider.chat/docs/leaderboards aider (warn: stale since Nov 2025)
MCP-Atlas mcp-atlas.dev mcpA tool-chain quality
Artificial Analysis artificialanalysis.ai/leaderboards aaIdx, aaCoding, aaAgentic, throughput, pricing
Vellum Leaderboard vellum.ai/llm-leaderboard independent SWE-V, GPQA, cost+latency
llm-stats llm-stats.com broad model catalog
LMArena lmarena.ai blind human preference (formerly LMSYS)
LiveBench livebench.ai contamination-resistant rotating evals
Berkeley BFCL gorilla.cs.berkeley.edu function-calling v3/v4
BigCodeBench bigcode-bench.github.io · HF leaderboard code generation gold standard
EvalPlus evalplus.github.io HumanEval+ / MBPP+ rigorous
HF Open LLM Leaderboard huggingface.co/spaces/open-llm-leaderboard open-weight canonical aggregation
Klu.ai klu.ai/llm-leaderboard broader benchmark aggregator
Papers with Code paperswithcode.com/area/code-generation peer-reviewed leaderboards
arXiv arxiv.org original benchmark papers
BenchLM benchlm.ai verified vs provisional transparency; ProgramBench tracker
ProgramBench programbench.com · arXiv 2605.03546 cleanroom program reconstruction (Meta + Stanford + Harvard, 2026-05-05)
AgentBench agentbench.ai multi-domain agentic
MathArena matharena.ai AIME math reasoning (auxiliary)
Vals.ai vals.ai/benchmarks enterprise-gated benchmark sets
LMMarketCap lmmarketcap.com hourly market table

I-tier — Multi-provider pricing & availability

Models are often hosted on multiple providers at different prices. The tracker shows per-provider pricing in each card and a price range in the comparison table — these are the sources surveyed.

Source URL Extracts
OpenRouter openrouter.ai provider count, uptime%, alt pricing, throughput
Together AI together.ai/models quant variants, $/1M, batch tier
Fireworks AI fireworks.ai/models tier, throughput, batch pricing
DeepInfra deepinfra.com/models $/1M, throughput
Groq console.groq.com/docs/models · groq.com/pricing extreme-fast inference rates
Cerebras inference-docs.cerebras.ai · cerebras.ai/inference ultra-fast inference
SambaNova Cloud cloud.sambanova.ai/models catalog, throughput
Replicate replicate.com open-weight hosting, $/sec
Lepton AI lepton.ai/pricing enterprise pricing
Novita AI novita.ai/model-api catalog + pricing
SiliconFlow siliconflow.cn/models Chinese providers — Qwen / DeepSeek / MiMo
Anyscale anyscale.com/endpoints enterprise endpoints
Cloudflare Workers AI developers.cloudflare.com/workers-ai/models edge regions, free tier
AWS Bedrock aws.amazon.com/bedrock enterprise + region matrix
Azure AI Foundry ai.azure.com/explore/models enterprise + region
HuggingFace Inference Endpoints huggingface.co author canonical card
OpenCode Zen / Go opencode.ai edge endpoints, latency
Lambda Cloud lambda.ai/inference enterprise throughput
Tensorix tensorix.ai infrastructure / niche frontier hosting
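
The price range shown in the comparison table can be reproduced from models.json. A minimal sketch, assuming pricing.api is a list of per-provider offers with in/out $/1M fields as in the jq examples above; exact field names may differ:

# Price spread across providers for each model, from data/models.json.
import json

models = json.load(open("data/models.json", encoding="utf-8"))
for m in models:
    offers = m.get("pricing", {}).get("api") or []
    prices_in = [o["in"] for o in offers if o.get("in") is not None]
    if len(prices_in) > 1:
        print(f'{m["id"]:30s} in: ${min(prices_in):.2f}..${max(prices_in):.2f} per 1M tokens across {len(prices_in)} providers')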

I-tier — Local runtimes, quants, GPU compatibility

Source URL Extracts
Ollama Library ollama.com/library tags, pullCount, architecture, parameters, license, releasedISO
HuggingFace Unsloth huggingface.co/unsloth UD dynamic quants (UD-IQ1_S → UD-Q4_K_XL)
HuggingFace bartowski huggingface.co/bartowski most-active quant maintainer
HuggingFace mradermacher huggingface.co/mradermacher high-quality quant set
HuggingFace lmstudio-community huggingface.co/lmstudio-community LM Studio-curated GGUFs
LM Studio lmstudio.ai/models desktop catalog
llama.cpp github.com/ggerganov/llama.cpp/discussions empirical VRAM/throughput data
MLX (Apple Silicon) huggingface.co/mlx-community mlx-quantized variants
vLLM docs.vllm.ai/en/latest/models/supported_models server-side support matrix
sglang github.com/sgl-project/sglang structured-output throughput
llmfit github.com/AlexsJones/llmfit 148-model HF curated DB (mirrored locally)
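
The quant and VRAM figures from these sources feed the fit labels described earlier ("Fits (… GB · quant)", "+N GB RAM"). A minimal sketch of that fit decision against the vramRequirement field in models.json; the quant-selection step is omitted and the offload rule is an assumption:

# Sketch of the local-fit label: compare vramRequirement (GB) against detected VRAM
# and suggest RAM offload for the overflow. The real UI also picks a concrete quant
# (e.g. UD-IQ2_XXS); that part is not shown here.
import json

def fit_label(vram_required_gb: float | None, gpu_vram_gb: float) -> str:
    if vram_required_gb is None:
        return "cloud-only"
    if vram_required_gb <= gpu_vram_gb:
        return f"Fits ({vram_required_gb:g} GB)"
    overflow = vram_required_gb - gpu_vram_gb
    return f"Needs +{overflow:g} GB RAM offload"

models = json.load(open("data/models.json", encoding="utf-8"))
for m in models:
    print(f'{m["id"]:30s} {fit_label(m.get("vramRequirement"), gpu_vram_gb=16)}')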

S-tier — Vendor official sources

These are the canonical announcements, model cards, pricing pages, and API docs from each model's maker. Used as primary source for release dates, license, context window, and API pricing — and as cross-check against independent benchmarks.

Vendor Sources
Anthropic anthropic.com/news · docs.claude.com · pricing
OpenAI openai.com/blog · platform.openai.com/docs/models
Google DeepMind deepmind.google/discover · ai.google.dev/gemini-api/docs/models
Mistral mistral.ai/news · docs.mistral.ai
DeepSeek deepseek.com/news · api-docs.deepseek.com
xAI x.ai/news · docs.x.ai/docs/models
Alibaba (Qwen) qwenlm.github.io/blog · qwen-lm.github.io
Moonshot (Kimi) kimi.com/blog · platform.moonshot.cn
Z.ai (GLM) z.ai/news · docs.z.ai
Xiaomi (MiMo) mimo.xiaomi.com · xiaomimimo.github.io
MiniMax minimaxi.com/news · platform.minimaxi.com
Nvidia build.nvidia.com · blogs.nvidia.com
Meta (Llama) huggingface.co/meta-llama · ai.meta.com/blog
Google (Gemma) huggingface.co/google · ai.google.dev/gemma
StepFun stepfun.com
All Hands AI (Devstral) all-hands.dev

C-tier — Aggregators & expert commentary (used only when I/S sources absent)

ApiDog Blog · The Decoder · DataCamp Blog · Build Fast With AI · Simon Willison · Latent Space · Swyx · Awesome-LLM · Awesome-Efficient-LLM · r/LocalLLaMA (community VRAM reports only) · Design Arena (UI auxiliary)

Trust hierarchy at a glance

trustScore = tierWeight × min(verifications, 3)/3 × recencyDecay(date)

  I-tier (independent)  weight 1.0
  S-tier (vendor)       weight 0.7
  C-tier (community)    weight 0.4
  U-tier (forum/social) weight 0.1   ← never written, cross-check only

  recency: <30d=1.0 · <90d=0.85 · <180d=0.70 · <365d=0.50 · ≥365d=0.30

When two sources disagree on a value, the one with the higher trustScore wins. The losing value still appears in data/sources.json with its tier and score, so you can audit every decision.
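
In Python the same rule reads as follows; this is a direct transcription of the weights and thresholds above, while the date handling and the example values are assumptions:

from datetime import date

TIER_WEIGHT = {"I": 1.0, "S": 0.7, "C": 0.4, "U": 0.1}

def recency_decay(observed: date, today: date | None = None) -> float:
    age = ((today or date.today()) - observed).days
    if age < 30:  return 1.0
    if age < 90:  return 0.85
    if age < 180: return 0.70
    if age < 365: return 0.50
    return 0.30

def trust_score(tier: str, verifications: int, observed: date) -> float:
    return TIER_WEIGHT[tier] * min(verifications, 3) / 3 * recency_decay(observed)

# Two sources disagreeing on the same cell (values are hypothetical):
vendor      = ("S", 1, date(2026, 1, 10), 87.0)   # self-reported
independent = ("I", 2, date(2026, 2, 1),  64.0)   # independent benchmark
winner = max((vendor, independent), key=lambda s: trust_score(*s[:3]))
print(winner)   # the losing value still lands in data/sources.json for auditing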


🤝 Contributing

Currently pre-launch / solo development. Issues and discussions will open in Phase 2. For now:

  • ⭐ Star the repo to follow progress
  • 🐛 Open an issue for benchmark-data corrections (after launch)
  • 💡 Open a discussion for feature requests

📜 License

MIT — see LICENSE. Code and data are public; attribution appreciated; no takedown power (public benchmark data).


🧠 Built with

A reusable Claude Code skill + research-agent template — domain-agnostic, cloneable to other tracker projects.

Author: Sungur Erdim · sungurerdim@gmail.com
