Live, open-source benchmark for comparing AI coding agents on real GitHub issues
⭐ Star this repo to bookmark — fresh data every 15 minutes
A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.
This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source — new items added, expired items removed — so you can rely on what you see being current.
⏰ Last updated: 2026-05-25 17:45 UTC
Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.
| # | Name | ⭐ | Lang | Updated | Description |
|---|---|---|---|---|---|
| 1 | promptfoo/promptfoo | 21584 | TypeScript | 2026-05-25 | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C |
| 2 | MyForgeLabs/myforge-vault-1111 | 0 | HTML | 2026-05-25 | An open-source 8-axis methodology + working tooling for evolving a personal Obsidian-vault into a self-improving knowled |
| 3 | isoc-il-labs/agent-reliability-eval | 0 | HTML | 2026-05-25 | Framework for evaluating the accuracy & reliability of news-credibility agents — QA'd EN/HE test units + harness |
| 4 | Arize-ai/phoenix | 9826 | Python | 2026-05-25 | AI Observability & Evaluation |
| 5 | saddled-panicattack529/idea-evaluation-pipeline | 0 | — | 2026-05-25 | Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist |
| 6 | Kondwani10/Origin-Continuum | 0 | — | 2026-05-25 | 🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation |
| 7 | Kamixon131/claude-config | 0 | — | 2026-05-25 | ⚙️ Enhance Claude Code with a powerful configuration framework that features specialized agents and workflows for effici |
| 8 | Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer- | 0 | — | 2026-05-25 | ♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou |
| 9 | bhavya7995/AI_governance | 1 | PowerShell | 2026-05-25 | 🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a |
| 10 | Phinchanbora/llm-evaluation | 0 | Python | 2026-05-25 | 🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions to assess model performance and enhance AI ev |
| 11 | penpoen/llm-SugarScape | 1 | Python | 2026-05-25 | 🌐 Explore AI behaviors in a Sugarscape simulation, revealing insights into cooperation and survival instincts using Grok |
| 12 | homemade-software-inc/completion-kit | 1 | Ruby | 2026-05-25 | Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c |
| 13 | Giskard-AI/giskard-oss | 5366 | Python | 2026-05-25 | 🐢 Open-Source Evaluation & Testing library for LLM Agents |
| 14 | matt-rachlin/agent-eval-harness | 0 | Python | 2026-05-24 | Online evals on live agent traces. Open-source, self-hostable, OpenTelemetry-native eval harness with regression detecti |
| 15 | harnexa/nexa-gauge | 36 | Python | 2026-05-25 | An graph-eval framework for LLM's |
| 16 | jeremylongshore/j-rig-skill-binary-eval | 0 | TypeScript | 2026-05-24 | Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e |
| 17 | verifywise-ai/verifywise | 287 | TypeScript | 2026-05-25 | Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo |
| 18 | reaatech/rag-eval-pack | 0 | TypeScript | 2026-05-23 | RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with |
| 19 | truera/trulens | 3342 | Python | 2026-05-23 | Evaluation and Tracking for LLM Experiments and AI Agents |
| 20 | Mike-E-Log/learn-ai-eval | 0 | HTML | 2026-05-22 | The Eval Codex — Claude-tutored AI-eval learning engine. Build eval expertise via guided practice. |
| 21 | NoesisVision/nasde-toolkit | 10 | Python | 2026-05-25 | CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind |
| 22 | chquandogong/mission-spec | 0 | TypeScript | 2026-05-25 | Mission Spec: a task contract layer for AI agent workflows |
| 23 | sanya2025/edututor-eval | 0 | Python | 2026-05-21 | A lightweight evaluation framework for AI tutoring responses, built for education-focused LLM systems |
| 24 | reaatech/hybrid-rag-qdrant | 1 | TypeScript | 2026-05-20 | Serious hybrid RAG reference — vector + BM25 + reranker over Qdrant, chunking strategies benchmarked, eval set included, |
| 25 | reaatech/classifier-evals | 0 | TypeScript | 2026-05-20 | Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio |
| 26 | Alexanderk30/context-override-resistance | 0 | Python | 2026-05-19 | RL-style eval measuring intent/action divergence in frontier agents: model acknowledges a correction, then acts on the s |
| 27 | melody-ling-L/eval-resume | 0 | HTML | 2026-05-19 | The first Chinese-language LLM benchmark focused on "resume-rewriting honesty": 20 real anonymized resumes × 3 models × 4 scoring dimensions |
| 28 | melody-ling-L/judgebuddy | 0 | HTML | 2026-05-19 | Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment. |
| 29 | GiuseppeSp/n8n-customer-interview-synthesizer | 0 | — | 2026-05-19 | Multi-agent customer-interview synthesis pipeline in n8n with LLM-as-judge eval, Slack human-in-the-loop approval, and d |
| 30 | reaatech/agent-eval-harness | 0 | TypeScript | 2026-05-20 | End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w |
| 31 | ajmeese7/local-llms | 1 | Python | 2026-05-18 | Use local Large Language Models for production use cases, and perform benchmarking for task-specific performance evaluat |
| 32 | monkeyin92/voice-agent-testops | 0 | TypeScript | 2026-05-18 | Regression testing for voice agents: scripted conversations, safety assertions, CI-ready reports. |
| 33 | gmitt98/fieldtest | 0 | Python | 2026-05-16 | LLM evaluation framework — define what correct, well-formed, and safe means before you measure |
| 34 | verifywise-ai/plugin-marketplace | 3 | TypeScript | 2026-05-15 | VerifyWise AI Governance Plugin Marketplace |
| 35 | AI-QL/tuui | 1146 | TypeScript | 2026-05-14 | A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context |
| 36 | prompt-foundry/typescript-sdk | 6 | TypeScript | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS. |
| 37 | prompt-foundry/python-sdk | 8 | Python | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for Python |
| 38 | mizcausevic-dev/agent-eval-arena | 0 | TypeScript | 2026-05-12 | Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, |
| 39 | fastxyz/skill-optimizer | 57 | TypeScript | 2026-05-25 | Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs |
| 40 | rogue-socket/focusgroup | 0 | Python | 2026-05-11 | Persona-driven dynamic testing for conversational AI products. Focus groups for your agents. |
| 41 | Ruthwik-Data/mechanictrust | 0 | — | 2026-05-11 | AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair. |
| 42 | SAY-5/eval-observability | 0 | Python | 2026-05-10 | Python LLM eval framework with full OTel tracing, structured logs, and daily Welch's-t-test regression detection persist |
| 43 | Ruthwik-Data/finrag-eval | 0 | Python | 2026-05-10 | RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and bu |
| 44 | Ruthwik-Data/self-improving-prompt-agent | 0 | Python | 2026-05-10 | Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 → |
| 45 | SAY-5/genai-eval | 0 | Python | 2026-05-07 | Multilingual GenAI evaluation service across 5 task types and 3 languages, with regression-trend dashboard |
| 46 | HumphreySun98/repoagentbench | 31 | Python | 2026-04-30 | SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: cla |
| 47 | YagneshKhamar/phasio | 0 | TypeScript | 2026-04-29 | Jest-style testing for LLM prompts. Version prompts, run evals across OpenAI and Anthropic, catch regressions in CI. |
| 48 | lehigh-university-libraries/htr | 2 | Go | 2026-05-22 | Handwritten Text Recognition llm eval tool |
| 49 | JSLEEKR/evaltrack | 0 | TypeScript | 2026-04-24 | Local-first regression and trend CLI for promptfoo eval histories — the git log + git diff for LLM eval outputs. |
| 50 | izam-mohammed/ragrank | 47 | Python | 2026-04-21 | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, an |
| 51 | arthursoares/openclaw-llm-bench | 2 | Python | 2026-04-11 | A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-j |
| 52 | YuanyangLiNEU/mini-claude | 0 | TypeScript | 2026-04-11 | A minimal Claude Code built from scratch — agent loop, tool calling, web search, permissions, and a black-box LLM eval h |
| 53 | webrenew/models-dilemma | 4 | TypeScript | 2026-04-08 | The Prisoner's Dilemma played by LLMs |
| 54 | AdirAmsalem/openclaw-eval | 0 | Python | 2026-03-31 | Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, la |
| 55 | Data-ScienceTech/forcefield | 1 | Python | 2026-03-30 | ForceField Python SDK -- AI security in 3 lines of code. Prompt injection detection, PII redaction, security evals, tool |
| 56 | alyssadata/continuity-keys | 1 | — | 2026-03-29 | Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen) |
| 57 | klausners/prompt-optimizer | 0 | TypeScript | 2026-03-26 | Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evalua |
| 58 | Aysnc-Labs/llm-eval | 1 | PHP | 2026-03-20 | A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correc |
| 59 | asarnaout/veritail | 6 | Python | 2026-03-15 | LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues |
| 60 | vola-trebla/llm-infrastructure | 0 | — | 2026-03-14 | Full-stack AI infrastructure - 5 projects from data ingestion to autonomous agents |
| 61 | whitecircle/circle-guard-bench | 69 | Python | 2026-03-07 | First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (g |
| 62 | tpertner/squeeze | 5 | Python | 2026-03-01 | Squeeze your model with pressure prompts to see if its behavior leaks. |
| 63 | grigio/llm-eval-simple | 68 | Python | 2026-02-28 | llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection |
| 64 | QuesmaOrg/BinaryAudit | 90 | Shell | 2026-02-27 | An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries. |
| 65 | paradime-io/dbt-llm-evals | 27 | Python | 2026-02-10 | The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress |
| 66 | Striveworks/valor | 41 | Python | 2026-02-09 | Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models. |
| 67 | TADSTech/llm-output-grader | 0 | Python | 2026-01-24 | systematic llm grading |
| 68 | 3ahmood/Agentic-Author-CrewAI | 1 | Jupyter Notebook | 2026-01-15 | On device autonomous research and content writing using open-sourced LLMs and Crew AI. |
| 69 | Supahands/llm-comparison-backend | 22 | Python | 2026-01-13 | This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be r |
| 70 | thedataquarry/structured-outputs | 28 | Python | 2025-12-23 | Structured output benchmarks comparing DSPy and BAML with different LLMs |
| 71 | higuseonhye/worldsim-eval | 0 | — | 2025-12-20 | Evaluate AI agents by simulating world-level consequences. |
| 72 | yukincom/llm-SugarScape | 6 | Python | 2025-11-28 | Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in |
| 73 | IAAR-Shanghai/GuessArena | 10 | Python | 2025-11-15 | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Re |
| 74 | iltutishrak/eval-metrics-lab | 0 | Python | 2025-11-10 | Text-only playground for evaluating reasoning model outputs with mock accuracy, hallucination, and trust metrics — runs |
| 75 | artefactop/promptdev | 2 | Python | 2025-09-22 | A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers. |
| 76 | multinear/multinear | 45 | Python | 2025-09-02 | Develop reliable AI apps |
| 77 | attogram/ollama-multirun | 16 | Shell | 2025-08-30 | Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance stat |
| 78 | khoj-ai/llm-coup | 12 | TypeScript | 2025-08-18 | Let LLMs play coup with each other and see who's the best at deception & strategy |
| 79 | jaaack-wang/multi-problem-eval-llm | 3 | Jupyter Notebook | 2025-08-08 | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities |
| 80 | alan-turing-institute/prompto | 37 | Python | 2025-07-18 | An open source library for asynchronous querying of LLM endpoints |
| 81 | athina-ai/athina-evals | 300 | Python | 2025-06-06 | Python SDK for running evaluations on LLM generated responses |
| 82 | amplifying-ai/ai-product-bench | 22 | HTML | 2025-05-27 | |
| 83 | regankight/mirror-model-eval-tests | 0 | — | 2025-05-17 | LLM behavior QA: tone collapse, false consent, and reroute logic scoring. |
| 84 | pyladiesams/eval-llm-based-apps-jan2025 | 8 | Jupyter Notebook | 2025-05-06 | Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundatio |
| 85 | daqh/llm-eval | 0 | Python | 2025-03-24 | This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational co |
| 86 | parea-ai/parea-sdk-py | 82 | Python | 2025-02-13 | Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) |
| 87 | parea-ai/parea-sdk-ts | 4 | TypeScript | 2025-01-17 | TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) |
| 88 | yukinagae/genkitx-promptfoo | 7 | TypeScript | 2025-01-03 | Community Plugin for Genkit to use Promptfoo |
| 89 | honeyhiveai/realign | 19 | Python | 2024-12-04 | Realign is a testing and simulation framework for AI applications. |
| 90 | harlev/eva-l | 5 | Python | 2024-11-27 | LLM Evaluation Framework |
| 91 | genia-dev/vibraniumdome | 27 | Python | 2024-10-28 | LLM Security Platform. |
| 92 | Human-Centric-Machine-Learning/prediction-powered-ranking | 9 | Jupyter Notebook | 2024-10-28 | Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024. |
| 93 | yuzu-ai/ShinRakuda | 3 | Python | 2024-09-17 | Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering resear |
| 94 | yukinagae/genkit-promptfoo-sample | 0 | TypeScript | 2024-09-11 | Sample implementation demonstrating how to use Firebase Genkit with Promptfoo |
| 95 | yukinagae/promptfoo-sample | 2 | — | 2024-09-10 | Sample project demonstrates how to use Promptfoo, a test framework for evaluating the output of generative AI models |
| 96 | uptrain-ai/uptrain | 2350 | Python | 2024-08-18 | UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ |
| 97 | prompt-foundry/dotnet-sdk | 0 | — | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET |
| 98 | prompt-foundry/ruby-sdk | 1 | — | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Ruby. |
| 99 | prompt-foundry/kotlin-sdk | 0 | — | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Kotlin. |
| 100 | prompt-foundry/go-sdk | 1 | — | 2024-06-16 | The prompt engineering, prompt management, and prompt evaluation tool for Go. |
Every 15 minutes, a GitHub Action runs `tracker.py`. That script:
- Fetches the latest state from GitHub Search API.
- Diffs against `data/items.json` (the previous snapshot).
- Rewrites the table above between the `<!-- TRACKER_TABLE_* -->` markers.
- Commits `feat: +N added, -M removed (timestamp)` if anything changed.
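
The diff, rewrite, and commit steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `tracker.py`. The exact marker names (`TRACKER_TABLE_START` and `TRACKER_TABLE_END`) are hypothetical, since only the `TRACKER_TABLE_*` pattern is documented. The snapshot layout (a list of dicts with a `name` key) and all function names are likewise assumed.

```python
import re

# Hypothetical marker names: only the <!-- TRACKER_TABLE_* --> pattern is documented.
START = "<!-- TRACKER_TABLE_START -->"
END = "<!-- TRACKER_TABLE_END -->"


def diff_snapshots(previous, current):
    """Return (added, removed) repo names between two snapshots.

    Each snapshot is assumed to be a list of dicts with a "name" key.
    """
    before = {item["name"] for item in previous}
    after = {item["name"] for item in current}
    return sorted(after - before), sorted(before - after)


def replace_table(readme, table):
    """Swap whatever sits between the two markers for the new table."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    block = f"{START}\n{table}\n{END}"
    # A function replacement keeps backslashes in the table from being read as escapes.
    updated, count = pattern.subn(lambda _match: block, readme, count=1)
    if count == 0:
        raise ValueError("table markers not found in README")
    return updated


def commit_message(added, removed, timestamp):
    """Build the commit subject, or return None when nothing changed."""
    if not added and not removed:
        return None
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"
```

Returning `None` when both lists are empty mirrors the "if anything changed" condition: the caller can skip the commit entirely on a quiet tick.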
No external services. No paid APIs. Just a public data source and a free GitHub Action.
See CONTRIBUTING.md — usually you don't need to: the tracker keeps itself current.
If you spot a data-source bug or want to suggest a new column for the table, open an issue.
If you find this useful, you might also like these other auto-updated trackers from the same maintainer (same mechanism, different upstream):
- trending-claude-skills: What's shipping in Claude Skills this week (`topic:claude-skills`)
- mcp-servers-live: Live index of newest MCP servers (`topic:mcp-server`)
- cursor-rules-live: Newest Cursor rules and .cursorrules patterns (`topic:cursor-rules`)
- claude-code-plugin-tracker: Claude Code plugins and hook configs (`topic:claude-code`)
- llm-agents-radar: Newest LLM agent frameworks (`topic:llm-agent`)
- rag-radar: Newest RAG implementations and tools (`topic:rag`)
- llm-eval-tracker: Newest LLM evaluation tools and benchmarks (`topic:llm-eval`)
- agent-framework-radar: Newest agent frameworks shipping on GitHub (`topic:agent-framework`)
- vector-db-live: Newest vector DB projects and integrations (`topic:vector-database`)
- llmops-radar: Newest LLMOps tooling (observability, deployment) (`topic:llmops`)
- prompt-tools-live: Newest prompt-engineering tools and prompt repos (`topic:prompt-engineering`)
- skills-tracker: Tracking new GitHub 'skills' repos (`topic:agent-skills`)
- awesome-agent-skills: Curated auto-updated awesome-list of AI agent skills (`topic:agent-skills`)
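
Each tracker in the list above differs only in its `topic:` qualifier, so the upstream request can be sketched as a single parameterized URL. The endpoint and the `q`, `sort`, `order`, and `per_page` parameters belong to GitHub's repository search API. The specific sort order and page size used by the trackers are assumptions, and `search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def search_url(topic, per_page=100):
    """Build a repository search URL for one topic, most recently updated first."""
    params = {
        "q": f"topic:{topic}",
        "sort": "updated",  # assumed ordering
        "order": "desc",
        "per_page": per_page,  # the API allows at most 100 results per page
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

For example, `search_url("llm-eval")` yields a URL whose query string starts with `q=topic%3Allm-eval`, because `urlencode` percent-encodes the colon.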
MIT — see LICENSE.
- Awesome Agent Skills: Curated, auto-updated awesome-list of vetted AI agent skills with quality ratings for Claude, GPT, and open-source agents (⭐ 0)
- Agent Skills Daily Tracker: Real-time tracking of every new GitHub 'skills' repo to capture the AI agent skill ecosystem trend (⭐ 0)
- Agent Eval Harness: Live, open-source benchmark for comparing AI coding agents on real GitHub issues (⭐ 0)
- Prompt Tools Live: Live-updating tracker of prompt engineering tools, libraries, and techniques, refreshed every 15 minutes (⭐ 0)
- LLMOps Radar: Live index of the newest LLMOps tooling; track what's shipping in LLM observability and deployment (⭐ 0)