PluginEval is a three-layer quality evaluation framework for Claude Code plugins and skills. It combines deterministic static analysis, LLM-based semantic judging, and Monte Carlo simulation to produce calibrated quality scores with confidence intervals.
PluginEval answers the question: "How good is this plugin or skill?" It evaluates across 10 quality dimensions, detects anti-patterns, assigns letter grades, and awards quality badges (Bronze through Platinum).
```
┌─────────────────────────────────────────────────┐
│                 CLI / Commands                  │
│         score · certify · compare · init        │
├─────────────────────────────────────────────────┤
│                   Eval Engine                   │
│         Composite scoring, layer blending       │
├────────────┬────────────────┬───────────────────┤
│  Layer 1   │    Layer 2     │      Layer 3      │
│  Static    │   LLM Judge    │    Monte Carlo    │
│  Analysis  │   (Semantic)   │   (Statistical)   │
│  <2s, free │ ~30s, 4 calls  │  ~2min, 50 calls  │
├────────────┴────────────────┴───────────────────┤
│                  Parser Layer                   │
│        SKILL.md, agents/*.md, plugin.json       │
├─────────────────────────────────────────────────┤
│               Statistical Methods               │
│   Wilson CI · Bootstrap CI · Clopper-Pearson    │
│       Cohen's κ · Coefficient of Variation      │
├─────────────────────────────────────────────────┤
│               Corpus & Elo Ranking              │
│    Gold standard index · Pairwise comparison    │
└─────────────────────────────────────────────────┘
```
PluginEval lives in plugins/plugin-eval/ and uses uv for dependency management.
```bash
cd plugins/plugin-eval

# Install core dependencies (static analysis only)
uv sync

# Install with LLM support (Layers 2 & 3)
uv sync --extra llm

# Install with direct API support
uv sync --extra api

# Install dev dependencies (tests, linting)
uv sync --extra dev
```

Requirements:

- Python ≥ 3.12
- Core: `pydantic`, `typer`, `rich`, `pyyaml`
- LLM layers: `claude-agent-sdk` (uses Claude Code Max plan by default)
- API alternative: `anthropic` SDK (requires `ANTHROPIC_API_KEY`)
```bash
# Quick evaluation (static only, instant)
uv run plugin-eval score path/to/skill --depth quick

# Standard evaluation (static + LLM judge)
uv run plugin-eval score path/to/skill --depth standard

# Deep evaluation (all three layers)
uv run plugin-eval score path/to/skill --depth deep

# Output formats
uv run plugin-eval score path/to/skill --output json
uv run plugin-eval score path/to/skill --output markdown
uv run plugin-eval score path/to/skill --output html

# CI gate: exit code 1 if below threshold
uv run plugin-eval score path/to/skill --threshold 70
```

Options:
| Option | Default | Description |
|---|---|---|
| `--depth` | `standard` | `quick`, `standard`, `deep`, `thorough` |
| `--output` | `markdown` | `json`, `markdown`, `html` |
| `--verbose` | `false` | Show detailed output |
| `--concurrency` | `4` | Max concurrent LLM calls (1–20) |
| `--auth` | `max` | Auth mode: `max` (Claude Code Max plan) or `api-key` |
| `--threshold` | none | Minimum score; exit 1 if below |
Runs at deep depth (all three layers). Takes 15–20 minutes.
```bash
uv run plugin-eval certify path/to/skill --output markdown
```

Compare two skills side-by-side across all dimensions.

```bash
uv run plugin-eval compare path/to/skill-a path/to/skill-b
```

Build a gold-standard corpus index from a plugins directory for Elo ranking.

```bash
uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus
```

PluginEval is also a Claude Code plugin with agents and commands.
| Command | Description |
|---|---|
| `/eval <path>` | Evaluate a plugin or skill (orchestrates static + judge) |
| `/certify <path>` | Full certification pipeline with badge |
| `/compare <a> <b>` | Head-to-head skill comparison |
| Agent | Model | Role |
|---|---|---|
| `eval-orchestrator` | Opus | Coordinates evaluation: runs CLI, dispatches judge, computes composite |
| `eval-judge` | Sonnet | LLM judge: scores 4 semantic dimensions with anchored rubrics |
The evaluation-methodology skill provides the full scoring methodology reference, including dimension definitions, rubric anchors, blend weights, and improvement guidance.
Speed: < 2 seconds. Cost: Free (no LLM calls). Deterministic.
Runs six structural sub-checks against the parsed SKILL.md:
| Sub-check | Weight | What it measures |
|---|---|---|
| `frontmatter_quality` | 35% | Name, description length, trigger-phrase quality ("Use when…", "Use PROACTIVELY") |
| `orchestration_wiring` | 25% | Output/input documentation, code examples, orchestrator anti-pattern |
| `progressive_disclosure` | 15% | Line count vs. sweet spot (200–600 lines), references/ and assets/ directories |
| `structural_completeness` | 10% | Heading density, code blocks, examples section, troubleshooting section |
| `token_efficiency` | 10% | MUST/NEVER/ALWAYS density, duplicate-line detection |
| `ecosystem_coherence` | 5% | Cross-references to other skills/agents, "related"/"see also" mentions |
Also detects anti-patterns (see below) and applies a multiplicative penalty.
Speed: ~30 seconds. Cost: 4 LLM calls (Haiku + Sonnet). Requires claude-agent-sdk.
Uses Claude as a semantic evaluator across 4 dimensions with anchored rubrics:
| Dimension | Model | Method |
|---|---|---|
| `triggering_accuracy` | Haiku | Generates 10 synthetic prompts (5 should-trigger, 5 should-not), computes F1 |
| `orchestration_fitness` | Sonnet | Rates worker-vs-orchestrator role using 5-point anchored rubric |
| `output_quality` | Sonnet | Simulates 3 realistic tasks, evaluates expected output quality |
| `scope_calibration` | Sonnet | Rates scope appropriateness using 5-point anchored rubric |
All 4 assessments run concurrently with semaphore-based throttling.
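Semaphore-based throttling with `asyncio` can be sketched as follows. This is illustrative only: `run_assessment` and `judge_all` are hypothetical names, and the sleep stands in for a real LLM call; it is not PluginEval's actual judge code.

```python
import asyncio

async def run_assessment(name: str, semaphore: asyncio.Semaphore) -> tuple[str, float]:
    """Run one judge assessment, throttled by the shared semaphore."""
    async with semaphore:
        # Placeholder for the real LLM call; sleep stands in for latency.
        await asyncio.sleep(0.01)
        return name, 0.8

async def judge_all(concurrency: int = 4) -> dict[str, float]:
    """Launch all four dimension assessments concurrently."""
    semaphore = asyncio.Semaphore(concurrency)
    dims = ["triggering_accuracy", "orchestration_fitness",
            "output_quality", "scope_calibration"]
    # gather preserves order, so results pair up with dims.
    results = await asyncio.gather(*(run_assessment(d, semaphore) for d in dims))
    return dict(results)

scores = asyncio.run(judge_all())
```

The semaphore caps in-flight calls at `--concurrency` while still letting all assessments be scheduled at once.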
Speed: ~2 minutes (50 runs) to ~5 minutes (100 runs). Cost: 50–100 LLM calls. Requires claude-agent-sdk.
Generates 15 varied prompts via Haiku, then runs N simulations to compute statistical reliability:
| Metric | Measure | Statistical Method |
|---|---|---|
| Activation rate | % of runs where skill activated | Wilson score CI |
| Output consistency | Mean quality + coefficient of variation | Bootstrap CI (1000 resamples) |
| Failure rate | % of runs that errored | Clopper-Pearson exact CI |
| Token efficiency | Median tokens, IQR, outlier detection | Normalized against 8000-token cap |
| Depth | Layers | Confidence Label | Time | Cost |
|---|---|---|---|---|
| `quick` | Static only | Estimated | < 2s | Free |
| `standard` | Static + Judge | Assessed | ~30s | 4 LLM calls |
| `deep` | Static + Judge + Monte Carlo (50 runs) | Certified | ~3 min | ~54 LLM calls |
| `thorough` | Static + Judge + Monte Carlo (100 runs) | Certified+ | ~6 min | ~104 LLM calls |
Each dimension has a weight and receives scores from different layers, blended using per-dimension weights:
| Dimension | Weight | Static | Judge | Monte Carlo | What it measures |
|---|---|---|---|---|---|
| `triggering_accuracy` | 25% | 0.15 | 0.25 | 0.60 | Does the description fire for the right prompts? |
| `orchestration_fitness` | 20% | 0.10 | 0.70 | 0.20 | Is it a composable worker, not an orchestrator? |
| `output_quality` | 15% | 0.00 | 0.40 | 0.60 | Would it produce correct, useful output? |
| `scope_calibration` | 12% | 0.30 | 0.55 | 0.15 | Is the scope well-sized for its domain? |
| `progressive_disclosure` | 10% | 0.80 | 0.20 | 0.00 | Does it use references/ for large content? |
| `token_efficiency` | 6% | 0.40 | 0.10 | 0.50 | Is it concise without repetition? |
| `robustness` | 5% | 0.00 | 0.20 | 0.80 | Does it handle varied inputs reliably? |
| `structural_completeness` | 3% | 0.90 | 0.10 | 0.00 | Does it have headings, code, examples? |
| `code_template_quality` | 2% | 0.30 | 0.70 | 0.00 | Are code examples production-ready? |
| `ecosystem_coherence` | 2% | 0.85 | 0.15 | 0.00 | Does it link to related skills/agents? |
Final = Σ(dimension_weight × blended_score) × 100 × anti_pattern_penalty
Where blended_score for each dimension is a weighted combination of available layer scores, renormalized to the layers actually present.
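The renormalization step can be sketched in a few lines of Python (the function name `blend` is illustrative, not PluginEval's actual API):

```python
def blend(layer_scores: dict[str, float], blend_weights: dict[str, float]) -> float:
    """Blend per-layer scores, renormalizing weights to the layers present."""
    present = {l: w for l, w in blend_weights.items() if l in layer_scores and w > 0}
    total = sum(present.values())
    return sum(layer_scores[l] * w / total for l, w in present.items())

# Example: triggering_accuracy blends 0.15/0.25/0.60, but at `standard`
# depth there is no Monte Carlo score, so static and judge renormalize
# to 0.375 and 0.625 respectively.
score = blend({"static": 0.8, "judge": 0.9},
              {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60})
```

With the example inputs this yields 0.8 × 0.375 + 0.9 × 0.625 = 0.8625.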
| Badge | Score | Elo | Stars | Meaning |
|---|---|---|---|---|
| Platinum | ≥ 90 | ≥ 1600 | ★★★★★ | Reference quality |
| Gold | ≥ 80 | ≥ 1500 | ★★★★ | Production ready |
| Silver | ≥ 70 | ≥ 1400 | ★★★ | Functional, needs polish |
| Bronze | ≥ 60 | ≥ 1300 | ★★ | Minimum viable |
Badges require both score AND Elo thresholds when Elo data is available.
Scores are also converted to letter grades:
| Grade | Score Range |
|---|---|
| A+ | ≥ 97 |
| A | ≥ 93 |
| A- | ≥ 90 |
| B+ | ≥ 87 |
| B | ≥ 83 |
| B- | ≥ 80 |
| C+ | ≥ 77 |
| C | ≥ 73 |
| C- | ≥ 70 |
| D+ | ≥ 67 |
| D | ≥ 63 |
| D- | ≥ 60 |
| F | < 60 |
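The grade table above reduces to an ordered cutoff scan; a minimal sketch (the function name is illustrative):

```python
# Cutoffs in descending order, mirroring the grade table.
GRADE_CUTOFFS = [(97, "A+"), (93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
                 (80, "B-"), (77, "C+"), (73, "C"), (70, "C-"),
                 (67, "D+"), (63, "D"), (60, "D-")]

def letter_grade(score: float) -> str:
    """Map a 0-100 score to its letter grade; anything below 60 is F."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"
```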
The static analyzer detects these anti-patterns, each with a severity that contributes to a multiplicative penalty:
| Flag | Severity | Trigger |
|---|---|---|
| `OVER_CONSTRAINED` | 10% | > 15 MUST/ALWAYS/NEVER directives |
| `EMPTY_DESCRIPTION` | 10% | Description < 20 characters |
| `MISSING_TRIGGER` | 15% | No "Use when…" trigger phrase in description |
| `BLOATED_SKILL` | 10% | > 800 lines without a references/ directory |
| `ORPHAN_REFERENCE` | 5% | Dead link to a file in references/ |
| `DEAD_CROSS_REF` | 5% | Cross-reference to a non-existent skill/agent |
Penalty formula: penalty = max(0.5, 1.0 − 0.05 × count) — each anti-pattern reduces the score by 5%, flooring at 50%.
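The count-based formula is a one-liner; a sketch (the function name is illustrative):

```python
def anti_pattern_penalty(count: int) -> float:
    """Multiplicative score penalty: 5% per detected anti-pattern, floored at 50%."""
    return max(0.5, 1.0 - 0.05 * count)
```

So two flags multiply the composite score by 0.9, and ten or more flags hit the 0.5 floor.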
For relative quality comparison against a corpus of known skills:
- Initial rating: 1500
- K-factor: 32
- Confidence intervals: Bootstrap resampling (500 resamples)
- Corpus management: `init` command indexes all skills from a plugins directory
- Reference selection: matches by category and similar line count
The Elo system uses the standard formula: E(A) = 1 / (1 + 10^((Rb - Ra) / 400)).
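That expectation, combined with the K-factor of 32, gives the usual rating update; a sketch (function names are illustrative, not PluginEval's API):

```python
def expected(ra: float, rb: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(ra: float, rb: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after a pairwise comparison.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    delta = k * (score_a - expected(ra, rb))
    return ra + delta, rb - delta

# Two skills at the initial 1500 rating: a win moves the winner to 1516.
ra, rb = update(1500, 1500, 1.0)
```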
The corpus is a JSON index of all skills used for Elo comparisons:
```bash
# Build corpus from your plugins directory
uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus

# The corpus stores:
# - Skill name, path, category, line count
# - Current Elo rating (updated after each comparison)
```

Reference skills are selected by matching category and approximate line count.
PluginEval uses rigorous statistical methods throughout:
| Method | Used For | Details |
|---|---|---|
| Wilson score CI | Activation rate confidence | Handles small-sample binomial proportions |
| Bootstrap CI | Output quality confidence | 1000 resamples, percentile method |
| Clopper-Pearson | Failure rate confidence | Exact CI for small failure counts |
| Coefficient of variation | Output consistency | std/mean ratio; lower = more consistent |
| Cohen's kappa | Inter-rater agreement | For multi-judge scenarios |
All statistical functions are pure Python with no external dependencies (no scipy/numpy required).
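As an illustration of the no-dependency approach, here is a pure-Python Wilson score interval. This is a sketch consistent with the standard formula, not PluginEval's actual `stats.py` code:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)
```

Unlike the naive normal approximation, the interval stays inside [0, 1] and behaves sensibly for the small sample sizes Monte Carlo runs produce (e.g. 45/50 activations).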
The parser extracts structured data from Claude Code plugin files:
- Skills: Parses SKILL.md frontmatter (name, description), counts headings, code blocks, languages, MUST/NEVER/ALWAYS directives, cross-references, and detects references/ and assets/ directories
- Agents: Parses agent .md frontmatter (name, description, model, tools), detects proactive triggers and skill references
- Plugins: Aggregates all skills and agents from a plugin directory
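Frontmatter extraction can be illustrated with a minimal regex-based sketch. The real parser handles more fields and edge cases; `parse_frontmatter` is a hypothetical name for illustration only:

```python
import re

def parse_frontmatter(text: str) -> dict[str, str]:
    """Extract simple `key: value` pairs from a SKILL.md YAML frontmatter block."""
    m = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return {}
    fields = {}
    for line in m.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

skill = "---\nname: async-python-patterns\ndescription: Use when writing async Python.\n---\n# Heading\n"
meta = parse_frontmatter(skill)
```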
```
plugins/plugin-eval/
├── .claude-plugin/
│   └── plugin.json              # Claude Code plugin manifest
├── agents/
│   ├── eval-orchestrator.md     # Orchestrates evaluation (Opus)
│   └── eval-judge.md            # LLM judge agent (Sonnet)
├── commands/
│   ├── eval.md                  # /eval slash command
│   ├── certify.md               # /certify slash command
│   └── compare.md               # /compare slash command
├── skills/
│   └── evaluation-methodology/
│       ├── SKILL.md             # Full methodology reference
│       └── references/
│           └── rubrics.md       # Detailed rubric anchors
├── src/plugin_eval/
│   ├── __init__.py
│   ├── cli.py                   # Typer CLI (score, certify, compare, init)
│   ├── engine.py                # Eval engine (layer coordination, composite scoring)
│   ├── models.py                # Pydantic models (Depth, Badge, EvalConfig, results)
│   ├── parser.py                # Plugin/skill/agent parser
│   ├── reporter.py              # JSON/Markdown/HTML output
│   ├── corpus.py                # Gold standard corpus for Elo ranking
│   ├── elo.py                   # Elo rating calculator with bootstrap CI
│   ├── stats.py                 # Statistical methods (Wilson, bootstrap, Clopper-Pearson)
│   └── layers/
│       ├── __init__.py
│       ├── static.py            # Layer 1: deterministic structural analysis
│       ├── judge.py             # Layer 2: LLM semantic evaluation
│       └── monte_carlo.py       # Layer 3: statistical reliability simulation
├── tests/                       # Comprehensive test suite
│   ├── conftest.py
│   ├── test_cli.py
│   ├── test_engine.py
│   ├── test_static.py
│   ├── test_judge.py
│   ├── test_monte_carlo.py
│   ├── test_models.py
│   ├── test_parser.py
│   ├── test_reporter.py
│   ├── test_corpus.py
│   ├── test_elo.py
│   ├── test_stats.py
│   └── test_e2e.py              # End-to-end tests against real plugins
├── pyproject.toml               # uv/hatch project config
└── uv.lock
```
```bash
cd plugins/plugin-eval

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=plugin_eval

# Run specific test file
uv run pytest tests/test_static.py

# Run e2e tests (requires real plugin corpus)
uv run pytest tests/test_e2e.py
```

Example report (`--output markdown`):

```markdown
# PluginEval Report

**Path:** `plugins/python-development/skills/async-python-patterns`
**Timestamp:** 2025-03-26T12:00:00+00:00
**Depth:** standard

## Overall Score

| Metric | Value |
|--------|-------|
| Score | **78.3/100** |
| Confidence | Assessed |
| Badge | Silver |

## Layer Breakdown

| Layer | Score | Anti-Patterns |
|-------|-------|---------------|
| static | 0.742 | 0 |
| judge | 0.811 | 0 |

## Dimension Scores

| Dimension | Weight | Score | Grade |
|-----------|--------|-------|-------|
| Triggering Accuracy | 25% | 0.850 | B |
| Orchestration Fitness | 20% | 0.780 | C+ |
| Output Quality | 15% | 0.820 | B- |
| Scope Calibration | 12% | 0.750 | C |
| Progressive Disclosure | 10% | 0.600 | D- |
| Token Efficiency | 6% | 0.910 | A- |
| ...
```