Skip to content

The credit score for AI agents. 113-question diagnostic, 69 scoring modules, 66 adversarial attack packs, 4064 tests. Evidence-gated trust scoring with cryptographic proof chains.

License

Notifications You must be signed in to change notification settings

thewisecrab/AgentMaturityCompass

Repository files navigation

🧭 Agent Maturity Compass (AMC)

The credit score for AI agents.

🧭 AMC Score: 3.7 / 5.0 — Defined
   Strategic Agent Ops ····· 3.2  (17 questions)
   Skills ·················· 4.1  (35 questions)
   Resilience ·············· 3.8  (30 questions)
   Leadership & Autonomy ··· 3.5  (21 questions)
   Culture & Alignment ····· 3.9  (23 questions)
   Evidence: ✓ Merkle root 9c4e…a7f0 (Ed25519)

The 84-Point Lie

Every AI governance framework has the same fatal flaw: the agent being evaluated provides the evidence.

Scoring Method Score Reality
Keyword / self-reported 100/100 ✅ "I have safety controls"
AMC execution-verified 16/100 ❌ Agent bypassed every control when tested

That's an 84-point documentation inflation gap. AMC closes it with cryptographic evidence chains that can't be faked.

Get Started (2 minutes)

npm i -g agent-maturity-compass
mkdir my-agent && cd my-agent
amc init

That's it. amc init walks you through your first score interactively.

Share your score — paste a badge into your README:

amc quickscore --share   # markdown summary + next-level action plan
amc badge                # ![AMC L3](https://img.shields.io/badge/AMC-L3%20Defined-blue)

📖 Full guide: install → first score → L5

Tests License Questions Modules Attack Packs Adapters Sector Packs 🌐 Website · 📖 Docs · 💬 Discussions


How It Works

Agent (untrusted) → AMC Gateway (trusted observer) → Evidence Ledger (signed, hash-chained)
                                                              ↓
                                                Scoring Engine (138 questions, 5 dimensions)
                                                              ↓
                                               AMC Studio (dashboard + API)

Not all evidence is equal — AMC weights by trust tier:

Tier Weight Source
OBSERVED_HARDENED 1.1× AMC-controlled traces with stronger context
OBSERVED 1.0× Directly observed via AMC gateway
ATTESTED 0.8× Cryptographic attestation via vault/notary
SELF_REPORTED 0.4× Agent claims — capped, cannot inflate maturity

Works With Any Agent

amc wrap claude -- claude "analyze this"           # Claude
amc wrap gemini -- gemini chat                     # Gemini
amc adapters run --adapter generic-cli -- python bot.py  # Any CLI agent
amc score evidence-ingest --format openai-evals    # Import existing evals

14 framework adapters: LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex, Semantic Kernel, Claude Code, Gemini, OpenClaw, OpenHands, and more.


Agent Guide — Guardrails From Your Score

AMC doesn't just score — it generates operational guardrails and applies them directly to your agent's config file.

# One command: detect framework → generate guardrails → apply to config
amc guide --go
  • 10 frameworks auto-detected from project files (pyproject.toml, package.json, config files)
  • 15 config targets — AGENTS.md, CLAUDE.md, .cursorrules, .kiro/steering, .gemini/style.md, and more
  • Severity-tagged — 🔴 Critical, 🟡 High, 🔵 Medium — so you know what to fix first
  • Idempotent — re-running --apply updates only the guardrails section (AMC-GUARDRAILS markers)
  • CI gateamc guide --ci --target 3 exits non-zero if below threshold
  • Complianceamc guide --compliance EU_AI_ACT maps gaps to regulatory obligations (EU AI Act, ISO 42001, NIST AI RMF, SOC 2, ISO 27001)
amc guide --status              # One-line health check
amc guide --interactive         # Cherry-pick which gaps to fix
amc guide --watch --apply       # Continuous monitoring + auto-update
amc guide --diff                # What improved since last run

📖 Full guide system docs


Sector Packs — Enterprise-Grade Vertical Assessment

AMC ships with 40 industry-specific assessment packs covering regulated sectors, critical infrastructure, and public institutions. Each pack adds precise, sub-vertical questions on top of the base AMC rubric.

amc sector packs list              # 40 packs across 7 stations
amc sector score --pack digital-health-record --agent my-agent
amc sector gaps --pack clinical-trials --agent my-agent
amc sector report --pack drug-discovery --output reports/drug.md
Station Packs Focus
🌿 Environment 6 Farm-to-fork, textiles, manufacturing, energy, water
🏥 Health 9 EHR, clinical trials, drug discovery, precision medicine
💰 Wealth 5 Payments, financial inclusion, DeFi, circular economy
🎓 Education 5 K-12, higher ed, skills training, accessibility
🚇 Mobility 5 Smart cities, ports, real estate, cloud infra, privacy
💡 Technology 5 AI intelligence, ecosystems, infotainment, IP partnerships
🏛️ Governance 5 Digital identity, elections, legislation, citizen services

382 questions with specific regulatory article references (HIPAA §164.312(a)(1), EU AI Act Art. 5(1)(a), FERPA 20 U.S.C. §1232g, UNECE WP.29 R155 §7, UNCAC Art. 7). Every pack includes riskTier, EU AI Act classification, SDG alignment, certification path, and key risks.

📖 Full Sector Packs docs


📊 The Platform (8 modules)
Module What It Does
AMC Score 138 diagnostic questions, 5 dimensions, L0–L5 maturity, evidence-weighted
AMC Shield 74 attack packs: injection, exfiltration, sycophancy, sabotage, over-compliance, and more
AMC Enforce Governor engine with policy packs, approval workflows, scoped leases
AMC Vault Ed25519 key vault, Merkle-tree evidence chains, HSM/TPM support
AMC Watch Studio dashboard, gateway proxy, Prometheus metrics, cost tracking
AMC Fleet Multi-agent trust composition, delegation graphs, contradiction detection
AMC Passport Portable agent credential (.amcpass), verifiable offline
AMC Comply EU AI Act, ISO 42001, NIST AI RMF, SOC 2 compliance mapping
📐 5 Dimensions, 126 Questions, 6 Maturity Levels
Dimension Questions Focus
Strategic Agent Operations 17 Mission clarity, scope adherence, decision traceability
Skills 35 Tool mastery, injection defense, DLP, zero-trust
Resilience 30 Graceful degradation, circuit breakers, monitor bypass resistance
Leadership & Autonomy 21 Structured logs, traces, cost tracking, SLO monitoring
Culture & Alignment 23 Test harnesses, benchmarks, feedback loops, over-compliance detection
Level Name Description
L0 Absent No structure. Reactive. Fragile.
L1 Initial Intent exists but isn't operational.
L2 Developing Partial structure. Edge cases break.
L3 Defined Repeatable. Measurable. Auditable.
L4 Managed Proactive. Risk-calibrated. Stress-tested.
L5 Optimizing Self-correcting. Certified. Continuously verified.
🔴 Assurance Lab (Built-in Red Team)

AMC doesn't just score — it attacks. 74 deterministic attack packs including:

  • injection — Prompt override and system-message tampering
  • exfiltration — Secret and PII leakage controls
  • toolMisuse — Denied tools, model, and budget boundaries
  • truthfulness — Evidence-bound claim discipline
  • sycophancy — Does the agent agree with wrong statements to please you?
  • self-preservation — Does the agent resist shutdown or modification?
  • sabotage — Does the agent subtly undermine goals when conflicted?
  • adversarial-robustness — TAP/PAIR, Crescendo, Skeleton Key attacks
  • context-leakage — EchoLeak, cross-session data bleed
  • operational-discipline — Supply chain integrity, MCP poisoning
  • agent-as-proxy — Indirect prompt injection via agent delegation chains
  • economic-amplification — Cost explosion and resource exhaustion attacks
  • mcp-security — MCP server poisoning, tool schema manipulation
  • zombie-persistence — Agents that survive termination or persist unauthorized
  • over-compliance — H-Neurons-inspired detection of agents that exceed instructions (arXiv:2512.01797)
amc assurance run --scope full --agent my-agent
🔬 74 Scoring Modules

Beyond the core diagnostic, AMC includes research-backed scoring:

  • Calibration gap (confidence vs reality)
  • Evidence conflict detection
  • Evidence density mapping (blind spot detection)
  • Gaming resistance (adversarial score inflation)
  • Sleeper agent detection (context-dependent behavior)
  • Audit depth (black-box, white-box, outside-the-box)
  • Policy consistency (pass^k reliability)
  • Task horizon (METR-inspired)
  • Factuality (parametric, retrieval, grounded)
  • Autonomy duration with domain risk profiles
  • Pause quality (agent-initiated stops)
  • Memory integrity & poisoning resistance
  • Alignment index (safety × honesty × helpfulness)
  • Interpretability scoring
  • Output attestation (cryptographic signing)
  • Mutual verification (agent-to-agent trust)
  • Network transparency log (Merkle tree)
  • Over-compliance detection (H-Neurons, arXiv:2512.01797)
  • Agent Guide system (guardrails, agent instructions, CI gates)
  • EU AI Act compliance, OWASP LLM Top 10
  • Trust-authorization synchronization (arXiv:2512.06914)
  • Monitor bypass resistance (arXiv:2503.09950)
  • Adaptive access control (arXiv:2504.12345)
  • Memory security architecture (arXiv:2503.10632)
  • Agent protocol security (MCP/A2A hardening)
  • And more...
📋 Compliance
Framework Status
EU AI Act 12 article mappings, audit binder generation
ISO 42001 Clauses 4-10 mapped to AMC dimensions
NIST AI RMF Risk management framework alignment
SOC 2 Trust service criteria mapping
OWASP LLM Top 10 Full coverage (10/10)
amc audit binder create --framework eu-ai-act
📚 Documentation

Install Options

# npm (recommended)
npm i -g agent-maturity-compass

# From source
git clone https://github.com/thewisecrab/AgentMaturityCompass.git
cd AgentMaturityCompass && npm ci && npm run build && npm link

# Docker
docker run -p 3212:3212 -p 3210:3210 amc/studio

Contributing

AMC is MIT licensed and open source. Contributions welcome.

  1. Fork → branch → npm test → PR

License

MIT — public infrastructure for the age of AI agents.

As autonomous agents become the primary interface between humans and technology, trust infrastructure must be open, verifiable, and accessible to everyone. AMC exists to make that real.

About

The credit score for AI agents. 113-question diagnostic, 69 scoring modules, 66 adversarial attack packs, 4064 tests. Evidence-gated trust scoring with cryptographic proof chains.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors