Roadmap And Gaps

This file is the current map of planned or implied Iris features that are not fully implemented. Historical specs and phase plans under docs/superpowers/ are useful design records, but this file should be treated as the maintained status page.

Implemented Today

| Area | Status |
| --- | --- |
| Web target evaluation | Implemented through @iris/adapter-web with Playwright, DOM/screenshot observations, web tools, probes, downloads, video evidence, Discovery survey, Judge, report, replay, and revalidation. |
| CLI target evaluation | Implemented through @iris/adapter-cli for command execution, stdout/stderr/exit-code evidence, CLI Discovery, CLI rubrics, setup, report/replay/revalidation, and Codex/Claude provider paths. |
| Targeted user instructions | Implemented through --spec, repeated --task, --tasks, and --continue-from. |
| Agent feedback output | Implemented as agent-feedback.json, derived from report.json with gate posture, uncertainty, agent trajectory audit, retest guidance, usage, and artifact links. |
| Report replay | Implemented through iris judge and iris report --revalidate. |
| Cross-run comparison | Implemented as iris diff, not the older planned iris compare name. |

Planned But Not Implemented

API Product Adapter

TargetKind reserves api, and the original design describes API testing, but there is no packages/adapter-api today. --transport api is a provider transport for raw Anthropic API calls; it is not API-product testing.

Needed before this is real:

  • OpenAPI/Swagger or request-collection ingestion.
  • API-native tools for requests, auth, pagination, retries, and follow-up reads.
  • API outcome contract requiring readback evidence, not just HTTP 200 side effects.
  • API rubrics and API Discovery survey.
  • Saved-run judge/report replay using API outcome semantics.
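
As a rough illustration of the readback requirement above, the following sketch treats a write as verified only when a later successful read follows it. No API adapter exists in Iris, so every name here (ApiExchange, OutcomeVerdict, judgeWrite) is a hypothetical placeholder, and the sketch covers only POST/PUT/PATCH writes.

```typescript
// Hypothetical sketch: Iris has no API adapter today.
// ApiExchange, OutcomeVerdict, and judgeWrite are illustrative names only.
interface ApiExchange {
  method: string;
  url: string;
  status: number;
}

type OutcomeVerdict = "verified" | "unverified" | "failed";

const WRITE_METHODS: ReadonlyArray<string> = ["POST", "PUT", "PATCH"];

// A write counts as verified only when a later successful GET reads the
// resource back. A 2xx response to the write alone stays "unverified".
function judgeWrite(
  exchanges: ApiExchange[],
  writeUrl: string,
  readbackUrl: string,
): OutcomeVerdict {
  // Locate the first matching write request.
  let writeIndex = -1;
  for (let i = 0; i < exchanges.length; i++) {
    const e = exchanges[i];
    if (WRITE_METHODS.indexOf(e.method) !== -1 && e.url === writeUrl) {
      writeIndex = i;
      break;
    }
  }
  if (writeIndex === -1) return "unverified";

  const write = exchanges[writeIndex];
  if (write.status < 200 || write.status >= 300) return "failed";

  // Only reads that happen after the write can serve as readback evidence.
  const readback = exchanges
    .slice(writeIndex + 1)
    .some((e) => e.method === "GET" && e.url === readbackUrl && e.status === 200);
  return readback ? "verified" : "unverified";
}
```

A real contract would presumably also compare the readback body with the written payload and handle DELETE, where the expected readback is an absence rather than a 200.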

Desktop Product Adapter

TargetKind reserves desktop, but there is no desktop adapter package.

Needed before this is real:

  • App launch/attach lifecycle.
  • Accessibility-tree and screenshot observations.
  • Native click/type/menu/window tools.
  • Screen recording or screenshot evidence slicing.
  • Desktop outcome contract and rubrics.

Published npx / Package Distribution

The original design mentions an npx-style distributable. The repo is still a private pnpm workspace with package versions at 0.0.0. Development usage is:

pnpm install
pnpm build
node packages/cli/dist/bin.js --help

Needed before public package use:

  • Package naming/versioning decision.
  • Publishable package metadata and files.
  • Release build/test workflow.
  • Install smoke tests outside the monorepo.

MCP Server Mode

The original design mentions future iris serve --mcp. There is no MCP server command today. Current integration points are CLI commands and generated report artifacts.

Needed before this is real:

  • MCP tool schema for eval/judge/report/diff.
  • Streaming progress and artifact discovery.
  • Stable workspace/output-directory policy.
  • Auth/provider configuration story for non-interactive MCP clients.
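
For orientation, an MCP tool is described by a name, a description, and a JSON Schema inputSchema. The sketch below shows what descriptors for eval and diff might look like. The tool names and parameters (iris_eval, iris_diff, target, outputDir, baselineRun, candidateRun) are assumptions for illustration, not a committed schema.

```typescript
// Shape of an MCP tool descriptor: name, description, JSON Schema input.
interface McpTool {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

// Hypothetical tool list. Names and parameters are assumptions only.
const irisTools: McpTool[] = [
  {
    name: "iris_eval",
    description: "Evaluate a target and write report artifacts.",
    inputSchema: {
      type: "object",
      properties: {
        target: { type: "string", description: "URL or command under test" },
        outputDir: { type: "string", description: "Explicit artifact directory" },
      },
      required: ["target", "outputDir"],
    },
  },
  {
    name: "iris_diff",
    description: "Compare two saved runs.",
    inputSchema: {
      type: "object",
      properties: {
        baselineRun: { type: "string" },
        candidateRun: { type: "string" },
      },
      required: ["baselineRun", "candidateRun"],
    },
  },
];

// Returns the required argument names that the caller did not supply.
function missingArguments(tool: McpTool, args: Record<string, unknown>): string[] {
  return (tool.inputSchema.required ?? []).filter((key) => !(key in args));
}
```

Making the output directory a required argument is one way to keep the workspace policy explicit for non-interactive clients. Other designs, such as a server-wide default, are equally plausible.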

Structured JSON Logs

--json-logs is accepted for compatibility, but the help text says it is reserved. Iris still emits mostly human-oriented progress lines to stderr.

Needed before this is real:

  • Event schema for phase start/end, model calls, tool calls, costs, artifacts, warnings, and terminal verdicts.
  • One JSON object per stderr line.
  • Tests that forbid plain text when --json-logs is enabled.
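
The three requirements above could be sketched roughly as follows. The event shapes and field names (LogEvent, ts, durationMs) are assumptions, not an existing Iris contract. The sketch shows a serializer that always emits one line per event, plus the kind of check a test could use to reject plain text.

```typescript
// Hypothetical event schema. Field names are assumptions, not an Iris contract.
type LogEvent =
  | { type: "phase"; phase: string; state: "start" | "end" }
  | { type: "tool_call"; tool: string; durationMs: number }
  | { type: "warning"; message: string }
  | { type: "verdict"; verdict: string };

// Serialize one event as exactly one line. JSON.stringify escapes newlines
// inside string values, so the result never contains a raw line break.
function formatEvent(event: LogEvent, timestamp: string): string {
  return JSON.stringify({ ts: timestamp, ...event });
}

// Check a test could apply to captured stderr: every non-empty line must
// parse as a JSON object. Any plain-text progress line fails it.
function isJsonLines(output: string): boolean {
  return output
    .split("\n")
    .filter((line) => line.length > 0)
    .every((line) => {
      try {
        const parsed: unknown = JSON.parse(line);
        return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
      } catch {
        return false;
      }
    });
}
```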

User-Supplied Web Auth State

The web adapter can import/export Playwright storage state internally, and web --agents N is available for Claude Code and Codex App Server when no setup state must be shared. --share-auth still only supports the Claude Code bootstrap flow. There is not yet a simple user-facing --auth storageState.json flag.

Needed before this is real:

  • CLI option and validation.
  • Report metadata showing auth state was supplied without leaking secrets.
  • Clear interaction with --setup, --share-auth, replay, Codex App Server multi-agent sessions, and redaction.
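
A minimal sketch of the validation and report-metadata items above, assuming the supplied file follows the Playwright storage-state layout (a cookies array and an origins array). The function and type names (parseStorageState, summarizeAuthState, AuthStateSummary) are hypothetical. The summary records only counts and cookie domains, so no cookie or localStorage values reach the report.

```typescript
// Subset of the Playwright storage-state file layout.
interface PlaywrightStorageState {
  cookies: Array<{ name: string; value: string; domain: string }>;
  origins: Array<{
    origin: string;
    localStorage: Array<{ name: string; value: string }>;
  }>;
}

// Hypothetical report metadata: counts and domains only, never values.
interface AuthStateSummary {
  supplied: true;
  cookieCount: number;
  cookieDomains: string[];
  originCount: number;
}

// Validate the top-level shape before handing the state to the adapter.
function parseStorageState(json: string): PlaywrightStorageState {
  const data = JSON.parse(json);
  if (
    typeof data !== "object" ||
    data === null ||
    !Array.isArray(data.cookies) ||
    !Array.isArray(data.origins)
  ) {
    throw new Error("auth state must contain 'cookies' and 'origins' arrays");
  }
  return data as PlaywrightStorageState;
}

// Build redacted metadata that is safe to embed in a report.
function summarizeAuthState(state: PlaywrightStorageState): AuthStateSummary {
  const domains: string[] = [];
  for (const cookie of state.cookies) {
    if (domains.indexOf(cookie.domain) === -1) domains.push(cookie.domain);
  }
  return {
    supplied: true,
    cookieCount: state.cookies.length,
    cookieDomains: domains.sort(),
    originCount: state.origins.length,
  };
}
```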

CLI Parallel Isolation

CLI targets are intentionally blocked for --agents >1. Worktrees isolate git state, not process/filesystem state inside a product under test. Parallel CLI sessions need an explicit isolation contract.

Needed before this is real:

  • Per-session cwd/state copy or fixture snapshot.
  • Environment variable and temp-file policy.
  • Merge/cleanup behavior for evidence artifacts.
  • Tests proving sessions cannot race the same state file.
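
One possible shape for such an isolation contract is sketched below. It only plans per-session directories and environment overrides; copying the fixture and launching processes are left out. The directory layout, the SessionIsolation type, and the choice to override TMPDIR and HOME are all assumptions.

```typescript
// Hypothetical isolation plan. Iris blocks --agents >1 for CLI targets today.
interface SessionIsolation {
  sessionId: string;
  cwd: string; // private working copy of the fixture for this session
  tmpDir: string; // private temp directory for this session
  env: Record<string, string>;
}

// Derive one disjoint directory tree and environment per session so that
// two sessions never share a working directory, temp directory, or HOME.
function planSessions(
  runDir: string,
  agents: number,
  baseEnv: Record<string, string>,
): SessionIsolation[] {
  const plans: SessionIsolation[] = [];
  for (let i = 0; i < agents; i++) {
    const sessionId = `session-${i + 1}`;
    const root = `${runDir}/${sessionId}`;
    const tmpDir = `${root}/tmp`;
    plans.push({
      sessionId,
      cwd: `${root}/work`,
      tmpDir,
      // Assumption: overriding TMPDIR and HOME is enough for the product
      // under test. Real products may need further variables.
      env: { ...baseEnv, TMPDIR: tmpDir, HOME: `${root}/home` },
    });
  }
  return plans;
}
```

A test for the last bullet could assert that no two plans share a cwd, tmpDir, or HOME, and then run two sessions that write the same relative state file.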

Full CLI Process/TUI Semantics

The CLI adapter currently runs commands and captures stdout/stderr/exit code/filesystem evidence. It is not a PTY/TUI automation layer.

Potential future work:

  • Interactive stdin conversations.
  • Signals and long-running processes.
  • Terminal UI screen-buffer observations.
  • Command timeout/retry policy tuned for daemons and REPLs.

Coverage-Plan Multi-Agent Runs

Iris supports two web multi-session primitives: --agents N for sharding planned product goals and --perspectives for corroborating the same goals through existing Explorer modes. Reports include an agent trajectory audit, so stuck or duplicate sessions are visible.

Coverage planning now preserves Discovery's product-native scope before sharding. Runtime safety caps are diagnostics; they do not delete goals from the product denominator.

Normal web multi-agent runs can still propose bounded bonus goals after assigned learned goals are terminal. Proposed goals use non-overlapping ids and are reported as expansion/follow-up coverage. Perspective/corroboration sessions do not receive expansion budget.

What is still not implemented is a richer pre-Explorer coordinator that prioritizes or sequences very large product maps by release risk. The current planner preserves learned goals, performs semantic artifact-editor dedupe, and assigns the resulting work to agents.

Needed before this is real:

  • Coordinator that derives richer disjoint coverage charters from Discovery capability gaps before Explorer starts, not from persona labels.
  • Merge/reconcile policy for conflicting new findings and goal outcomes.
  • Cost/time reporting by coverage charter.
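
To make the sharding invariants described in this section concrete, here is a small sketch. It assumes round-robin assignment and a namespaced id format for bonus goals. Neither is confirmed as the actual planner behavior, and Goal, shardGoals, and proposedGoalId are hypothetical names.

```typescript
// Illustrative goal type. Not the Iris planner's actual data model.
interface Goal {
  id: string;
  title: string;
}

// Round-robin sharding (an assumption): every goal lands in exactly one
// shard, so the product denominator is unchanged by the split.
function shardGoals(goals: Goal[], agents: number): Goal[][] {
  if (agents < 1) throw new Error("agents must be >= 1");
  const shards: Goal[][] = [];
  for (let i = 0; i < agents; i++) shards.push([]);
  goals.forEach((goal, index) => {
    shards[index % agents].push(goal);
  });
  return shards;
}

// Bonus-goal ids namespaced by agent, so a proposed goal cannot collide
// with another agent's proposals. The format is an assumption.
function proposedGoalId(agentIndex: number, sequence: number): string {
  return `bonus-a${agentIndex + 1}-${sequence}`;
}
```

The missing coordinator would replace the round-robin step with charters derived from Discovery capability gaps. The invariant worth keeping is that the shards together still contain every learned goal exactly once.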

Bench Automation

pnpm bench exists, but scheduled nightly/release automation is not in this repo. Bench runs still use real provider calls, so results can vary from run to run.

Needed before this is real:

  • CI or scheduled runner integration.
  • Provider credential policy.
  • Stochastic tolerance and artifact retention policy.
  • Regression dashboard or summary artifact.
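
A stochastic tolerance policy could be as simple as the sketch below, which flags a regression only when the pass rate drops by more than an agreed margin. BenchSummary, passRate, and isRegression are hypothetical names, and the absolute-difference rule is just one possible policy.

```typescript
// Hypothetical summary of one bench run.
interface BenchSummary {
  passed: number;
  total: number;
}

function passRate(summary: BenchSummary): number {
  return summary.total === 0 ? 0 : summary.passed / summary.total;
}

// Flag a regression only when the pass rate drops by more than `tolerance`
// (an absolute fraction, e.g. 0.1 for ten percentage points).
function isRegression(
  baseline: BenchSummary,
  current: BenchSummary,
  tolerance: number,
): boolean {
  return passRate(baseline) - passRate(current) > tolerance;
}
```

With a baseline of 18/20 and a tolerance of 0.1, a run at 17/20 is within tolerance and a run at 14/20 is flagged.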

First-Class Otto Control Loop

agent-feedback.json gives build agents a better feedback packet, but Iris does not directly drive Otto's queue, retry, merge, or task protocol. That remains an integration layer outside Iris.

Needed before this is real:

  • Agreed Otto input/output contract.
  • Token-efficient feedback format if agent-feedback.json is too verbose.
  • Retry policy using Iris uncertainty without turning Iris into a hard gate.

Intentionally Out Of Scope For Now

  • Pixel-perfect visual regression as the main product. Iris can use visual evidence, but it is a product-behavior evaluator, not a screenshot diff tool.
  • Screen-reader automation. Accessibility probes exist for web, but Iris does not operate products through a real screen reader.
  • Complex credential/external-state flows as a baseline guarantee. They can be tested when setup/auth are provided, but they are not assumed to work automatically.