This file is the current map of planned or implied Iris features that are not
fully implemented. Historical specs and phase plans under docs/superpowers/
are useful design records, but this file should be treated as the maintained
status page.
| Area | Status |
|---|---|
| Web target evaluation | Implemented through @iris/adapter-web with Playwright, DOM/screenshot observations, web tools, probes, downloads, video evidence, Discovery survey, Judge, report, replay, and revalidation. |
| CLI target evaluation | Implemented through @iris/adapter-cli for command execution, stdout/stderr/exit-code evidence, CLI Discovery, CLI rubrics, setup, report/replay/revalidation, and Codex/Claude provider paths. |
| Targeted user instructions | Implemented through --spec, repeated --task, --tasks, and --continue-from. |
| Agent feedback output | Implemented as agent-feedback.json, derived from report.json with gate posture, uncertainty, agent trajectory audit, retest guidance, usage, and artifact links. |
| Report replay | Implemented through iris judge and iris report --revalidate. |
| Cross-run comparison | Implemented as iris diff, not the older planned iris compare name. |
TargetKind reserves api, and the original design describes API testing, but
there is no packages/adapter-api today. --transport api is a provider
transport for raw Anthropic API calls; it is not API-product testing.
Needed before this is real:
- OpenAPI/Swagger or request-collection ingestion.
- API-native tools for requests, auth, pagination, retries, and follow-up reads.
- API outcome contract requiring readback evidence, not just HTTP 200 side effects.
- API rubrics and API Discovery survey.
- Saved-run judge/report replay using API outcome semantics.
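The readback requirement in the list above can be sketched as follows. This is a minimal illustration only: there is no packages/adapter-api, and every type and function name here (ApiCallEvidence, ApiOutcome, judgeWriteOutcome) is hypothetical, not an Iris API.

```typescript
// Hypothetical sketch; none of these names exist in Iris today.
// A write counts as achieved only when a follow-up read confirms it.

interface ApiCallEvidence {
  method: string;
  url: string;
  status: number;
  body: unknown;
}

type ApiOutcome =
  | { verdict: "achieved"; reason: string }
  | { verdict: "unverified"; reason: string }
  | { verdict: "failed"; reason: string };

function judgeWriteOutcome(
  write: ApiCallEvidence,
  readback: ApiCallEvidence | undefined,
  expected: Record<string, unknown>,
): ApiOutcome {
  // A non-2xx write is a failure regardless of any later read.
  if (write.status < 200 || write.status >= 300) {
    return { verdict: "failed", reason: `write returned HTTP ${write.status}` };
  }
  // A 2xx write without a readback is not enough evidence on its own.
  if (readback === undefined) {
    return { verdict: "unverified", reason: "no readback request was made" };
  }
  if (readback.status !== 200) {
    return { verdict: "failed", reason: `readback returned HTTP ${readback.status}` };
  }
  if (typeof readback.body !== "object" || readback.body === null) {
    return { verdict: "failed", reason: "readback body is not a JSON object" };
  }
  // Compare each expected field against the state the API reports back.
  const body = readback.body as Record<string, unknown>;
  for (const key of Object.keys(expected)) {
    if (body[key] !== expected[key]) {
      return { verdict: "failed", reason: `readback field "${key}" does not match` };
    }
  }
  return { verdict: "achieved", reason: "readback confirmed all expected fields" };
}
```

The three-way verdict keeps "the write looked fine but nobody checked" separate from both success and failure, which is the distinction the outcome contract item is asking for.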
TargetKind reserves desktop, but there is no desktop adapter package.
Needed before this is real:
- App launch/attach lifecycle.
- Accessibility-tree and screenshot observations.
- Native click/type/menu/window tools.
- Screen recording or screenshot evidence slicing.
- Desktop outcome contract and rubrics.
The original design mentions an npx-style distributable. The repo is still a
private pnpm workspace with package versions at 0.0.0. Development usage is:
pnpm install
pnpm build
node packages/cli/dist/bin.js --help
Needed before public package use:
- Package naming/versioning decision.
- Publishable package metadata and files.
- Release build/test workflow.
- Install smoke tests outside the monorepo.
The original design mentions future iris serve --mcp. There is no MCP server
command today. Current integration points are CLI commands and generated report
artifacts.
Needed before this is real:
- MCP tool schema for eval/judge/report/diff.
- Streaming progress and artifact discovery.
- Stable workspace/output-directory policy.
- Auth/provider configuration story for non-interactive MCP clients.
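As a rough sketch of the first item, MCP tools are described by a name, a description, and a JSON Schema for their input. The descriptors below are assumptions for illustration: the tool names (iris_eval, iris_judge, iris_report, iris_diff) and every argument name are hypothetical, and no such schema exists in Iris.

```typescript
// Hypothetical tool descriptors; names and arguments are illustrative only.
interface McpToolDescriptor {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const irisMcpTools: McpToolDescriptor[] = [
  {
    name: "iris_eval",
    description: "Run an evaluation against a target.",
    inputSchema: {
      type: "object",
      properties: {
        target: { type: "string", description: "Target URL or command (assumed)." },
        outputDir: { type: "string", description: "Where artifacts are written (assumed)." },
      },
      required: ["target"],
    },
  },
  {
    name: "iris_judge",
    description: "Re-judge a saved run.",
    inputSchema: {
      type: "object",
      properties: { runDir: { type: "string", description: "Saved run directory (assumed)." } },
      required: ["runDir"],
    },
  },
  {
    name: "iris_report",
    description: "Regenerate or revalidate a report for a saved run.",
    inputSchema: {
      type: "object",
      properties: { runDir: { type: "string", description: "Saved run directory (assumed)." } },
      required: ["runDir"],
    },
  },
  {
    name: "iris_diff",
    description: "Compare two saved runs.",
    inputSchema: {
      type: "object",
      properties: {
        baselineDir: { type: "string", description: "Baseline run directory (assumed)." },
        candidateDir: { type: "string", description: "Candidate run directory (assumed)." },
      },
      required: ["baselineDir", "candidateDir"],
    },
  },
];

// Returns the required argument names that a call did not supply.
function missingRequiredArgs(tool: McpToolDescriptor, args: Record<string, unknown>): string[] {
  return tool.inputSchema.required.filter((key) => args[key] === undefined);
}
```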
--json-logs is accepted for compatibility, but the help text says it is
reserved. Iris still emits mostly human-oriented progress lines to stderr.
Needed before this is real:
- Event schema for phase start/end, model calls, tool calls, costs, artifacts, warnings, and terminal verdicts.
- One JSON object per stderr line.
- Tests that forbid plain text when --json-logs is enabled.
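The three items above could look roughly like this. The event union is a hypothetical subset (model calls, costs, and artifacts are omitted for brevity), and IrisLogEvent, toJsonLine, and findNonJsonLines are illustrative names, not existing Iris code.

```typescript
// Hypothetical event shape; Iris does not define this schema today.
type IrisLogEvent =
  | { type: "phase_start"; phase: string; ts: string }
  | { type: "phase_end"; phase: string; ts: string; durationMs: number }
  | { type: "tool_call"; tool: string; ts: string }
  | { type: "warning"; message: string; ts: string }
  | { type: "verdict"; verdict: string; ts: string };

// Serialize one event as exactly one line. JSON.stringify escapes newlines
// inside string values, so the result never spans multiple lines.
function toJsonLine(event: IrisLogEvent): string {
  return JSON.stringify(event);
}

// Returns the 1-based numbers of stderr lines that are not a JSON object.
// A test for --json-logs could assert that this list is empty.
function findNonJsonLines(stderrText: string): number[] {
  const bad: number[] = [];
  const lines = stderrText.split("\n");
  lines.forEach((line, index) => {
    // Ignore the empty string produced by a trailing newline.
    if (line === "" && index === lines.length - 1) return;
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
        bad.push(index + 1);
      }
    } catch {
      bad.push(index + 1);
    }
  });
  return bad;
}
```

Checking captured stderr line by line, rather than checking individual emit calls, also catches stray output from dependencies or child processes.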
The web adapter can import/export Playwright storage state internally, and web
--agents N is available for Claude Code and Codex App Server when no setup
state must be shared. --share-auth still only supports the Claude Code
bootstrap flow. There is not yet a simple user-facing --auth storageState.json
flag.
Needed before this is real:
- CLI option and validation.
- Report metadata showing auth state was supplied without leaking secrets.
- Clear interaction with --setup, --share-auth, replay, Codex App Server multi-agent sessions, and redaction.
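One way to approach the report-metadata item is to record only the shape of the supplied state, never its values. The sketch below assumes the standard Playwright storage-state layout (cookies plus per-origin localStorage); AuthStateSummary and summarizeAuthState are hypothetical names, and the chosen fields are a suggestion rather than an Iris report schema.

```typescript
// Subset of the Playwright storage-state file layout that this sketch reads.
interface StorageStateFile {
  cookies: Array<{ name: string; value: string; domain: string }>;
  origins: Array<{
    origin: string;
    localStorage: Array<{ name: string; value: string }>;
  }>;
}

// Hypothetical report metadata: counts and hosts only, no names or values.
interface AuthStateSummary {
  supplied: true;
  cookieCount: number;
  cookieDomains: string[];
  origins: string[];
  localStorageKeyCount: number;
}

function summarizeAuthState(state: StorageStateFile): AuthStateSummary {
  // De-duplicate cookie domains without copying any cookie name or value.
  const cookieDomains = state.cookies
    .map((cookie) => cookie.domain)
    .filter((domain, index, all) => all.indexOf(domain) === index)
    .sort();
  return {
    supplied: true,
    cookieCount: state.cookies.length,
    cookieDomains,
    origins: state.origins.map((entry) => entry.origin).sort(),
    localStorageKeyCount: state.origins.reduce(
      (sum, entry) => sum + entry.localStorage.length,
      0,
    ),
  };
}
```

Cookie and localStorage key names are left out as well, since names alone can reveal which identity provider or session mechanism is in use.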
CLI targets are intentionally blocked for --agents >1. Worktrees isolate git
state, not process/filesystem state inside a product under test. Parallel CLI
sessions need an explicit isolation contract.
Needed before this is real:
- Per-session cwd/state copy or fixture snapshot.
- Environment variable and temp-file policy.
- Merge/cleanup behavior for evidence artifacts.
- Tests proving sessions cannot race the same state file.
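A possible starting point for the isolation contract is a per-session plan that gives each session its own working directory, home, and temp directory, plus a check that no path is shared. Everything here is an assumption for illustration: the directory layout, the choice of HOME and TMPDIR, and the names SessionIsolation, planSessionIsolation, and findSharedPaths. Copying the fixture into each cwd is not shown.

```typescript
// Hypothetical isolation plan; Iris has no such contract today.
interface SessionIsolation {
  sessionId: string;
  cwd: string;
  env: Record<string, string>;
}

// Derive disjoint directories for each session under one run root.
function planSessionIsolation(runRoot: string, sessionCount: number): SessionIsolation[] {
  const plans: SessionIsolation[] = [];
  for (let i = 1; i <= sessionCount; i++) {
    const sessionId = `session-${i}`;
    const base = `${runRoot}/${sessionId}`;
    plans.push({
      sessionId,
      cwd: `${base}/cwd`, // a fixture snapshot would be copied here
      env: { HOME: `${base}/home`, TMPDIR: `${base}/tmp` },
    });
  }
  return plans;
}

// Returns every path that appears in more than one session's plan.
function findSharedPaths(plans: SessionIsolation[]): string[] {
  const owners: Record<string, string[]> = {};
  for (const plan of plans) {
    const paths = [plan.cwd].concat(Object.keys(plan.env).map((key) => plan.env[key] as string));
    for (const path of paths) {
      const list = owners[path] ?? [];
      if (list.indexOf(plan.sessionId) === -1) list.push(plan.sessionId);
      owners[path] = list;
    }
  }
  return Object.keys(owners).filter((path) => (owners[path] ?? []).length > 1);
}
```

A plan-level check like findSharedPaths only proves the configuration is disjoint. It does not stop a product under test from writing to a fixed absolute path, so race tests against a real state file would still be needed.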
The CLI adapter currently runs commands and captures stdout/stderr/exit code/filesystem evidence. It is not a PTY/TUI automation layer.
Potential future work:
- Interactive stdin conversations.
- Signals and long-running processes.
- Terminal UI screen-buffer observations.
- Command timeout/retry policy tuned for daemons and REPLs.
Iris supports two web multi-session primitives: --agents N for sharding
planned product goals and --perspectives for corroborating the same goals
through existing Explorer modes. Reports include an agent trajectory audit, so
stuck or duplicate sessions are visible.
Coverage planning now preserves Discovery's product-native scope before sharding. Runtime safety caps are diagnostics; they do not delete goals from the product denominator.
Normal web multi-agent runs can still propose bounded bonus goals after assigned learned goals are terminal. Proposed goals use non-overlapping ids and are reported as expansion/follow-up coverage. Perspective/corroboration sessions do not receive expansion budget.
What is still not implemented is a richer pre-Explorer coordinator that prioritizes or sequences very large product maps by release risk. The current planner preserves learned goals, performs semantic artifact-editor dedupe, and assigns the resulting work to agents.
Needed before this is real:
- Coordinator that derives richer disjoint coverage charters from Discovery capability gaps before Explorer starts, not from persona labels.
- Merge/reconcile policy for conflicting new findings and goal outcomes.
- Cost/time reporting by coverage charter.
pnpm bench exists, but scheduled nightly/release automation is not in this
repo. Bench runs still use real provider calls and can vary.
Needed before this is real:
- CI or scheduled runner integration.
- Provider credential policy.
- Stochastic tolerance and artifact retention policy.
- Regression dashboard or summary artifact.
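Because bench runs use real provider calls and vary, a regression check needs some tolerance band. The following is a minimal sketch under assumed inputs: BenchSummary and isRegression are hypothetical, and a fixed pass-rate tolerance is only one possible policy (a statistical test over repeated runs would be another).

```typescript
// Hypothetical bench summary; pnpm bench does not necessarily emit this shape.
interface BenchSummary {
  passed: number;
  total: number;
}

// Flags a regression only when the pass rate drops by more than `tolerance`
// (a fraction between 0 and 1), so small run-to-run variation is ignored.
function isRegression(
  baseline: BenchSummary,
  current: BenchSummary,
  tolerance: number,
): boolean {
  if (baseline.total === 0 || current.total === 0) {
    throw new Error("bench summary has no runs");
  }
  const baselineRate = baseline.passed / baseline.total;
  const currentRate = current.passed / current.total;
  return baselineRate - currentRate > tolerance;
}
```

For example, with a tolerance of 0.1, a drop from 18/20 to 17/20 is treated as noise, while a drop to 14/20 is flagged.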
agent-feedback.json gives build agents a better feedback packet, but Iris does
not directly drive Otto's queue, retry, merge, or task protocol. That remains an
integration layer outside Iris.
Needed before this is real:
- Agreed Otto input/output contract.
- Token-efficient feedback format if agent-feedback.json is too verbose.
- Retry policy using Iris uncertainty without turning Iris into a hard gate.
- Pixel-perfect visual regression as the main product. Iris can use visual evidence, but it is a product-behavior evaluator, not a screenshot diff tool.
- Screen-reader automation. Accessibility probes exist for web, but Iris does not operate products through a real screen reader.
- Complex credential/external-state flows as a baseline guarantee. They can be tested when setup/auth are provided, but they are not assumed to work automatically.