Skip to content

docs: plan audio/video context support#669

Open
nabinchha wants to merge 6 commits into
mainfrom
nmulepati/feat-support-audio-video-context
Open

docs: plan audio/video context support#669
nabinchha wants to merge 6 commits into
mainfrom
nmulepati/feat-support-audio-video-context

Conversation

@nabinchha
Copy link
Copy Markdown
Contributor

📋 Summary

Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.

🔗 Related Issue

Closes #668

🔄 Changes

  • Adds plans/668/audio-video-context.md with the proposed AudioContext / VideoContext API shape.
  • Clarifies that audio/video context values are URL or base64 only, with local path handling and provider file metadata out of scope for v1.
  • Documents provider translation boundaries, unsupported-provider behavior, implementation phases, and test strategy.

🧪 Testing

  • make test passes (not run; markdown-only planning change)
  • Unit tests added/updated (N/A; no testable logic)
  • E2E tests added/updated (N/A; no testable logic)
  • Commit hooks passed for markdown-safe checks

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (plan added under plans/668/)

Closes #668

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 15, 2026 21:24
@github-actions
Copy link
Copy Markdown
Contributor

Review: PR #669docs: plan audio/video context support

Summary

A plan-only PR that adds plans/668/audio-video-context.md (275 lines, no code). It proposes a declarative AudioContext / VideoContext API mirroring the existing ImageContext pattern, a canonical media-block schema as the boundary between config and adapters, provider translation rules, and a 3-PR rollout. The scope is constrained to media-as-input-context — no generation, no heavy deps in config import paths, no local-path handling for audio/video in v1.

Findings

Strengths

  • Architectural alignment is solid. The plan explicitly preserves the interface → engine → config import direction and the "declare, don't orchestrate" contract from AGENTS.md. The canonical media-block schema lives between config (descriptive) and adapters (provider-specific), which keeps OpenAI/Anthropic shapes from leaking back into config — consistent with the existing principle that "errors normalize at boundaries."
  • Conservative non-goals. Excluding ffmpeg/moviepy from config imports, deferring frame-extraction fallbacks, deferring the Responses API migration, and deferring provider file-upload lifecycle are all the right calls for v1 and prevent scope creep.
  • Clear adapter responsibility split. Image stays unchanged on Anthropic; audio/video raise canonical unsupported errors before transport. OpenAI-compatible owns image_url / input_audio translation. This is the right place for capability gating.
  • Good rollout atomicity. Three PRs (config foundation → engine/adapters → docs) match the package layering and let the config layer ship and be tested independently.
  • Risks section is honest about video provider variance, base64 size blowup, and the silent-download trap for remote URLs.

Gaps and ambiguities

  1. Discriminated-union migration of ImageContext is under-specified. Verified locally: ImageContext.modality is currently Modality = Modality.IMAGE (not Literal). Step 1 says to "change concrete context modality fields to Literal[...] values." That's a schema change on an existing public type. The plan should call out:

    • whether previously serialized configs (which may omit modality and rely on the default) still round-trip — Pydantic discriminated unions typically require the discriminator key to be present;
    • whether the column-config field type change from list[ImageContext] | None to list[MultiModalContextT] | None is a breaking type-narrowing for downstream type-checkers or plugin authors who annotated against ImageContext.

    A one-line backward-compat note in the Design Decisions table (alongside the existing "legacy image_url" entry) would close this.

  2. data_type=None auto-detection behavior for audio/video is unspecified. ImageContext auto-detects URL vs base64 vs local path when data_type is omitted. The plan says audio/video are URL-or-base64 only, but doesn't say whether data_type is required for the new contexts or whether auto-detection is supported (URL vs base64). Either choice is fine, but the plan should pick one — otherwise implementers will diverge. Test Plan mentions "explicit base64 requires an explicit format" but not the data_type=None case.

  3. Canonical error type isn't named. Step 5 and the Test Plan reference "canonical unsupported-capability provider error" but don't say whether this is an existing class in clients/errors.py or a new one (e.g., UnsupportedModalityError). Naming it in the plan avoids ad-hoc choices in PR 2.

  4. Missing: schema-export impact. DataDesigner exports JSON schema for configs (used by docs/UI). Switching multi_modal_context to a discriminated union changes the emitted schema shape. Worth a one-liner in the Test Plan ("schema export snapshot updated and reviewed") or Risks.

  5. Step 2's "extract shared helpers" lacks a deprecation policy. The plan says to move reusable normalization out of image_helpers.py into media_helpers.py, but doesn't say whether the old image-helper public surface is preserved, deprecated, or re-exported. If anything imports from image_helpers.py outside config (or in plugins), this matters.

  6. MIME / format mapping is only sketched. The Canonical Block Shape gives audio/mpeg for mp3 and video/mp4 for mp4, but Step 2 just says "Add MIME helpers for audio/video formats." A small explicit table (format enum → media_type → OpenAI format value) would prevent inconsistency between Step 2 and Step 5.

  7. Open question worth surfacing: list-valued media columns. The plan supports list/JSON-list values (consistent with ImageContext), but for audio/video this can mean N base64 blobs in one user message. Provider per-message size limits and the order of multi-block flattening should be stated, even if just as "blocks are emitted in list order, no dedup."

Minor

  • The Mermaid diagram is correct but the engine-side label _build_multi_modal_context lives on ColumnGeneratorWithModel — fine as-is; the rendering will be tight on narrow viewers.
  • "OpenAI-compatible adapter forwards image_url blocks unchanged" in Current State is accurate; it might be worth noting in Step 5 that NIM and other OpenAI-compatible providers may not all accept input_audio, so the audio path may need a per-endpoint capability flag rather than a flat "OpenAI-compatible supports audio."
  • Provider links use stable docs URLs — good. Anthropic's vision URL pinned to platform.claude.com is the current canonical host.

Verdict

Approve with revisions. The plan is well-scoped, architecturally aligned, and rolls out cleanly across the three packages. Before PR 1 lands, the plan should be tightened on: (a) the ImageContext modality field migration and config backward compatibility, (b) data_type=None auto-detection policy for the new contexts, (c) the named canonical error class, and (d) explicit MIME / format mapping. None of these block the design itself; they're loose ends that will otherwise be re-litigated in the implementation PRs.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds plans/668/audio-video-context.md, a design-only document proposing AudioContext and VideoContext as first-class multimodal context types for LLMTextColumnConfig, following the existing ImageContext pattern. Previous review feedback has been fully incorporated: ImageColumnConfig.multi_modal_context stays image-only, canonical audio blocks no longer carry an adapter-specific format field, and the ImageContext.get_contexts() migration from image_url to canonical image blocks is explicitly resolved.

  • Introduces a discriminated-union MultiModalContextT for text-generation columns, with a legacy pre-validator to inject modality=\"image\" for existing serialized configs that omit the new field.
  • Defines a canonical block schema that keeps all provider-specific payload construction inside adapters; the Anthropic adapter rejects audio/video with ProviderError.unsupported_capability, and the OpenAI-compatible adapter derives input_audio.format from media_type during translation.
  • Scopes v1 to URL and base64 sources only, explicitly deferring local-path resolution, provider file uploads, ffmpeg, and video frame extraction to future issues.

Confidence Score: 5/5

Safe to merge — this is a markdown-only design document with no executable code changes.

The plan is well-structured and all three issues raised in previous review rounds are cleanly resolved. The one remaining gap around data URI vs explicit format conflict is a design detail worth clarifying before the implementation PR lands, but it does not block this planning document.

No files require special attention; the single changed file is a planning document.

Important Files Changed

Filename Overview
plans/668/audio-video-context.md New design plan for AudioContext/VideoContext; internally consistent with a minor ambiguity in data_type=None auto-detection when a base64 data URI is supplied alongside an explicit format enum.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    user["User Config\nAudioContext / VideoContext / ImageContext\ncolumn_name + data_type + format"]
    user --> ctx["MultiModalContextT union\nLLMTextColumnConfig.multi_modal_context\nDiscriminated by modality field"]
    ctx --> build["ColumnGeneratorWithModel\n_build_multi_modal_context\nPasses base_path for images only"]
    build --> canonical["Canonical Block Dict\ntype: image/audio/video\nsource.type: url/base64\nsource.media_type for base64"]
    canonical --> prompt["prompt_to_messages\nMedia blocks placed before text prompt"]
    prompt --> facade["ModelFacade\nChatCompletionRequest"]
    facade --> oai["OpenAICompatibleAdapter\nimage→image_url\naudio base64→input_audio\nvideo→provider parts if supported\nCapability gate before transport"]
    facade --> ant["AnthropicAdapter\nimage→Claude image block\naudio/video→unsupported_capability error"]
    oai --> api["Provider API"]
    ant --> api
    style ant fill:#f9a,stroke:#c00
    style oai fill:#aef,stroke:#06c
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
plans/668/audio-video-context.md:93
**Conflict between data-URI inferred `media_type` and explicit `audio_format`/`video_format` is unspecified**

The design decision says base64 data URIs are "base64 sources with inferred media type," and Step 1 repeats this. But the `AudioContext`/`VideoContext` constructors also accept `audio_format`/`video_format`. Neither location specifies what happens when both are supplied and they disagree (e.g. `data:audio/wav;base64,...` with `audio_format=AudioFormat.MP3`). The implementation could silently emit the wrong `media_type` in the canonical block — which would cause the OpenAI adapter to derive the wrong `input_audio.format`. The test plan covers the "plain base64 requires explicit format" path but has no case for this conflict. The plan should say which wins (data URI takes precedence, explicit format overrides, or a validation error if they disagree).

Reviews (6): Last reviewed commit: "docs: resolve image context canonical mi..." | Re-trigger Greptile

Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md
- OpenAI-compatible:
- Translate canonical `image` blocks to `image_url`.
- Translate canonical base64 `audio` blocks to `input_audio`.
- Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.

Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md Outdated
nabinchha added 3 commits May 18, 2026 15:17
Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout.

Refs #668
@nabinchha nabinchha requested a review from andreatgretel May 18, 2026 21:29

### 4. Keep engine prompt plumbing generic

Files:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ImageContext.get_contexts() migration is unresolved

Implementation step 1 says "Preserve ImageContext behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical {"type": "image", "source": {...}} blocks — not the {"type": "image_url", ...} dicts that ImageContext.get_contexts() currently emits. Step 5 then lists "Translate canonical image blocks to image_url" as an explicit adapter responsibility.

If ImageContext.get_contexts() is not updated to emit canonical blocks, the OpenAI adapter's image-block translation path is dead code and the existing image-context flow silently breaks. If it is updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the image_url dict format will need updates — contradicting "existing image-context tests remain green."

The plan needs to resolve this explicitly: either commit to migrating ImageContext.get_contexts() to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep ImageContext emitting image_url blocks and document that adapters will dual-dispatch on both image_url and canonical image block shapes.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/668/audio-video-context.md
Line: 218

Comment:
**`ImageContext.get_contexts()` migration is unresolved**

Implementation step 1 says "Preserve `ImageContext` behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical `{"type": "image", "source": {...}}` blocks — not the `{"type": "image_url", ...}` dicts that `ImageContext.get_contexts()` currently emits. Step 5 then lists "Translate canonical `image` blocks to `image_url`" as an explicit adapter responsibility.

If `ImageContext.get_contexts()` is **not** updated to emit canonical blocks, the OpenAI adapter's `image`-block translation path is dead code and the existing image-context flow silently breaks. If it **is** updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the `image_url` dict format will need updates — contradicting "existing image-context tests remain green."

The plan needs to resolve this explicitly: either commit to migrating `ImageContext.get_contexts()` to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep `ImageContext` emitting `image_url` blocks and document that adapters will dual-dispatch on both `image_url` and canonical `image` block shapes.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 57f92eb: the plan now explicitly chooses the canonical migration path for ImageContext.get_contexts(). ImageContext keeps its public API/user-visible behavior, but get_contexts() emits canonical image blocks; adapters keep accepting legacy image_url blocks during the transition, and tests that inspect pre-adapter blocks should be updated to expect the canonical shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Plan audio and video context support

2 participants