docs: plan audio/video context support#669
Conversation
Closes #668 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Review: PR #669 —
|
Greptile SummaryThis PR adds
|
| Filename | Overview |
|---|---|
| plans/668/audio-video-context.md | New design plan for AudioContext/VideoContext; internally consistent with a minor ambiguity in data_type=None auto-detection when a base64 data URI is supplied alongside an explicit format enum. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
user["User Config\nAudioContext / VideoContext / ImageContext\ncolumn_name + data_type + format"]
user --> ctx["MultiModalContextT union\nLLMTextColumnConfig.multi_modal_context\nDiscriminated by modality field"]
ctx --> build["ColumnGeneratorWithModel\n_build_multi_modal_context\nPasses base_path for images only"]
build --> canonical["Canonical Block Dict\ntype: image/audio/video\nsource.type: url/base64\nsource.media_type for base64"]
canonical --> prompt["prompt_to_messages\nMedia blocks placed before text prompt"]
prompt --> facade["ModelFacade\nChatCompletionRequest"]
facade --> oai["OpenAICompatibleAdapter\nimage→image_url\naudio base64→input_audio\nvideo→provider parts if supported\nCapability gate before transport"]
facade --> ant["AnthropicAdapter\nimage→Claude image block\naudio/video→unsupported_capability error"]
oai --> api["Provider API"]
ant --> api
style ant fill:#f9a,stroke:#c00
style oai fill:#aef,stroke:#06c
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
plans/668/audio-video-context.md:93
**Conflict between data-URI inferred `media_type` and explicit `audio_format`/`video_format` is unspecified**
The design decision says base64 data URIs are "base64 sources with inferred media type," and Step 1 repeats this. But the `AudioContext`/`VideoContext` constructors also accept `audio_format`/`video_format`. Neither location specifies what happens when both are supplied and they disagree (e.g. `data:audio/wav;base64,...` with `audio_format=AudioFormat.MP3`). The implementation could silently emit the wrong `media_type` in the canonical block — which would cause the OpenAI adapter to derive the wrong `input_audio.format`. The test plan covers the "plain base64 requires explicit format" path but has no case for this conflict. The plan should say which wins (data URI takes precedence, explicit format overrides, or a validation error if they disagree).
Reviews (6): Last reviewed commit: "docs: resolve image context canonical mi..." | Re-trigger Greptile
| - OpenAI-compatible: | ||
| - Translate canonical `image` blocks to `image_url`. | ||
| - Translate canonical base64 `audio` blocks to `input_audio`. | ||
| - Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them. |
There was a problem hiding this comment.
I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.
There was a problem hiding this comment.
Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.
Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout. Refs #668
|
|
||
| ### 4. Keep engine prompt plumbing generic | ||
|
|
||
| Files: |
There was a problem hiding this comment.
ImageContext.get_contexts() migration is unresolved
Implementation step 1 says "Preserve ImageContext behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical {"type": "image", "source": {...}} blocks — not the {"type": "image_url", ...} dicts that ImageContext.get_contexts() currently emits. Step 5 then lists "Translate canonical image blocks to image_url" as an explicit adapter responsibility.
If ImageContext.get_contexts() is not updated to emit canonical blocks, the OpenAI adapter's image-block translation path is dead code and the existing image-context flow silently breaks. If it is updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the image_url dict format will need updates — contradicting "existing image-context tests remain green."
The plan needs to resolve this explicitly: either commit to migrating ImageContext.get_contexts() to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep ImageContext emitting image_url blocks and document that adapters will dual-dispatch on both image_url and canonical image block shapes.
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/668/audio-video-context.md
Line: 218
Comment:
**`ImageContext.get_contexts()` migration is unresolved**
Implementation step 1 says "Preserve `ImageContext` behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical `{"type": "image", "source": {...}}` blocks — not the `{"type": "image_url", ...}` dicts that `ImageContext.get_contexts()` currently emits. Step 5 then lists "Translate canonical `image` blocks to `image_url`" as an explicit adapter responsibility.
If `ImageContext.get_contexts()` is **not** updated to emit canonical blocks, the OpenAI adapter's `image`-block translation path is dead code and the existing image-context flow silently breaks. If it **is** updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the `image_url` dict format will need updates — contradicting "existing image-context tests remain green."
The plan needs to resolve this explicitly: either commit to migrating `ImageContext.get_contexts()` to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep `ImageContext` emitting `image_url` blocks and document that adapters will dual-dispatch on both `image_url` and canonical `image` block shapes.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Addressed in 57f92eb: the plan now explicitly chooses the canonical migration path for ImageContext.get_contexts(). ImageContext keeps its public API/user-visible behavior, but get_contexts() emits canonical image blocks; adapters keep accepting legacy image_url blocks during the transition, and tests that inspect pre-adapter blocks should be updated to expect the canonical shape.
📋 Summary
Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.
🔗 Related Issue
Closes #668
🔄 Changes
plans/668/audio-video-context.mdwith the proposedAudioContext/VideoContextAPI shape.🧪 Testing
make testpasses (not run; markdown-only planning change)✅ Checklist
plans/668/)