Add media file support for LLM prompter #580

Open
Copilot wants to merge 8 commits into master from copilot/add-media-support-to-llm-prompter
Conversation

Contributor

Copilot AI commented Mar 9, 2026

Extends llm-prompter to work with parent datasets that are media archives (zip files from image downloaders or media imports), not just text-based CSV/NDJSON datasets.

common/lib/llm.py

  • create_multimodal_content() now accepts media_files (local paths, base64-encoded) alongside existing media_urls
  • _format_media_block() — new helper for provider-specific content blocks:
    • Anthropic: image blocks for images, document blocks for video/audio
    • OpenAI: data URIs for images/video, input_audio format for audio
    • Google/others: data URI with image_url wrapper
  • generate_text() gains media_files parameter to pass local file paths
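
A rough sketch of the provider dispatch these bullets describe. The block shapes follow the publicly documented Anthropic and OpenAI message formats, but the function name and exact fields here are illustrative assumptions, not the actual 4CAT code:

```python
def format_media_block(provider: str, mime_type: str, b64_data: str) -> dict:
    # Illustrative sketch of _format_media_block-style dispatch; field names
    # are assumptions based on the provider docs, not the real implementation.
    media_category = mime_type.split("/")[0]  # "image", "video" or "audio"
    if provider == "anthropic":
        # Anthropic: image blocks for images, document blocks for other media
        block_type = "image" if media_category == "image" else "document"
        return {"type": block_type,
                "source": {"type": "base64", "media_type": mime_type, "data": b64_data}}
    if provider == "openai" and media_category == "audio":
        # OpenAI: audio uses the dedicated input_audio shape
        return {"type": "input_audio",
                "input_audio": {"data": b64_data, "format": mime_type.split("/")[1]}}
    # OpenAI images/video and Google/others: data URI in an image_url wrapper
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{b64_data}"}}
```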

processors/machine_learning/llm_prompter.py

  • is_compatible_with() — accepts zip datasets with media_type in (image, video, audio)
  • get_options() — when parent is a media archive:
    • Shows media info panel instead of column bracket instructions
    • Hides text-only options (column selection, batching, truncation, media URL toggle)
  • process() — new media archive code path: iterates zip contents, skips metadata files, base64-encodes each media file, sends to LLM via media_files param. Catches model incompatibility errors (e.g. non-vision model receiving images) with clear user-facing messages.
  • validate_query() — relaxes column bracket requirement for media archives; allows empty user prompt when system prompt is provided
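
The media-archive code path in process() can be sketched roughly as follows; the helper name, the staging-directory iteration, and the yielded dict shape are illustrative assumptions, not the processor's actual code:

```python
import base64
from pathlib import Path

def iter_media_prompts(staging_dir: Path, prompt: str):
    """Illustrative sketch of the archive iteration: skip metadata files,
    base64-encode each media file, and yield arguments for the LLM call."""
    for file_path in sorted(staging_dir.iterdir()):
        if file_path.name.startswith(".metadata"):
            continue  # archive bookkeeping, not a media item
        b64_data = base64.b64encode(file_path.read_bytes()).decode("utf-8")
        yield {"prompt": prompt,
               "media_files": [{"name": file_path.name, "data": b64_data}]}
```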

All existing text-based processing behavior is preserved in the else branch. All models and custom model IDs remain available — incompatibility is caught at generation time rather than upfront.

Original prompt

This section details the original issue you should resolve

<issue_title>Add media support for LLM prompter</issue_title>
<issue_description>Update llm-prompter.py so that it can work with parent datasets that are media files, e.g. as part of zip datasets generated through the download-images.py processor or as a media import.

  • This mostly works by checking the type of the parent dataset and changing the LLM prompter options and functionality accordingly.
  • If the parent dataset is a media file (e.g. png) or a zip with media items, llm-prompter.py is still available but option fields for text-based datasets (e.g. column selection) should not be shown. For instance, information on the column brackets is redundant here, since no 'column' data is inserted. Instead, an info UserInput field clearly indicates that media files are attached to the prompt. Also show other crucial information if needed.
  • llm.py should be updated accordingly to support video, image, and audio prompts.
  • Clear evaluation and alerts, both in the back-end and front-end (validate_query), should tell the user if there's any issues with the dataset media files.
  • In principle, the flexible setup of llm-prompter.py should be retained. All current models as well as custom model IDs should be allowed, and it should be caught whether the model can't process this data type on generating results after the fact (e.g. some don't support vision/audio) (unless LangChain has an easy way of checking this, then it can be called in validate_query).</issue_description>

Comments on the Issue (you are @copilot in this section)


Copilot AI and others added 2 commits March 9, 2026 17:23
…ideo/audio

Co-authored-by: sal-uva <10960315+sal-uva@users.noreply.github.com>
…defaults, and error detection comments

Co-authored-by: sal-uva <10960315+sal-uva@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add media support for LLM prompter" to "Add media file support for LLM prompter" on Mar 9, 2026
@sal-uva
Member

sal-uva commented Mar 9, 2026

This should be mergeable. Not all vendors support video and audio, but that's acceptable in my opinion.

LLM prompter is now quite bulky and should be refactored, but maybe that's for another day.

@sal-uva sal-uva marked this pull request as ready for review March 11, 2026 10:29
Copilot AI review requested due to automatic review settings March 11, 2026 10:29
Copilot AI left a comment

Pull request overview

This PR extends the llm-prompter processor to support media-archive parent datasets (ZIPs containing image/video/audio), enabling multimodal prompting using locally extracted media files in addition to existing URL-based media inputs.

Changes:

  • Add ZIP media-archive compatibility and a dedicated processing path in LLMPrompter (options/UI + iteration + annotation mapping).
  • Extend LLMAdapter multimodal support to accept local media file paths (base64-encoded) alongside media URLs, with provider-specific formatting.
  • Minor UI/UX tweaks for annotation rendering and link wrapping, plus marking AudioExtractor ZIP outputs as media_type="audio".

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Summary per file:

  • common/lib/llm.py — Adds local media_files support for multimodal prompts and provider-specific content block formatting.
  • processors/machine_learning/llm_prompter.py — Adds media-archive dataset compatibility, media-specific options, and a ZIP iteration + LLM prompting path.
  • processors/audio/audio_extractor.py — Marks resulting ZIP datasets as audio media type.
  • common/assets/llms.json — Updates/renames several predefined model IDs and model card links.
  • webtool/templates/explorer/item-annotations.html — Uses item_id consistently in DOM ids/classes and avoids variable shadowing.
  • webtool/static/css/explorer-annotation-generic.css — Improves wrapping behavior for long annotation label/link text.


self.dataset.log(f"Could not load .metadata.json for annotation mapping: {e}. "
f"Annotations will use filenames as item IDs.")

for item in self.source_dataset.iterate_items(staging_area=staging_area, immediately_delete=False):
Copilot AI Mar 11, 2026

iterate_items(... immediately_delete=False) extracts all archive members into the staging area and keeps them until cleanup. For large media archives this can cause significant temporary disk usage even though each file is only needed for the current iteration. Consider using the default immediately_delete=True (or explicitly deleting item.file after the LLM call).

Suggested change
for item in self.source_dataset.iterate_items(staging_area=staging_area, immediately_delete=False):
for item in self.source_dataset.iterate_items(staging_area=staging_area):

self.dataset.update_status(f"Generating text at row {row:,}/"
f"{max_processed:,} with {model}{batch_str}")
# Now finally generate some text!
self.dataset.update_status(f"Processing {media_archive_type} file {row - 1:,}/{max_processed:,} "
Copilot AI Mar 11, 2026

The status message uses {row - 1} even though row is already 1-based and incremented at the top of the loop. This will display 0/... for the first processed file and stay off-by-one thereafter. Use row (or better, the processed counter i) consistently.

Suggested change
self.dataset.update_status(f"Processing {media_archive_type} file {row - 1:,}/{max_processed:,} "
self.dataset.update_status(f"Processing {media_archive_type} file {row:,}/{max_processed:,} "

@@ -128,14 +130,16 @@ def generate_text(
system_prompt: Optional[str] = None,
temperature: float = 0.1,
files: Optional[List[Union[str, Path, dict]]] = None,
Copilot AI Mar 11, 2026

files is typed as List[Union[str, Path, dict]], but it’s passed into create_multimodal_content(media_urls=...) which now enforces str URLs only. This mismatch can mislead callers and will raise at runtime for Path/dict entries. Align the type hints/docs with the actual accepted type (e.g., List[str] for URLs) or expand create_multimodal_content to support the advertised unions.

Suggested change
files: Optional[List[Union[str, Path, dict]]] = None,
files: Optional[List[str]] = None,

Comment on lines +199 to +201
mime_type = mimetypes.guess_type(url)[0] or "application/octet-stream"
media_category = mime_type.split("/")[0] # "image", "video", or "audio"
content.append(self._format_media_block(url=url, mime_type=mime_type, media_category=media_category))
Copilot AI Mar 11, 2026

mimetypes.guess_type(url) often returns None for URLs with query strings (e.g. ...jpg?size=...). That falls back to application/octet-stream, which makes media_category become application and will format Anthropic blocks as document instead of image. Consider deriving the MIME type from the URL path component (strip query/fragment) or default unknown URL types to image/* when use_media is intended for images.
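
One way to make the guess robust is to parse out the URL's path component before consulting mimetypes; the helper name and default here are mine, shown as a sketch rather than a suggested patch:

```python
import mimetypes
from urllib.parse import urlparse

def guess_media_mime(url: str, default: str = "application/octet-stream") -> str:
    # Guess from the path component only, so query strings and fragments
    # (e.g. "...photo.jpg?size=large") don't defeat the extension lookup.
    path = urlparse(url).path
    return mimetypes.guess_type(path)[0] or default
```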

Comment on lines +213 to +214
with file_path.open("rb") as f:
b64_data = base64.b64encode(f.read()).decode("utf-8")
Copilot AI Mar 11, 2026

f.read() loads the entire media file into memory before base64 encoding. For video/audio archives this can easily be tens/hundreds of MB per item and may OOM the worker. Consider enforcing a maximum file size (with a clear error), downsampling/transcoding upstream, or at least logging and skipping files above a safe threshold.
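
A minimal sketch of the size-cap option, assuming a hypothetical 20 MB threshold (real limits would depend on the provider's request-size caps):

```python
import base64
from pathlib import Path

MAX_MEDIA_BYTES = 20 * 1024 * 1024  # illustrative cap, not a real provider limit

def encode_media_file(file_path: Path) -> str:
    """Refuse to base64-encode files above a safety threshold instead of
    loading arbitrarily large media into memory."""
    size = file_path.stat().st_size
    if size > MAX_MEDIA_BYTES:
        raise ValueError(
            f"{file_path.name} is {size:,} bytes, above the {MAX_MEDIA_BYTES:,}-byte limit")
    return base64.b64encode(file_path.read_bytes()).decode("utf-8")
```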

Comment on lines +13 to 16
"gpt-5.4": {
"name": "[OpenAI] GPT-5.4",
"model_card": "https://platform.openai.com/docs/models/gpt-5.4",
"provider": "openai"
Copilot AI Mar 11, 2026

Renaming model IDs (e.g. gpt-5.2 → gpt-5.4, gpt-5.2-pro → gpt-5.4-pro, gemini-3.1-flash-lite → ...-preview) is a breaking change for any saved processor configs/datasets referencing the old IDs (they'll no longer resolve to a provider in LLMAdapter.get_models). Consider keeping the old IDs as aliases (deprecated) or adding a lightweight migration path.
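
The alias idea could be as small as a lookup table consulted before provider resolution; the table contents and function name below are hypothetical:

```python
# Hypothetical alias table: retired model IDs resolve to their replacements
# so saved processor configs keep working after a rename.
MODEL_ALIASES = {
    "gpt-5.2": "gpt-5.4",
    "gpt-5.2-pro": "gpt-5.4-pro",
}

def resolve_model_id(model_id: str) -> str:
    # Unknown IDs (including custom model IDs) pass through unchanged.
    return MODEL_ALIASES.get(model_id, model_id)
```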

{% set item = from_datasets[annotation.from_dataset] %}
{% if item.type in processors %}
{% set processor_options = processors[item.type].get_options(config=__config) %}
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
Copilot AI Mar 11, 2026

processor_options is only defined when from_dataset_item.type in processors, but it’s used unconditionally later in the loop. If the dataset’s processor type is unknown/unregistered, this will raise an undefined-variable error and break rendering. Initialize processor_options to {} (or guard the later condition with from_dataset_item.type in processors) before iterating options.

Suggested change
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
{% set processor_options = {} %}

self.dataset.update_status(f"Processing {media_archive_type} files from archive")
staging_area = self.dataset.get_staging_area()
row = 0
max_processed = limit if limit else self.source_dataset.num_rows
Copilot AI Mar 11, 2026

max_processed = limit if limit else self.source_dataset.num_rows isn’t capped to the archive size (unlike the CSV/NDJSON branch which uses min(limit, num_rows)). If the user enters a limit larger than the archive, progress/status totals will be misleading. Cap max_processed to self.source_dataset.num_rows for consistent behavior.

Suggested change
max_processed = limit if limit else self.source_dataset.num_rows
max_processed = min(limit, self.source_dataset.num_rows) if limit else self.source_dataset.num_rows

batched_data = {}
n_batched = 0
# Skip 'signature' and 'type' annotations for Google
if provider == "google" and output_key in (".signature", ".type"):
Copilot AI Mar 11, 2026

The Google “signature/type” filter won’t match keys produced by flatten_dict({model: output_item}) (those keys are prefixed with the model id, e.g. <model>.extras.signature). As written, these metadata fields will likely still be saved as annotations. Consider filtering via output_key.endswith(".signature") / endswith(".type") (or strip the <model>. prefix before comparing).

Suggested change
if provider == "google" and output_key in (".signature", ".type"):
if provider == "google" and (
output_key.endswith(".signature") or output_key.endswith(".type")
):

Successfully merging this pull request may close these issues: Add media support for LLM prompter