Conversation
This should be mergeable. Not all vendors support video and audio, but that's acceptable in my opinion. The LLM prompter is now quite bulky and should be refactored, but maybe that's for another day.
Pull request overview
This PR extends the llm-prompter processor to support media-archive parent datasets (ZIPs containing image/video/audio), enabling multimodal prompting using locally extracted media files in addition to existing URL-based media inputs.
Changes:
- Add ZIP media-archive compatibility and a dedicated processing path in `LLMPrompter` (options/UI + iteration + annotation mapping).
- Extend `LLMAdapter` multimodal support to accept local media file paths (base64-encoded) alongside media URLs, with provider-specific formatting.
- Minor UI/UX tweaks for annotation rendering and link wrapping, plus marking `AudioExtractor` ZIP outputs as `media_type="audio"`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `common/lib/llm.py` | Adds local `media_files` support for multimodal prompts and provider-specific content block formatting. |
| `processors/machine_learning/llm_prompter.py` | Adds media-archive dataset compatibility, media-specific options, and a ZIP iteration + LLM prompting path. |
| `processors/audio/audio_extractor.py` | Marks resulting ZIP datasets as audio media type. |
| `common/assets/llms.json` | Updates/renames several predefined model IDs and model card links. |
| `webtool/templates/explorer/item-annotations.html` | Uses `item_id` consistently in DOM ids/classes and avoids variable shadowing. |
| `webtool/static/css/explorer-annotation-generic.css` | Improves wrapping behavior for long annotation label/link text. |
```python
self.dataset.log(f"Could not load .metadata.json for annotation mapping: {e}. "
                 f"Annotations will use filenames as item IDs.")
...
for item in self.source_dataset.iterate_items(staging_area=staging_area, immediately_delete=False):
```

`iterate_items(..., immediately_delete=False)` extracts all archive members into the staging area and keeps them until cleanup. For large media archives this can cause significant temporary disk usage even though each file is only needed for the current iteration. Consider using the default `immediately_delete=True` (or explicitly deleting `item.file` after the LLM call).

Suggested change:
```python
for item in self.source_dataset.iterate_items(staging_area=staging_area):
```
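If `immediately_delete=False` does need to stay for some other reason, the per-item disk footprint can still be bounded by unlinking each file once it has been processed. A minimal sketch of that pattern, with a hypothetical `process` callable standing in for the LLM call and throwaway files standing in for extracted archive members:

```python
import tempfile
from pathlib import Path

def iterate_and_cleanup(paths, process):
    """Process each file, then delete it so only one extracted file exists at a time."""
    results = []
    for path in paths:
        try:
            results.append(process(path))
        finally:
            path.unlink(missing_ok=True)  # free disk space before the next item
    return results

# Demo: three small files stand in for extracted archive members
tmpdir = Path(tempfile.mkdtemp())
files = []
for i in range(3):
    f = tmpdir / f"item{i}.bin"
    f.write_bytes(b"x" * 10)
    files.append(f)

sizes = iterate_and_cleanup(files, lambda p: p.stat().st_size)
print(sizes)                            # [10, 10, 10]
print(any(f.exists() for f in files))   # False: all cleaned up
```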
```python
self.dataset.update_status(f"Generating text at row {row:,}/"
                           f"{max_processed:,} with {model}{batch_str}")
# Now finally generate some text!
self.dataset.update_status(f"Processing {media_archive_type} file {row - 1:,}/{max_processed:,} "
```

The status message uses `{row - 1}` even though `row` is already 1-based and incremented at the top of the loop. This will display `0/...` for the first processed file and stay off by one thereafter. Use `row` (or better, the processed counter `i`) consistently.

Suggested change:
```python
self.dataset.update_status(f"Processing {media_archive_type} file {row:,}/{max_processed:,} "
```
```diff
@@ -128,14 +130,16 @@ def generate_text(
     system_prompt: Optional[str] = None,
     temperature: float = 0.1,
     files: Optional[List[Union[str, Path, dict]]] = None,
```

`files` is typed as `List[Union[str, Path, dict]]`, but it's passed into `create_multimodal_content(media_urls=...)`, which now enforces `str` URLs only. This mismatch can mislead callers and will raise at runtime for `Path`/`dict` entries. Align the type hints/docs with the actual accepted type (e.g. `List[str]` for URLs) or expand `create_multimodal_content` to support the advertised unions.

Suggested change:
```python
files: Optional[List[str]] = None,
```
```python
mime_type = mimetypes.guess_type(url)[0] or "application/octet-stream"
media_category = mime_type.split("/")[0]  # "image", "video", or "audio"
content.append(self._format_media_block(url=url, mime_type=mime_type, media_category=media_category))
```

`mimetypes.guess_type(url)` often returns `None` for URLs with query strings (e.g. `...jpg?size=...`). That falls back to `application/octet-stream`, which makes `media_category` become `application` and will format Anthropic blocks as `document` instead of `image`. Consider deriving the MIME type from the URL path component (strip query/fragment) or defaulting unknown URL types to `image/*` when `use_media` is intended for images.
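A sketch of deriving the MIME type from the URL's path component only, using `urllib.parse.urlparse` to drop query strings and fragments (the `image/jpeg` fallback is an arbitrary illustrative default, not a value from this codebase):

```python
import mimetypes
from urllib.parse import urlparse

def guess_media_mime(url: str, default: str = "image/jpeg") -> str:
    """Guess a MIME type from the URL path only, ignoring query and fragment."""
    path = urlparse(url).path  # ".../photo.jpg?size=large" -> ".../photo.jpg"
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type or default

print(guess_media_mime("https://cdn.example.com/photo.jpg?size=large"))  # image/jpeg
print(guess_media_mime("https://cdn.example.com/clip.mp4#t=10"))         # video/mp4
```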
```python
with file_path.open("rb") as f:
    b64_data = base64.b64encode(f.read()).decode("utf-8")
```

`f.read()` loads the entire media file into memory before base64 encoding. For video/audio archives this can easily be tens or hundreds of MB per item and may OOM the worker. Consider enforcing a maximum file size (with a clear error), downsampling/transcoding upstream, or at least logging and skipping files above a safe threshold.
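A sketch of a size guard before encoding; the 20 MB threshold and the `encode_media` helper name are illustrative choices, not taken from this codebase:

```python
import base64
import tempfile
from pathlib import Path

MAX_MEDIA_BYTES = 20 * 1024 * 1024  # arbitrary example threshold

def encode_media(file_path: Path, max_bytes: int = MAX_MEDIA_BYTES) -> str:
    """Base64-encode a media file, refusing files above max_bytes."""
    size = file_path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{file_path.name} is {size:,} bytes, "
                         f"above the {max_bytes:,}-byte limit; skipping")
    with file_path.open("rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Demo with a tiny throwaway file
tmp = Path(tempfile.mkdtemp()) / "sample.bin"
tmp.write_bytes(b"abc")
print(encode_media(tmp))  # YWJj
```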
```json
"gpt-5.4": {
    "name": "[OpenAI] GPT-5.4",
    "model_card": "https://platform.openai.com/docs/models/gpt-5.4",
    "provider": "openai"
```

Renaming model IDs (e.g. `gpt-5.2` → `gpt-5.4`, `gpt-5.2-pro` → `gpt-5.4-pro`, `gemini-3.1-flash-lite` → `...-preview`) is a breaking change for any saved processor configs/datasets referencing the old IDs (they'll no longer resolve to a provider in `LLMAdapter.get_models`). Consider keeping the old IDs as deprecated aliases or adding a lightweight migration path.
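A deprecated-alias table consulted at lookup time is one lightweight option. A sketch, where `resolve_model` and the inline model table are hypothetical stand-ins for `LLMAdapter.get_models` and `llms.json` (only the GPT rename pairs from this PR are shown):

```python
# Old predefined model IDs mapped to their renamed successors
MODEL_ALIASES = {
    "gpt-5.2": "gpt-5.4",
    "gpt-5.2-pro": "gpt-5.4-pro",
}

# Stand-in for the current contents of llms.json
MODELS = {
    "gpt-5.4": {"name": "[OpenAI] GPT-5.4", "provider": "openai"},
    "gpt-5.4-pro": {"name": "[OpenAI] GPT-5.4 Pro", "provider": "openai"},
}

def resolve_model(model_id: str) -> dict:
    """Resolve a model ID, transparently following deprecated aliases."""
    model_id = MODEL_ALIASES.get(model_id, model_id)
    try:
        return MODELS[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id}") from None

print(resolve_model("gpt-5.2")["provider"])  # openai: old ID still resolves
```

Saved datasets keep working without rewriting stored configs, and the alias table documents the rename history in one place.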
```jinja
{% set item = from_datasets[annotation.from_dataset] %}
{% if item.type in processors %}
    {% set processor_options = processors[item.type].get_options(config=__config) %}
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
```

`processor_options` is only defined when `from_dataset_item.type in processors`, but it's used unconditionally later in the loop. If the dataset's processor type is unknown/unregistered, this will raise an undefined-variable error and break rendering. Initialize `processor_options` to `{}` (or guard the later condition with `from_dataset_item.type in processors`) before iterating options.

Suggested change:
```jinja
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
{% set processor_options = {} %}
```
```python
self.dataset.update_status(f"Processing {media_archive_type} files from archive")
staging_area = self.dataset.get_staging_area()
row = 0
max_processed = limit if limit else self.source_dataset.num_rows
```

`max_processed = limit if limit else self.source_dataset.num_rows` isn't capped to the archive size (unlike the CSV/NDJSON branch, which uses `min(limit, num_rows)`). If the user enters a limit larger than the archive, progress/status totals will be misleading. Cap `max_processed` to `self.source_dataset.num_rows` for consistent behavior.

Suggested change:
```python
max_processed = min(limit, self.source_dataset.num_rows) if limit else self.source_dataset.num_rows
```
```python
batched_data = {}
n_batched = 0
# Skip 'signature' and 'type' annotations for Google
if provider == "google" and output_key in (".signature", ".type"):
```

The Google "signature/type" filter won't match keys produced by `flatten_dict({model: output_item})` (those keys are prefixed with the model ID, e.g. `<model>.extras.signature`). As written, these metadata fields will likely still be saved as annotations. Consider filtering via `output_key.endswith(".signature")` / `endswith(".type")` (or stripping the `<model>.` prefix before comparing).

Suggested change:
```python
if provider == "google" and (
    output_key.endswith(".signature") or output_key.endswith(".type")
):
```
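The mismatch is easy to demonstrate with a minimal dot-joining `flatten_dict` stand-in (the real helper's output format is assumed to match; the `gemini-x` model ID is a placeholder):

```python
def flatten_dict(d, parent_key=""):
    """Flatten nested dicts into dot-separated keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_dict(value, new_key))
        else:
            items[new_key] = value
    return items

output_item = {"text": "hi", "extras": {"signature": "abc", "type": "thought"}}
flat = flatten_dict({"gemini-x": output_item})
print(sorted(flat))
# ['gemini-x.extras.signature', 'gemini-x.extras.type', 'gemini-x.text']

# A membership test against bare ".signature"/".type" never matches the
# prefixed keys above; a suffix check does:
kept = [k for k in flat if not k.endswith((".signature", ".type"))]
print(kept)  # ['gemini-x.text']
```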
Extends `llm-prompter` to work with parent datasets that are media archives (zip files from image downloaders or media imports), not just text-based CSV/NDJSON datasets.

`common/lib/llm.py`:
- `create_multimodal_content()` now accepts `media_files` (local paths, base64-encoded) alongside the existing `media_urls`
- `_format_media_block()`: new helper for provider-specific content blocks (`image` blocks for images and `document` blocks for video/audio, the `input_audio` format for audio, the `image_url` wrapper)
- `generate_text()` gains a `media_files` parameter to pass local file paths

`processors/machine_learning/llm_prompter.py`:
- `is_compatible_with()`: accepts `zip` datasets with `media_type` in (image, video, audio)
- `get_options()`: shows media-specific options when the parent is a media archive
- `process()`: new media-archive code path that iterates zip contents, skips metadata files, base64-encodes each media file, and sends it to the LLM via the `media_files` param. Catches model incompatibility errors (e.g. a non-vision model receiving images) with clear user-facing messages.
- `validate_query()`: relaxes the column bracket requirement for media archives; allows an empty user prompt when a system prompt is provided

All existing text-based processing behavior is preserved in the `else` branch. All models and custom model IDs remain available; incompatibility is caught at generation time rather than upfront.
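Based on the description above, the provider-specific dispatch in `_format_media_block()` presumably looks roughly like the following sketch. All field names here are assumptions drawn from common provider message formats, not the actual implementation; consult each provider's current API reference before relying on them:

```python
def format_media_block(provider: str, b64_data: str, mime_type: str) -> dict:
    """Illustrative provider-specific content block for one base64-encoded media file."""
    media_category = mime_type.split("/")[0]  # "image", "video", or "audio"

    if provider == "anthropic":
        # image blocks for images, document blocks for video/audio
        block_type = "image" if media_category == "image" else "document"
        return {
            "type": block_type,
            "source": {"type": "base64", "media_type": mime_type, "data": b64_data},
        }

    if provider == "openai" and media_category == "audio":
        # input_audio format for audio
        return {
            "type": "input_audio",
            "input_audio": {"data": b64_data, "format": mime_type.split("/")[1]},
        }

    # Default: data URL wrapped in an image_url block
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{b64_data}"},
    }

print(format_media_block("anthropic", "QUJD", "image/png")["type"])  # image
print(format_media_block("anthropic", "QUJD", "video/mp4")["type"])  # document
```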