Conversation
This should be mergeable. Not all vendors support video and audio, but that's acceptable in my opinion. The LLM prompter is now quite bulky and should be refactored, but maybe that's for another day.
Pull request overview
This PR extends the llm-prompter processor to support media-archive parent datasets (ZIPs containing image/video/audio), enabling multimodal prompting using locally extracted media files in addition to existing URL-based media inputs.
Changes:
- Add ZIP media-archive compatibility and a dedicated processing path in `LLMPrompter` (options/UI + iteration + annotation mapping).
- Extend `LLMAdapter` multimodal support to accept local media file paths (base64-encoded) alongside media URLs, with provider-specific formatting.
- Minor UI/UX tweaks for annotation rendering and link wrapping, plus marking `AudioExtractor` ZIP outputs as `media_type="audio"`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `common/lib/llm.py` | Adds local `media_files` support for multimodal prompts and provider-specific content block formatting. |
| `processors/machine_learning/llm_prompter.py` | Adds media-archive dataset compatibility, media-specific options, and a ZIP iteration + LLM prompting path. |
| `processors/audio/audio_extractor.py` | Marks resulting ZIP datasets as audio media type. |
| `common/assets/llms.json` | Updates/renames several predefined model IDs and model card links. |
| `webtool/templates/explorer/item-annotations.html` | Uses `item_id` consistently in DOM ids/classes and avoids variable shadowing. |
| `webtool/static/css/explorer-annotation-generic.css` | Improves wrapping behavior for long annotation label/link text. |
```python
self.dataset.log(f"Could not load .metadata.json for annotation mapping: {e}. "
                 f"Annotations will use filenames as item IDs.")
...
for item in self.source_dataset.iterate_items(staging_area=staging_area, immediately_delete=False):
```

`iterate_items(..., immediately_delete=False)` extracts all archive members into the staging area and keeps them until cleanup. For large media archives this can cause significant temporary disk usage even though each file is only needed for the current iteration. Consider using the default `immediately_delete=True` (or explicitly deleting `item.file` after the LLM call).

Suggested change:
```python
for item in self.source_dataset.iterate_items(staging_area=staging_area):
```
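If `immediately_delete=False` does need to stay for some other reason, the per-item disk footprint can still be bounded by unlinking each file once it has been processed. A minimal sketch of that pattern, with a hypothetical `process` callable standing in for the LLM call and throwaway files standing in for extracted archive members:

```python
import tempfile
from pathlib import Path

def iterate_and_cleanup(paths, process):
    """Process each file, then delete it so only one extracted file exists at a time."""
    results = []
    for path in paths:
        try:
            results.append(process(path))
        finally:
            path.unlink(missing_ok=True)  # free disk space before the next item
    return results

# Demo: three small files stand in for extracted archive members
tmpdir = Path(tempfile.mkdtemp())
files = []
for i in range(3):
    f = tmpdir / f"item{i}.bin"
    f.write_bytes(b"x" * 10)
    files.append(f)

sizes = iterate_and_cleanup(files, lambda p: p.stat().st_size)
print(sizes)                            # [10, 10, 10]
print(any(f.exists() for f in files))   # False: all cleaned up
```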
```python
self.dataset.update_status(f"Generating text at row {row:,}/"
                           f"{max_processed:,} with {model}{batch_str}")
# Now finally generate some text!
self.dataset.update_status(f"Processing {media_archive_type} file {row - 1:,}/{max_processed:,} "
```

The status message uses `{row - 1}` even though `row` is already 1-based and incremented at the top of the loop. This will display `0/...` for the first processed file and stay off by one thereafter. Use `row` (or better, the processed counter `i`) consistently.

Suggested change:
```python
self.dataset.update_status(f"Processing {media_archive_type} file {row:,}/{max_processed:,} "
```
```diff
@@ -128,14 +130,16 @@ def generate_text(
     system_prompt: Optional[str] = None,
     temperature: float = 0.1,
     files: Optional[List[Union[str, Path, dict]]] = None,
```

`files` is typed as `List[Union[str, Path, dict]]`, but it's passed into `create_multimodal_content(media_urls=...)`, which now enforces `str` URLs only. This mismatch can mislead callers and will raise at runtime for `Path`/`dict` entries. Align the type hints/docs with the actual accepted type (e.g. `List[str]` for URLs) or expand `create_multimodal_content` to support the advertised unions.

Suggested change:
```python
files: Optional[List[str]] = None,
```
```python
mime_type = mimetypes.guess_type(url)[0] or "application/octet-stream"
media_category = mime_type.split("/")[0]  # "image", "video", or "audio"
content.append(self._format_media_block(url=url, mime_type=mime_type, media_category=media_category))
```

`mimetypes.guess_type(url)` often returns `None` for URLs with query strings (e.g. `...jpg?size=...`). That falls back to `application/octet-stream`, which makes `media_category` become `application` and will format Anthropic blocks as `document` instead of `image`. Consider deriving the MIME type from the URL path component (strip query/fragment) or defaulting unknown URL types to `image/*` when `use_media` is intended for images.
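A sketch of deriving the MIME type from the URL's path component only, using `urllib.parse.urlparse` to drop query strings and fragments (the `image/jpeg` fallback is an arbitrary illustrative default, not a value from this codebase):

```python
import mimetypes
from urllib.parse import urlparse

def guess_media_mime(url: str, default: str = "image/jpeg") -> str:
    """Guess a MIME type from the URL path only, ignoring query and fragment."""
    path = urlparse(url).path  # ".../photo.jpg?size=large" -> ".../photo.jpg"
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type or default

print(guess_media_mime("https://cdn.example.com/photo.jpg?size=large"))  # image/jpeg
print(guess_media_mime("https://cdn.example.com/clip.mp4#t=10"))         # video/mp4
```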
```python
with file_path.open("rb") as f:
    b64_data = base64.b64encode(f.read()).decode("utf-8")
```

`f.read()` loads the entire media file into memory before base64 encoding. For video/audio archives this can easily be tens or hundreds of MB per item and may OOM the worker. Consider enforcing a maximum file size (with a clear error), downsampling/transcoding upstream, or at least logging and skipping files above a safe threshold.
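A sketch of a size guard before encoding; the 20 MB threshold and the `encode_media` helper name are illustrative choices, not taken from this codebase:

```python
import base64
import tempfile
from pathlib import Path

MAX_MEDIA_BYTES = 20 * 1024 * 1024  # arbitrary example threshold

def encode_media(file_path: Path, max_bytes: int = MAX_MEDIA_BYTES) -> str:
    """Base64-encode a media file, refusing files above max_bytes."""
    size = file_path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{file_path.name} is {size:,} bytes, "
                         f"above the {max_bytes:,}-byte limit; skipping")
    with file_path.open("rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Demo with a tiny throwaway file
tmp = Path(tempfile.mkdtemp()) / "sample.bin"
tmp.write_bytes(b"abc")
print(encode_media(tmp))  # YWJj
```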
```json
"gpt-5.4": {
    "name": "[OpenAI] GPT-5.4",
    "model_card": "https://platform.openai.com/docs/models/gpt-5.4",
    "provider": "openai"
```

Renaming model IDs (e.g. `gpt-5.2` → `gpt-5.4`, `gpt-5.2-pro` → `gpt-5.4-pro`, `gemini-3.1-flash-lite` → `...-preview`) is a breaking change for any saved processor configs/datasets referencing the old IDs (they'll no longer resolve to a provider in `LLMAdapter.get_models`). Consider keeping the old IDs as deprecated aliases or adding a lightweight migration path.
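A deprecated-alias table consulted at lookup time is one lightweight option. A sketch, where `resolve_model` and the inline model table are hypothetical stand-ins for `LLMAdapter.get_models` and `llms.json` (only the GPT rename pairs from this PR are shown):

```python
# Old predefined model IDs mapped to their renamed successors
MODEL_ALIASES = {
    "gpt-5.2": "gpt-5.4",
    "gpt-5.2-pro": "gpt-5.4-pro",
}

# Stand-in for the current contents of llms.json
MODELS = {
    "gpt-5.4": {"name": "[OpenAI] GPT-5.4", "provider": "openai"},
    "gpt-5.4-pro": {"name": "[OpenAI] GPT-5.4 Pro", "provider": "openai"},
}

def resolve_model(model_id: str) -> dict:
    """Resolve a model ID, transparently following deprecated aliases."""
    model_id = MODEL_ALIASES.get(model_id, model_id)
    try:
        return MODELS[model_id]
    except KeyError:
        raise ValueError(f"Unknown model ID: {model_id}") from None

print(resolve_model("gpt-5.2")["provider"])  # openai: old ID still resolves
```

Saved datasets keep working without rewriting stored configs, and the alias table documents the rename history in one place.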
```jinja
{% set item = from_datasets[annotation.from_dataset] %}
{% if item.type in processors %}
    {% set processor_options = processors[item.type].get_options(config=__config) %}
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
```

`processor_options` is only defined when `from_dataset_item.type in processors`, but it's used unconditionally later in the loop. If the dataset's processor type is unknown/unregistered, this will raise an undefined-variable error and break rendering. Initialize `processor_options` to `{}` (or guard the later condition with `from_dataset_item.type in processors`) before iterating options.

Suggested change:
```jinja
{% set from_dataset_item = from_datasets[annotation.from_dataset] %}
{% set processor_options = {} %}
```
```python
self.dataset.update_status(f"Processing {media_archive_type} files from archive")
staging_area = self.dataset.get_staging_area()
row = 0
max_processed = limit if limit else self.source_dataset.num_rows
```

`max_processed = limit if limit else self.source_dataset.num_rows` isn't capped to the archive size (unlike the CSV/NDJSON branch, which uses `min(limit, num_rows)`). If the user enters a limit larger than the archive, progress/status totals will be misleading. Cap `max_processed` to `self.source_dataset.num_rows` for consistent behavior.

Suggested change:
```python
max_processed = min(limit, self.source_dataset.num_rows) if limit else self.source_dataset.num_rows
```
```python
batched_data = {}
n_batched = 0
# Skip 'signature' and 'type' annotations for Google
if provider == "google" and output_key in (".signature", ".type"):
```

The Google "signature/type" filter won't match keys produced by `flatten_dict({model: output_item})` (those keys are prefixed with the model ID, e.g. `<model>.extras.signature`). As written, these metadata fields will likely still be saved as annotations. Consider filtering via `output_key.endswith(".signature")` / `endswith(".type")` (or stripping the `<model>.` prefix before comparing).

Suggested change:
```python
if provider == "google" and (
    output_key.endswith(".signature") or output_key.endswith(".type")
):
```
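The mismatch is easy to demonstrate with a minimal dot-joining `flatten_dict` stand-in (the real helper's output format is assumed to match; the `gemini-x` model ID is a placeholder):

```python
def flatten_dict(d, parent_key=""):
    """Flatten nested dicts into dot-separated keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_dict(value, new_key))
        else:
            items[new_key] = value
    return items

output_item = {"text": "hi", "extras": {"signature": "abc", "type": "thought"}}
flat = flatten_dict({"gemini-x": output_item})
print(sorted(flat))
# ['gemini-x.extras.signature', 'gemini-x.extras.type', 'gemini-x.text']

# A membership test against bare ".signature"/".type" never matches the
# prefixed keys above; a suffix check does:
kept = [k for k in flat if not k.endswith((".signature", ".type"))]
print(kept)  # ['gemini-x.text']
```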
Extends `llm-prompter` to work with parent datasets that are media archives (zip files from image downloaders or media imports), not just text-based CSV/NDJSON datasets.

`common/lib/llm.py`:
- `create_multimodal_content()` now accepts `media_files` (local paths, base64-encoded) alongside the existing `media_urls`
- `_format_media_block()`: new helper for provider-specific content blocks (`image` blocks for images and `document` blocks for video/audio, the `input_audio` format for audio, the `image_url` wrapper)
- `generate_text()` gains a `media_files` parameter to pass local file paths

`processors/machine_learning/llm_prompter.py`:
- `is_compatible_with()`: accepts `zip` datasets with `media_type` in (image, video, audio)
- `get_options()`: shows media-specific options when the parent is a media archive
- `process()`: new media-archive code path that iterates zip contents, skips metadata files, base64-encodes each media file, and sends it to the LLM via the `media_files` param. Catches model incompatibility errors (e.g. a non-vision model receiving images) with clear user-facing messages.
- `validate_query()`: relaxes the column bracket requirement for media archives; allows an empty user prompt when a system prompt is provided

All existing text-based processing behavior is preserved in the `else` branch. All models and custom model IDs remain available; incompatibility is caught at generation time rather than upfront.
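Based on the description above, the provider-specific dispatch in `_format_media_block()` presumably looks roughly like the following sketch. All field names here are assumptions drawn from common provider message formats, not the actual implementation; consult each provider's current API reference before relying on them:

```python
def format_media_block(provider: str, b64_data: str, mime_type: str) -> dict:
    """Illustrative provider-specific content block for one base64-encoded media file."""
    media_category = mime_type.split("/")[0]  # "image", "video", or "audio"

    if provider == "anthropic":
        # image blocks for images, document blocks for video/audio
        block_type = "image" if media_category == "image" else "document"
        return {
            "type": block_type,
            "source": {"type": "base64", "media_type": mime_type, "data": b64_data},
        }

    if provider == "openai" and media_category == "audio":
        # input_audio format for audio
        return {
            "type": "input_audio",
            "input_audio": {"data": b64_data, "format": mime_type.split("/")[1]},
        }

    # Default: data URL wrapped in an image_url block
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{b64_data}"},
    }

print(format_media_block("anthropic", "QUJD", "image/png")["type"])  # image
print(format_media_block("anthropic", "QUJD", "video/mp4")["type"])  # document
```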