Add Upload annotations processor #578

Merged
sal-uva merged 4 commits into master from copilot/add-upload-annotations-processor
Mar 9, 2026
Conversation

Copilot AI (Contributor) commented Mar 9, 2026

A new processor that lets users upload annotations for a dataset via a CSV file or pasted text, plus hot-updating of the annotation fields list in the UI.

Processor (processors/conversion/upload_annotations.py)

  • Two upload methods: CSV file upload or textarea paste (with configurable separator via requires conditional display)
  • Frontend validation (validate_query): checks data format, max 20 annotation fields, no duplicate labels, and conflicts with existing annotation fields on the dataset
  • File handling: after_create persists the uploaded file to .importing for async processing
  • Processing: validates item IDs against source dataset, skips unmatched IDs, saves annotations in batches of 2500, uses finish_with_warning() for partial success
  • Compatible with top-level CSV/NDJSON datasets

Example input:

item_id,sentiment,topic
post_123,positive,politics
post_456,negative,sports
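The parsing, ID-skipping, and batching behaviour described above could be sketched roughly as follows. This is an illustrative standalone function, not the actual processor code; the real processor works with 4CAT dataset objects rather than raw strings.

```python
import csv
import io

def parse_annotation_rows(raw_text, valid_item_ids, separator=",", batch_size=2500):
    # First row is the header: column 1 is the item ID, the remaining
    # columns are annotation field labels.
    reader = csv.reader(io.StringIO(raw_text), delimiter=separator)
    labels = [h.strip() for h in next(reader)[1:]]
    annotations, skipped = [], 0
    for row in reader:
        # Rows with an item ID not present in the source dataset are skipped
        if not row or row[0] not in valid_item_ids:
            skipped += 1
            continue
        annotations.extend(
            {"item_id": row[0], "label": label, "value": value}
            for label, value in zip(labels, row[1:])
        )
    # Split into batches for saving (the processor saves in batches of 2500)
    batches = [annotations[i:i + batch_size] for i in range(0, len(annotations), batch_size)]
    return batches, skipped
```

The skipped count would then feed into `finish_with_warning()` when some rows could not be matched.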

Hot-update (fourcat.js + api_tool.py)

  • check-processors API response now includes annotation_fields from the top parent dataset
  • New update_annotation_fields_list() in JS dynamically inserts or appends annotation field badges when a processor completes, without page reload
Original prompt

This section details the original issue you should resolve:

<issue_title>Upload annotations processor</issue_title>
<issue_description>Add a processor that allows users to upload new annotations for a dataset.

  • It should be called 'Upload annotations'
  • This lets users:
    • Upload a CSV file with annotations: the first column of the data should contain an item_id; the data in the remaining columns are interpreted as text input annotations, each of which becomes its own annotation.
    • Paste annotations in a textarea input. Here, each item's annotations should be on their own line, starting with the item ID, followed by the annotation data. Above the textarea, a text input allows users to indicate the separator (default: comma; use requires to show this field only if text upload is selected).
  • The uploaded data is then evaluated for safety and conformity. If an annotation field already exists, this should be caught both in the frontend (validate_query(), which lets the user know their input data should change) and in the backend (in the processor's process(), or another logical place). No more than 20 annotations per item should be allowed, and it should be clear that a comma separates the data.
  • Annotations are then saved to annotations with the correct data. Annotation fields should then be saved to the datasets table appropriately and should not interfere with existing annotations.
  • item_ids that are not present in the dataset are skipped.
  • The dataset.update_status() should show clear updates and if things are skipped / partially failed, this should be shown with finish_with_warning()

As a bonus, make sure that annotation-fields-list in result-details.html gets hot-updated whenever annotation fields are added/changed, without having to reload the page. Make sure to read AGENTS.md.</issue_description>
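The validation rules above (at most 20 annotation fields, no duplicates, no clashes with fields already on the dataset) could be sketched as below. validate_annotation_upload is a hypothetical helper for illustration, not part of the 4CAT API.

```python
def validate_annotation_upload(labels, existing_labels, max_fields=20):
    """Return an error message describing the first problem found,
    or None if the uploaded annotation field labels are acceptable."""
    if len(labels) > max_fields:
        return "A maximum of %i annotation fields is allowed." % max_fields
    # Labels that occur more than once within the upload itself
    duplicates = sorted({label for label in labels if labels.count(label) > 1})
    if duplicates:
        return "Duplicate annotation field(s): %s" % ", ".join(duplicates)
    # Labels that clash with annotation fields already on the dataset
    conflicts = sorted(set(labels) & set(existing_labels))
    if conflicts:
        return "Annotation field(s) already exist: %s" % ", ".join(conflicts)
    return None
```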

<agent_instructions>See the first comment of this PR for instructions.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stijn-uva: @sal-uva is this now done with the merging of #428?
@sal-uva: No it is not, it's still on my to-do list! Let me know if you have any thoughts on the above.


Copilot AI and others added 2 commits March 9, 2026 14:06
…-list

Co-authored-by: sal-uva <10960315+sal-uva@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add upload annotations processor for datasets" to "Add Upload annotations processor" on Mar 9, 2026
@sal-uva sal-uva marked this pull request as ready for review March 9, 2026 18:51
Copilot AI review requested due to automatic review settings March 9, 2026 18:51
@sal-uva sal-uva merged commit 0511192 into master Mar 9, 2026
3 checks passed

Copilot AI left a comment


Pull request overview

This PR implements the "Upload Annotations" feature requested in issue #223, which allows users to upload annotation data for existing datasets via a CSV file or pasted text. Additionally, it adds hot-updating of the annotation fields section in the UI when a processor completes, without requiring a page reload.

Changes:

  • New upload-annotations processor that parses and saves annotation data (including frontend validation via validate_query, file handling via after_create, and async batch processing in process())
  • Hot-update feature for the annotation fields list in the UI: the check-processors API response now includes annotation_fields from the top-parent dataset, and new JS code dynamically updates the displayed annotation badges on processor completion
  • Category reclassification of annotation_metadata.py from "Metrics" to "Conversion"

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
processors/conversion/upload_annotations.py New processor class for uploading annotations via CSV file or pasted text
webtool/views/api_tool.py Adds annotation_fields to the check-processors API response
webtool/static/js/fourcat.js Adds update_annotation_fields_list() function and calls it when a processor finishes
processors/metrics/annotation_metadata.py Changes category from "Metrics" to "Conversion"


# Write uploaded annotations as output file
warning = None
if skipped:
    warning = "Uploaded %i annotation(s) for %i item(s). Skipped %i row(s) (item ID not found in dataset or empty)." % (saved, total_rows, skipped)
Copilot AI commented Mar 9, 2026

The warning message at line 325 uses total_rows for the second %i placeholder (number of items), but total_rows is the total number of input data rows (including skipped rows), not the count of successfully processed items. The second placeholder represents "how many items were uploaded", which should be len(results) (the number of unique items that had at least one annotation saved).

For example, if the file has 100 rows and 10 were skipped, total_rows is 100, but the message says "Uploaded X annotation(s) for 100 item(s). Skipped 10 row(s)..." — which is misleading, as 90 items (not 100) were actually processed.

The correct variable to use here is len(results) (or a dedicated counter tracking the number of successfully processed items).
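The corrected message construction could look like the sketch below. build_upload_warning is a hypothetical helper; the variable names (saved, results, skipped) follow the review comment, not necessarily the processor's actual code.

```python
def build_upload_warning(saved, results, skipped):
    """Return a warning message for finish_with_warning(), or None when
    nothing was skipped. Reports the number of items that actually received
    annotations (len(results)) rather than the raw input row count."""
    if not skipped:
        return None
    return (
        "Uploaded %i annotation(s) for %i item(s). "
        "Skipped %i row(s) (item ID not found in dataset or empty)."
        % (saved, len(results), skipped)
    )
```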

Comment on lines +217 to +246
# Parse data
try:
    reader = csv.reader(f_handle, delimiter=separator)
    header = next(reader, None)
except csv.Error:
    self.dataset.finish_with_error("Could not parse the uploaded data. Make sure the correct separator is used between columns.")
    return

if not header or len(header) < 2:
    self.dataset.finish_with_error(
        "Data must have at least two columns: the first for item IDs, and at least one for annotations."
    )
    return

annotation_labels = [h.strip() for h in header[1:]]

# Check max 20 annotations
if len(annotation_labels) > 20:
    self.dataset.finish_with_error("A maximum of 20 annotation fields per upload is allowed.")
    return

# Check for duplicate annotation field names in existing annotations
source_dataset = self.source_dataset
existing_labels = source_dataset.get_annotation_field_labels()
conflicts = [label for label in annotation_labels if label in existing_labels]
if conflicts:
    self.dataset.finish_with_error(
        "Annotation field(s) already exist: %s. Please rename them in your input data." % ", ".join(conflicts)
    )
    return
Copilot AI commented Mar 9, 2026

The f_handle file handle is opened at line 206 (or assigned at line 215), but when the csv.Error exception is caught at line 221-223 and the function returns early, f_handle is not closed before the return. The same applies to the validation failure at lines 225-229 and 234-246. This leaks file handles (particularly for upload_method == "file", where a real file is opened). The code should use a try/finally block or a context manager (with statement) around the entire processing section to ensure f_handle is always closed, even on early exit.
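A sketch of the suggested fix, with parsing wrapped in try/finally so the handle is closed on every exit path. read_header and the finish_with_error callback are illustrative stand-ins for the processor's actual methods, not 4CAT internals.

```python
import csv

def read_header(f_handle, separator, finish_with_error):
    """Parse and validate the CSV header, closing f_handle no matter
    which validation branch exits early. Returns the annotation field
    labels, or None on error."""
    try:
        try:
            reader = csv.reader(f_handle, delimiter=separator)
            header = next(reader, None)
        except csv.Error:
            finish_with_error("Could not parse the uploaded data.")
            return None
        if not header or len(header) < 2:
            finish_with_error("Data must have at least two columns.")
            return None
        return [h.strip() for h in header[1:]]
    finally:
        # Runs on every return path, including the early validation exits
        f_handle.close()
```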

Comment on lines +686 to +700
        let html = '<div class="fullwidth annotation-fields-list"><dt>Annotations</dt><dd><ul>';
        for (let field_id in annotation_fields) {
            let label = annotation_fields[field_id].label;
            html += '<li><span class="property-badge" data-annotation-id="' + field_id + '"><i class="fa-solid fa-tag"></i> ' + label + '</span></li>';
        }
        html += '</ul></dd></div>';
        card_dl.find('#dataset-result').before(html);
    }
} else {
    // Update existing section: add any new fields
    let ul = annotation_list.find('ul');
    for (let field_id in annotation_fields) {
        if (ul.find('[data-annotation-id="' + field_id + '"]').length === 0) {
            let label = annotation_fields[field_id].label;
            ul.append('<li><span class="property-badge" data-annotation-id="' + field_id + '"><i class="fa-solid fa-tag"></i> ' + label + '</span></li>');
Copilot AI commented Mar 9, 2026

In update_annotation_fields_list, both field_id and label (user-supplied annotation field IDs and labels from uploaded CSV headers) are concatenated directly into HTML strings at lines 689 and 700 without any HTML escaping. A maliciously crafted annotation label like "><script>alert(1)</script> would result in XSS when the hot-update function renders it. The label and field_id values should be HTML-escaped (e.g., using a utility function) before being inserted into the DOM. Alternatively, DOM APIs (like document.createElement, textContent, and setAttribute) should be used instead of string concatenation to build HTML.
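A minimal sketch of the escaping approach: run every user-supplied value through an HTML-escaping helper before concatenating it into markup. escape_html and annotation_badge_html are hypothetical helpers for illustration, not existing fourcat.js functions.

```javascript
// Escape the five HTML-significant characters so user-supplied values
// cannot break out of attribute or text context.
function escape_html(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// Build a single annotation badge with both field_id and label escaped.
function annotation_badge_html(field_id, label) {
    return '<li><span class="property-badge" data-annotation-id="'
        + escape_html(field_id) + '"><i class="fa-solid fa-tag"></i> '
        + escape_html(label) + '</span></li>';
}
```

Alternatively, building the elements via document.createElement and textContent sidesteps escaping entirely.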

Comment on lines +681 to +693
if (annotation_list.length === 0) {
    // Section does not exist yet; find the dataset details card and insert after
    // the last existing fullwidth div before API Credentials or Processors sections
    let card_dl = $('.result-page .card dl');
    if (card_dl.length > 0) {
        let html = '<div class="fullwidth annotation-fields-list"><dt>Annotations</dt><dd><ul>';
        for (let field_id in annotation_fields) {
            let label = annotation_fields[field_id].label;
            html += '<li><span class="property-badge" data-annotation-id="' + field_id + '"><i class="fa-solid fa-tag"></i> ' + label + '</span></li>';
        }
        html += '</ul></dd></div>';
        card_dl.find('#dataset-result').before(html);
    }
Copilot AI commented Mar 9, 2026

When update_annotation_fields_list creates a new .annotation-fields-list section (the annotation_list.length === 0 branch), the generated HTML div is missing the data-label-edit-href attribute that the existing server-rendered section in result-details.html (line 134) always includes. The annotation_label.handle click handler at line 971 reads this attribute via find_parent(target, 'div').getAttribute('data-label-edit-href'). Without it, callback_url will be null, and the subsequent fetch(null, ...) call will fail with an invalid URL error — meaning annotation label editing will be broken for any fields added via hot-update before the page is reloaded. The dataset key needs to be made available to the JS function so it can build the correct URL for the data-label-edit-href attribute.

Comment on lines +1308 to +1311
"annotation_fields": {
    field_id: {"label": field_data.get("label", "")}
    for field_id, field_data in (top_parent.annotation_fields or {}).items()
}
Copilot AI commented Mar 9, 2026

The OpenAPI schema docstring for the check_processor endpoint does not include the newly added annotation_fields field in the response schema. This leaves the API documentation inaccurate with respect to the actual response structure returned by this endpoint.

@dale-wahl (Member) commented Mar 10, 2026

@sal-uva looks like copilot did a review after you reviewed and merged this. The hanging open file and ability to inject code that would be executed by our frontend seem the most important to address.

@sal-uva (Member) commented Mar 10, 2026

> @sal-uva looks like copilot did a review after you reviewed and merged this. The hanging open file and ability to inject code that would be executed by our frontend seem the most important to address.

That's stupid of me. Still learning the Copilot PR flow. I'll check the issues (and definitely fix the JS injection one).


Successfully merging this pull request may close these issues.

Upload annotations processor

4 participants