esg_reports_human_labeled_v2: discrepancy between test corpus size and the number of uploaded images #188

Hi,
I downloaded `visdoc-task` from Hugging Face and extracted the `data` and `images` folders.
For every other ViDoRe task I tested, the corpus size under `./data` equals the number of images under `./images`.
However, for `esg_reports_human_labeled_v2` there are 230 images under the images folder, while the two `test_*.parquet` files under `data` have 1538 rows combined, which matches the number of corpus rows in `https://huggingface.co/datasets/vidore/esg_reports_human_labeled_v2`.
Are some of the images missing?
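
For anyone who wants to reproduce the comparison, here is a minimal sketch of the check. It assumes `pyarrow` is installed and that the images are ordinary files with one of the listed extensions. The directory arguments, the extension set, and the helper names are illustrative, not part of the dataset.

```python
from pathlib import Path

# Assumption: images are stored with one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def count_parquet_rows(data_dir, pattern="test_*.parquet"):
    """Sum the row counts stored in the parquet footers (no image bytes are loaded)."""
    import pyarrow.parquet as pq  # imported lazily; requires `pip install pyarrow`

    files = sorted(Path(data_dir).glob(pattern))
    return sum(pq.read_metadata(f).num_rows for f in files)


def count_images(images_dir):
    """Count image files anywhere below images_dir."""
    return sum(
        1
        for p in Path(images_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def compare(data_dir, images_dir):
    """Return (corpus_rows, image_files, corpus_rows - image_files)."""
    rows = count_parquet_rows(data_dir)
    images = count_images(images_dir)
    return rows, images, rows - images
```

Calling `compare(data_dir, images_dir)` on the two extracted folders for a task returns both counts and their difference, so a non-zero third value flags a task where the two folders disagree.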


Context:
I need a JSONL corpus for the collection iterator in Pyserini, and the image bytes stored in the parquet files are not JSON serializable.
So I save the images locally and store only their paths in the JSONL corpus.
To be safe, I am cross-referencing the images from the dataset with those under `./images`.
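
Roughly, the conversion looks like the sketch below. It is a simplified illustration, not my exact script: the `image_path` field, the empty `contents` value, the `.png` suffix, and the function name are assumptions, and the input is taken as plain `(doc_id, image_bytes)` pairs so that it does not depend on the parquet column names. It also assumes the document ids are safe to use as file names.

```python
import json
from pathlib import Path


def write_corpus(rows, images_dir, jsonl_path, suffix=".png"):
    """Save each image to disk and write one JSON line per document.

    rows: iterable of (doc_id, image_bytes) pairs.
    Returns the number of documents written.
    """
    images_dir = Path(images_dir)
    images_dir.mkdir(parents=True, exist_ok=True)

    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for doc_id, image_bytes in rows:
            # Raw bytes are not JSON serializable, so they go to a file instead.
            image_path = images_dir / f"{doc_id}{suffix}"
            image_path.write_bytes(image_bytes)

            record = {
                "id": str(doc_id),
                "contents": "",  # assumption: no text content for image-only pages
                "image_path": str(image_path),  # hypothetical extra field
            }
            out.write(json.dumps(record) + "\n")
            written += 1
    return written
```

The return value gives the corpus size actually written, which can then be compared with the number of files under `./images`.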


