esg_reports_human_labeled_v2: discrepancy between test corpus size and the number of uploaded images #188

@sahel-sh

Hi,
I downloaded the visdoc-task from Hugging Face and extracted the data and images folders.
For every other vidore task I tested, the corpus size under ./data equals the number of images under ./images.
For esg_reports_human_labeled_v2, however, there are only 230 images under ./images, while the two test_*.parquet files under ./data have 1538 rows combined, which matches the number of corpus rows in https://huggingface.co/datasets/vidore/esg_reports_human_labeled_v2.
Are some of the images missing?
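
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two counts can be obtained. The helper names, the list of image suffixes, and the `test_*.parquet` default pattern are my own assumptions, not part of the dataset tooling. Row counts are read from the parquet footer via `pyarrow`, so no image bytes are loaded.

```python
from pathlib import Path

# Assumption: images are stored with one of these suffixes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def count_images(images_dir):
    """Count image files (matched by suffix) anywhere under images_dir."""
    return sum(
        1
        for p in Path(images_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def count_parquet_rows(data_dir, pattern="test_*.parquet"):
    """Sum the row counts recorded in the footers of the matching parquet files."""
    # pyarrow is a third-party dependency; imported lazily so that
    # count_images works without it.
    import pyarrow.parquet as pq

    return sum(
        pq.ParquetFile(p).metadata.num_rows
        for p in sorted(Path(data_dir).glob(pattern))
    )
```

Calling `count_images("./images")` and `count_parquet_rows("./data")` inside the extracted task folder should give the two numbers being compared above.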

Context:
I need a JSONL corpus for the collection iterator in Pyserini, and the image bytes stored in the parquet files are not JSON-serializable.
So I save the images locally and keep only each image's path in the JSONL corpus.
I am cross-referencing the images from the dataset with those under ./images just to be sure.
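
The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record keys `id` and `image_bytes`, the output field `image_path`, the `.png` suffix, and the function name are all hypothetical, and the real column names in the parquet files may differ. Files are named by the SHA-256 of their bytes, so byte-identical images are written once, and the function returns both the number of rows written and the number of distinct images.

```python
import hashlib
import json
from pathlib import Path


def write_jsonl_corpus(records, image_dir, jsonl_path, suffix=".png"):
    """Save each record's image to disk and write a JSONL corpus of paths.

    `records` is any iterable of dicts with the (hypothetical) keys
    "id" and "image_bytes". Returns (rows_written, distinct_images).
    """
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)

    path_by_digest = {}  # sha256 hex digest -> path of the saved image
    rows_written = 0

    with open(jsonl_path, "w", encoding="utf-8") as out:
        for rec in records:
            digest = hashlib.sha256(rec["image_bytes"]).hexdigest()
            if digest not in path_by_digest:
                # Assumption: the suffix matches the actual image encoding.
                path = image_dir / f"{digest}{suffix}"
                path.write_bytes(rec["image_bytes"])
                path_by_digest[digest] = str(path)
            # Only the path is stored, so the line is JSON-serializable.
            line = {"id": rec["id"], "image_path": path_by_digest[digest]}
            out.write(json.dumps(line) + "\n")
            rows_written += 1

    return rows_written, len(path_by_digest)
```

If the two returned numbers differ, several corpus rows carry byte-identical images. That would be one possible explanation for a gap between row count and image count, but I have not confirmed that this is what happens here.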
