Hi,
I downloaded visdoc-task from Hugging Face and extracted the data and images folders.
For every other vidore task I tested, the corpus size under ./data equals the number of images under ./images.
However, for esg_reports_human_labeled_v2 there are only 230 images under the images folder, while the two test_*.parquet files under data have 1538 rows combined. That count matches the number of corpus rows in https://huggingface.co/datasets/vidore/esg_reports_human_labeled_v2.
Are some of the images missing?
Context:
I need a JSONL corpus for the collection iterator in Pyserini, and the parquet image bytes are not JSON serializable.
So I save the images locally and keep only the path in the JSONL corpus.
I am cross-referencing the images from the dataset with those under ./images just to be sure.
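
For reference, here is a minimal sketch of the export and cross-check described above. It is not my exact script. The row keys `corpus_id` and `image_bytes`, the JSONL field names `id` and `image_path`, the `.png` extension, and the idea that files under ./images are named after the corpus id are all assumptions made for illustration; the real parquet columns and file names may differ.

```python
import json
from pathlib import Path


def export_corpus(rows, image_dir, jsonl_path):
    """Write each row's image bytes to disk and keep only the path in the JSONL.

    `rows` is any iterable of dicts with the hypothetical keys "corpus_id"
    and "image_bytes" (for example, records loaded from the parquet files).
    Returns the number of corpus entries written.
    """
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for row in rows:
            # Assumed naming scheme: <corpus_id>.png
            image_path = image_dir / f"{row['corpus_id']}.png"
            image_path.write_bytes(row["image_bytes"])
            record = {"id": str(row["corpus_id"]), "image_path": str(image_path)}
            out.write(json.dumps(record) + "\n")
            written += 1
    return written


def missing_images(jsonl_path, reference_dir):
    """Return corpus ids with no same-named file (ignoring extension) in reference_dir."""
    on_disk = {p.stem for p in Path(reference_dir).iterdir() if p.is_file()}
    with open(jsonl_path, encoding="utf-8") as f:
        ids = [json.loads(line)["id"] for line in f if line.strip()]
    return [corpus_id for corpus_id in ids if corpus_id not in on_disk]
```

Under these assumptions, a non-empty result from `missing_images` pointed at the extracted ./images folder would list the corpus rows that have no matching image file.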