This repository was archived by the owner on Oct 19, 2024. It is now read-only.

improvement: reduce setup time of AbstractWriterCallback #88

Description

@YoniSchirris

Describe the bug
When running inference, AbstractWriterCallback loops over all datasets to construct the _dataset_sizes dict. This opens each slide from cache several times, and each open can take 1-3 seconds. For a dataset of 1500 WSIs, this setup often takes 20 minutes.

To Reproduce
Run inference on-the-fly (#87) with your data_dir and glob_pattern set up to find many whole-slide images.

Expected behavior
After the dataset statistics are printed, it takes a long time before the callback workers start being set up.

In my case, about five minutes pass between these two log messages:

[2024-06-07 12:24:32,332][ahcore.data.dataset.DlupDataModule][INFO] - Dataset for stage predict has 773079 samples and the following statistics:
 - Mean: 485.30
 - Std: 145.56
 - Min: 48.00
 - Max: 1056.00
[2024-06-07 12:29:30,294][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback

Environment
dlup version: 0.3.38
How installed: unsure
Python version: 3.11.9
Operating System: linux

Quick solution that should roughly halve the setup time:
In the loop

for current_dataset in self._total_dataset.datasets: # type: ignore
change

assert current_dataset.slide_image.identifier
self._dataset_sizes[current_dataset.slide_image.identifier] = len(current_dataset)

to

current_dataset_slide_id = current_dataset.slide_image.identifier
assert current_dataset_slide_id
self._dataset_sizes[current_dataset_slide_id] = len(current_dataset)

This accesses current_dataset.slide_image once per dataset instead of twice, which will likely reduce the time by half.
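
The effect of the change can be illustrated with a small, self-contained sketch. It does not use ahcore or dlup: `FakeSlide`, `FakeDataset`, `sizes_original`, and `sizes_cached` are hypothetical stand-ins, and the sketch assumes that every access to `slide_image` re-opens the slide, which is the slow step described above. It only counts how many times the slide would be opened under each variant.

```python
from dataclasses import dataclass


@dataclass
class FakeSlide:
    """Hypothetical stand-in for a slide object; not the dlup class."""
    identifier: str


class FakeDataset:
    """Hypothetical stand-in for a per-slide dataset; not the ahcore/dlup class."""

    def __init__(self, identifier: str, n_tiles: int) -> None:
        self._identifier = identifier
        self._n_tiles = n_tiles
        self.open_count = 0  # number of times the "slide" was opened

    @property
    def slide_image(self) -> FakeSlide:
        # Assumption: each property access re-opens the slide (the slow part).
        self.open_count += 1
        return FakeSlide(self._identifier)

    def __len__(self) -> int:
        return self._n_tiles


def sizes_original(datasets: list[FakeDataset]) -> dict[str, int]:
    """Mirrors the current loop: slide_image is accessed twice per dataset."""
    sizes: dict[str, int] = {}
    for current_dataset in datasets:
        assert current_dataset.slide_image.identifier
        sizes[current_dataset.slide_image.identifier] = len(current_dataset)
    return sizes


def sizes_cached(datasets: list[FakeDataset]) -> dict[str, int]:
    """Mirrors the proposed loop: the identifier is read once and reused."""
    sizes: dict[str, int] = {}
    for current_dataset in datasets:
        current_dataset_slide_id = current_dataset.slide_image.identifier
        assert current_dataset_slide_id
        sizes[current_dataset_slide_id] = len(current_dataset)
    return sizes
```

Under that assumption, `sizes_original` triggers two opens per dataset and `sizes_cached` triggers one, while both return the same mapping. Whether the real setup time drops by exactly half depends on how much of it is spent in the slide access itself.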

Labels: bug, enhancement