Reduce containers overhead #267

@kim-fehl

Description of the bug

Currently, the pipeline uses more than 20 container images, totaling ~60 GB of disk usage.
I remember that the recommended principle for nf-core modules is "one tool – one container", but here we mostly have local modules, and I think some of the redundancy (caused by incremental development) can be reduced.

| IMAGE | ID | DISK USAGE | CONTENT SIZE |
|---|---|---|---|
| SEQERA: anndata2ri_bioconductor-singlecellexperiment_anndata_r-seurat:5fae42aabf7a1c5f | 291a48658716 | 3.38GB | 846MB |
| SEQERA: anndata:0.10.9--1eab54e300e1e584 | 3471efcc8b48 | 936MB | 231MB |
| SEQERA: anndata_pyyaml:82c6914e861435f7 | 7946ce8a97db | 1.06GB | 261MB |
| SEQERA: anndata_upsetplot:784e0f450da10178 | 766c2dff4b54 | 1.18GB | 306MB |
| SEQERA: bbknn_pyyaml_scanpy:4cf2984722da607f | 453b5e1f8972 | 1.6GB | 382MB |
| SEQERA: bioconductor-celldex_bioconductor-hdf5array_bioconductor-singlecellexperiment_r-yaml:13bf33457e3e7490 | fdb9aa052292 | 2.33GB | 634MB |
| SEQERA: celltypist_scanpy:44b604b24dd4cf33 | bfe009b0a96c | 1.78GB | 431MB |
| SEQERA: harmonypy_pyyaml_scanpy:f6cc57196369fb1e | 0c62b23a31d6 | 1.63GB | 392MB |
| SEQERA: leidenalg_python-igraph_pyyaml_scanpy:4936fa196b5f4340 | 8644a451da2a | 1.66GB | 401MB |
| SEQERA: liana_pyyaml:776fdd7103df146d | 131e6bd9dccb | 2.24GB | 507MB |
| SEQERA: multiqc:1.33--ee7739d47738383b | abd5751768f8 | 2.01GB | 432MB |
| SEQERA: pandas:2.2.3--9b034ee33172d809 | 50da2ef5f060 | 765MB | 190MB |
| SEQERA: python-igraph_pyyaml_scanpy:cc0304f4731f72f9 | 8f65ff8a2191 | 1.66GB | 401MB |
| SEQERA: python_pyyaml_scanpy:b5509a698e9aae25 | e0dac9eda4d7 | 1.85GB | 461MB |
| SEQERA: python_pyyaml_scanpy_scikit-image:750e7b74b6d036e4 | e2816307a73f | 2.04GB | 509MB |
| SEQERA: pyyaml_scanpy:3c9e9f631f45553d | 7ed2839670f9 | 1.63GB | 392MB |
| SEQERA: pyyaml_scanpy:a3a797e09552fddc | 228c2994c5f4 | 1.86GB | 466MB |
| SEQERA: scanpy_upsetplot:1ce883f3ff369ca8 | a91e0a660553 | 1.67GB | 414MB |
| SEQERA: scvi-tools:1.3.3--df115aabdccb7d6b | 551e3b44c383 | 4.66GB | 1.08GB |
| SEQERA: scvi-tools:1.4.1--47f5b0e6b70fd131 | 0ac460cb48b1 | 3.47GB | 797MB |
| nicotru/celda:1d48a68e9d534b2b | 3a4f38d26238 | 2.95GB | 759MB |
| nicotru/scds:7788dbeb87bc7eec | e6aac618e327 | 2.48GB | 651MB |
| nicotru/seurat:b3b12d17271014d9 | 22f891364efc | 3.35GB | 853MB |
| nicotru/soupx:f6297681695fbfcf | 222d79287a15 | 2.82GB | 700MB |
| saditya88/singler:0.0.1 | cb267ab7d826 | 9.13GB | 2.64GB |
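
For anyone who wants to check the ~60 GB figure, here is a minimal Python sketch that sums the DISK USAGE column from the listing above. The values are copied by hand from the listing; the assumption that Docker reports decimal units (1 GB = 1000 MB) and the helper name `to_gb` are mine, not part of the pipeline.

```python
import re

# DISK USAGE column copied from the listing above, one entry per image, in order.
DISK_USAGE = """
3.38GB 936MB 1.06GB 1.18GB 1.6GB 2.33GB 1.78GB 1.63GB 1.66GB 2.24GB
2.01GB 765MB 1.66GB 1.85GB 2.04GB 1.63GB 1.86GB 1.67GB 4.66GB 3.47GB
2.95GB 2.48GB 3.35GB 2.82GB 9.13GB
""".split()

# Assumption: sizes use decimal units (1 GB = 1000 MB).
UNIT_TO_GB = {"MB": 1e-3, "GB": 1.0}


def to_gb(size: str) -> float:
    """Convert a size string such as '936MB' or '1.6GB' to gigabytes."""
    match = re.fullmatch(r"([\d.]+)(MB|GB)", size)
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    return float(match.group(1)) * UNIT_TO_GB[match.group(2)]


total_gb = sum(to_gb(size) for size in DISK_USAGE)
print(f"{len(DISK_USAGE)} images, {total_gb:.1f} GB total")
# prints "25 images, 60.1 GB total"
```

Note that this adds up the per-image disk usage as reported; layers shared between images may be counted more than once, so the real footprint on disk could be somewhat lower.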

(This issue has been brought to my attention as I rent the server and also pay for disk space 😃)

I asked Codex to analyze the repo structure and find ways to optimize container usage without touching nf-core/modules, while accounting for the Python version pinning you mentioned. Here's the output:

Implementation Plan

  • Use nf-core module containers as the canonical baseline for overlapping local tool families.
    • Align local SCVITOOLS_SCVI and SCVITOOLS_SCANVI to the same scvi-tools=1.3.3 container/env family already used by vendored SCVITOOLS_SOLO and SCVITOOLS_SCAR.
  • Collapse the local generic scanpy 1.11.5 / 1.11.2 split onto one pinned local baseline that is compatible with the nf-core scrublet stack.
    • Standardize local scanpy modules that only need core scanpy functionality on python=3.12.11, scanpy=1.11.2, pyyaml=6.0.2.
    • Apply the same base version to additive local scanpy envs (neighbors, paga, leiden, harmony, bbknn) while keeping their extra packages.
  • Collapse the local upsetplot fork.
    • Change ADATA_UPSETGENES to use anndata directly instead of scanpy for reading .h5ad.
    • Move ADATA_UPSETGENES and DOUBLET_REMOVAL onto one shared pinned local env: python=3.12.11, anndata=0.12.7, upsetplot=0.9.0.
  • Replace docker.io/saditya88/singler:0.0.1 with a local Dockerfile built from a minimal R/Wave-compatible base containing the actual R dependencies used by singleR.R, including bioconductor-hdf5array and anndataR.
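
To illustrate the "additive envs" idea in the plan, here is a rough Python sketch that looks for Seqera images whose package set is strictly contained in another image's package set, judging only by the image names from the listing. It assumes that the image name joins package names with `_`; it ignores versions and build hashes entirely (the listing has two `pyyaml_scanpy` images with different hashes, which this cannot tell apart), so the output is only a list of candidates to inspect, not proof that two envs are interchangeable. The function names `packages` and `covered_by` are hypothetical helpers.

```python
# Scanpy-related Seqera image names from the listing (tags/hashes dropped).
# Assumption: the name is the "_"-joined list of packages in the env.
IMAGES = [
    "pyyaml_scanpy",
    "python_pyyaml_scanpy",
    "python_pyyaml_scanpy_scikit-image",
    "python-igraph_pyyaml_scanpy",
    "leidenalg_python-igraph_pyyaml_scanpy",
    "harmonypy_pyyaml_scanpy",
    "bbknn_pyyaml_scanpy",
    "scanpy_upsetplot",
    "anndata_upsetplot",
    "celltypist_scanpy",
]


def packages(image_name: str) -> frozenset[str]:
    """Split an image name into its set of package names."""
    return frozenset(image_name.split("_"))


def covered_by(images: list[str]) -> dict[str, list[str]]:
    """Map each image to the images whose package set strictly contains its own."""
    sets = {name: packages(name) for name in images}
    return {
        small: sorted(big for big, pkgs in sets.items() if sets[small] < pkgs)
        for small in images
    }


for name, supersets in covered_by(IMAGES).items():
    if supersets:
        print(f"{name} is a subset of: {', '.join(supersets)}")
```

By name alone, `pyyaml_scanpy` is covered by six larger images, which is consistent with the plan's suggestion to put the core scanpy modules and the additive ones on one shared base version.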

Test Plan

  • Run nf-tests for affected local modules and subworkflows:
    • integrate
    • quality_control
    • doublet_detection
    • celltype_assignment
    • affected local modules under scanpy, scvitools, adata/upsetgenes, doublet_detection/doublet_removal, and celltypes/singler
  • Run pipeline tests with -profile test,docker and -profile test_full,docker.

Does this plan make sense?
You also mentioned that the private Docker Hub images can be replaced with Seqera ones, but Codex thought that would be too much for one conservative pass :)

Command used and terminal output

Relevant files

No response

System information

No response

Labels: enhancement (New feature or request)
