nf-autoannotate is a Nextflow pipeline for annotating a query .h5ad dataset with three independent methods:
- CellTypist
- scANVI
- PanHumanPy / Pan-Human Azimuth
The pipeline writes the method outputs back into the original query AnnData object and produces a cluster-level summary table. It does not create a consensus label.
The workflow is implemented in main.nf, with supporting Python scripts under scripts/.
prepare_scanpy_dataset.py is a utility script for creating local demo reference/query .h5ad files from a Scanpy dataset or an existing AnnData file.
The pipeline stages are:
- Validate the reference and query AnnData files, then write validation_manifest.json and shared_genes.txt.
- Train or load a CellTypist model.
- Train or load an scANVI model.
- Annotate the query with CellTypist.
- Annotate the query with scANVI.
- Annotate the query with PanHumanPy.
- Merge method outputs into the query .h5ad and write a cluster annotation summary.
Input parameters:
- --ref_h5ad: reference AnnData file.
- --query_h5ad: query AnnData file to annotate.
- --project_tag: output filename prefix.
- --ref_label_col: reference obs column containing cell type labels.
Optional reference filtering:
- --ref_timepoint_col: reference obs column containing timepoint labels.
- --ref_timepoints: comma-separated timepoints to keep, for example E12,E14.
--ref_timepoint_col and --ref_timepoints must be supplied together.
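
The pairing rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name parse_timepoint_filter and the column name stage are hypothetical, and the pipeline's own validation code may differ.

```python
from typing import List, Optional, Tuple


def parse_timepoint_filter(
    timepoint_col: Optional[str], timepoints: Optional[str]
) -> Optional[Tuple[str, List[str]]]:
    """Return (column, values), or None when no filter is requested."""
    # Exactly one of the two options being set is an error.
    if (timepoint_col is None) != (timepoints is None):
        raise ValueError(
            "--ref_timepoint_col and --ref_timepoints must be supplied together"
        )
    if timepoint_col is None:
        return None
    # Split the comma-separated list and drop empty entries such as "E12,,E14".
    values = [t.strip() for t in timepoints.split(",") if t.strip()]
    if not values:
        raise ValueError("--ref_timepoints did not contain any timepoint")
    return timepoint_col, values


# "stage" is a hypothetical obs column name used only for this example.
print(parse_timepoint_filter("stage", "E12,E14"))  # ('stage', ['E12', 'E14'])
```
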
The pipeline selects model behavior from artifact flags instead of explicit mode flags:
- CellTypist trains from the reference unless --celltypist_model is supplied.
- scANVI loads --scanvi_model_dir when supplied; otherwise it trains a model, optionally initialized from --scvi_reference_model_dir.
- PanHumanPy always annotates the query directly.
Required when using pretrained or external artifacts:
- --celltypist_model: required to reuse a pretrained CellTypist model.
- --scanvi_model_dir: required to reuse a pretrained scANVI model directory.
- --scvi_reference_model_dir: required only when initializing scANVI training from a pretrained scVI reference model. Do not combine this with --scanvi_model_dir.
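
As a rough illustration of how artifact flags map to behavior, the selection logic might look like the sketch below. The function name resolve_model_modes and the mode strings are hypothetical; only the flag names and the rule against combining --scanvi_model_dir with --scvi_reference_model_dir come from this README.

```python
from typing import Dict, Optional


def resolve_model_modes(
    celltypist_model: Optional[str] = None,
    scanvi_model_dir: Optional[str] = None,
    scvi_reference_model_dir: Optional[str] = None,
) -> Dict[str, str]:
    """Derive per-method behavior from which artifact paths were supplied."""
    if scanvi_model_dir and scvi_reference_model_dir:
        raise ValueError(
            "--scvi_reference_model_dir cannot be combined with --scanvi_model_dir"
        )
    # CellTypist: reuse the supplied model, otherwise train from the reference.
    celltypist = "load" if celltypist_model else "train"
    # scANVI: a pretrained scANVI directory wins; an scVI reference model only
    # initializes training; with neither, train from scratch.
    if scanvi_model_dir:
        scanvi = "load"
    elif scvi_reference_model_dir:
        scanvi = "train_from_scvi_reference"
    else:
        scanvi = "train"
    # PanHumanPy has no artifact flags and always annotates the query directly.
    return {"celltypist": celltypist, "scanvi": scanvi, "panhumanpy": "annotate"}


print(resolve_model_modes(celltypist_model="/path/to/celltypist_model.pkl"))
```
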
CellTypist optional controls:
- --celltypist_training_mode: standard or detailed. Default: detailed.
- --celltypist_balance_cells_per_label: maximum reference cells per label in detailed mode. Default: 500.
- --celltypist_feature_selection: run tutorial-style feature selection in detailed mode. Default: true.
- --celltypist_feature_selection_top_genes: top genes per label for feature selection. Default: 100.
- --celltypist_feature_selection_max_iter: quick SGD feature-selection iterations. Default: 5.
- --celltypist_max_iter: final CellTypist fit iterations. Default: 100.
When a CellTypist model is reused, the pipeline checks that at least 80% of the model genes are present in the query. This catches gene identifier mismatches early, before most model features would be silently filled with zeros.
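
A minimal sketch of such an overlap check, assuming the model and query genes are available as plain lists of identifiers. The function names and example genes are illustrative, not the pipeline's own code.

```python
from typing import Iterable


def model_gene_overlap(model_genes: Iterable[str], query_genes: Iterable[str]) -> float:
    """Fraction of model genes that are also present in the query."""
    model = set(model_genes)
    if not model:
        raise ValueError("model has no genes")
    return len(model & set(query_genes)) / len(model)


def check_model_gene_overlap(
    model_genes: Iterable[str], query_genes: Iterable[str], min_fraction: float = 0.8
) -> float:
    """Raise if too few model genes are found in the query."""
    fraction = model_gene_overlap(model_genes, query_genes)
    if fraction < min_fraction:
        raise ValueError(
            f"only {fraction:.0%} of model genes are present in the query "
            f"(need at least {min_fraction:.0%}); check gene identifiers"
        )
    return fraction


model_genes = ["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"]
# Four of five model genes are present, so the check passes at exactly 80%.
print(check_model_gene_overlap(model_genes, ["CD3E", "CD4", "CD8A", "MS4A1", "LYZ"]))
```

A query indexed by Ensembl IDs against a symbol-based model would give an overlap of 0 and fail immediately, which is the mismatch this check is meant to surface.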
scANVI optional controls:
- --batch_key: optional column present in both reference and query obs; omitted values stay as real None/null. Use this when batches should be modeled.
- --scanvi_categorical_covariate_keys: optional comma-separated additional covariate columns present in both objects.
- --scanvi_hvg_n_top_genes: HVGs for new scVI/scANVI training. Set 0 to keep all shared genes. Default: 2500.
- --scanvi_query_max_epochs: max epochs for query mapping from a pretrained scVI reference model. Default: 100.
- --unknown_label: unlabeled query category for scANVI training. Default: Unknown.
PanHumanPy optional controls:
- --panhumanpy_feature_names_col: optional query var column containing gene symbols when query.var_names are not gene symbols.
Output naming and merge behavior:
- --outdir: output directory. Default: autoannotate-results-<project_tag>.
- --scanvi_obs_col: scANVI label column name in output. Default: scanvi_label.
- --scanvi_score_col: scANVI confidence column name in output. Default: scanvi_confidence.
- --query_cluster_col: existing query obs cluster column used for cluster summaries. Default: leiden.
- --compute_missing_clusters: compute Leiden clusters in the merge step if --query_cluster_col is absent. Default: false.
When missing clusters are explicitly computed, the merge step reuses existing neighbors or X_pca if present. Otherwise it runs PCA/neighbors/Leiden on a normalized, log-transformed copy so the final query matrix is not modified.
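
The decision described above can be sketched as a small planning function. This is a hedged illustration, not the merge script itself: the function name and step names are hypothetical, and raising an error when the cluster column is absent and --compute_missing_clusters is false is an assumption, since this README does not state what happens in that case.

```python
from typing import Iterable, List


def plan_cluster_steps(
    obs_columns: Iterable[str],
    cluster_col: str = "leiden",
    compute_missing: bool = False,
    has_neighbors: bool = False,
    has_x_pca: bool = False,
) -> List[str]:
    """Return the ordered steps needed to obtain cluster labels."""
    if cluster_col in set(obs_columns):
        return []  # The cluster column already exists; nothing to compute.
    if not compute_missing:
        # Assumption: a missing column without the opt-in flag is an error.
        raise ValueError(
            f"'{cluster_col}' not found in query obs; "
            "pass --compute_missing_clusters true to compute it"
        )
    if has_neighbors:
        return ["leiden"]  # Reuse the existing neighbor graph.
    if has_x_pca:
        return ["neighbors", "leiden"]  # Build neighbors on the existing X_pca.
    # Work on a copy so the final query matrix is left unmodified.
    return ["copy", "normalize", "log1p", "pca", "neighbors", "leiden"]


print(plan_cluster_steps(["sample"], compute_missing=True, has_x_pca=True))
```
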
Parameter defaults live in nextflow.config.
Before model steps run, the pipeline checks that:
- input files exist
- ref_label_col exists in ref_h5ad.obs
- batch_key, if supplied, exists in both reference and query obs
- all scanvi_categorical_covariate_keys, if supplied, exist in both objects
- panhumanpy_feature_names_col, if supplied, exists in query_h5ad.var
- reference timepoint filters are valid and keep at least one reference cell
- reference and query cell indices are unique
- reference and query gene indices are unique
- reference and query gene identifiers use the same broad scheme, such as symbols or Ensembl IDs
- reference and query share at least one gene
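
Two of the gene-level checks above lend themselves to a short sketch: guessing the broad identifier scheme and computing the shared genes. The regular expression, the 50% threshold, and the function names are assumptions made for illustration; the pipeline's actual heuristics may differ.

```python
import re
from typing import Iterable, List

# Matches Ensembl gene IDs such as ENSG00000141510 or ENSMUSG00000059552,
# with an optional version suffix like ".12".
ENSEMBL_PATTERN = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def gene_id_scheme(gene_ids: Iterable[str], threshold: float = 0.5) -> str:
    """Classify an index as 'ensembl' or 'symbol' by majority pattern match."""
    ids = list(gene_ids)
    if not ids:
        raise ValueError("no gene identifiers supplied")
    n_ensembl = sum(1 for g in ids if ENSEMBL_PATTERN.match(g))
    return "ensembl" if n_ensembl / len(ids) >= threshold else "symbol"


def shared_genes(ref_genes: Iterable[str], query_genes: Iterable[str]) -> List[str]:
    """Genes present in both indices, in reference order."""
    query = set(query_genes)
    return [g for g in ref_genes if g in query]


ref = ["TP53", "BRCA1", "CD4"]
query = ["CD4", "TP53", "LYZ"]
if gene_id_scheme(ref) != gene_id_scheme(query):
    raise ValueError("reference and query use different gene identifier schemes")
print(shared_genes(ref, query))  # ['TP53', 'CD4']
```
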
The merge step also validates that every annotation CSV has unique cell_id values matching query.obs_names exactly. Missing or extra cells fail the merge instead of silently becoming NaN values.
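
A minimal sketch of this kind of check, assuming the cell_id column and query.obs_names have been read into plain lists. The function name is hypothetical, and the comparison here ignores row order, which is an assumption about what "exactly" means.

```python
from collections import Counter
from typing import Iterable


def validate_cell_ids(cell_ids: Iterable[str], obs_names: Iterable[str], method: str) -> None:
    """Fail loudly on duplicate, missing, or extra cell IDs in one annotation table."""
    cell_ids = list(cell_ids)
    duplicates = sorted(c for c, n in Counter(cell_ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"{method}: duplicate cell_id values, e.g. {duplicates[:5]}")
    expected, found = set(obs_names), set(cell_ids)
    missing = sorted(expected - found)  # query cells without a prediction
    extra = sorted(found - expected)    # predictions for unknown cells
    if missing or extra:
        raise ValueError(
            f"{method}: {len(missing)} missing and {len(extra)} extra cells "
            f"(missing e.g. {missing[:5]}, extra e.g. {extra[:5]})"
        )


obs_names = ["cell_1", "cell_2", "cell_3"]
validate_cell_ids(["cell_3", "cell_1", "cell_2"], obs_names, "celltypist")  # passes
```
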
Published files:
- <project_tag>.annotated.h5ad
- <project_tag>.cluster_annotation_summary.csv
The annotated .h5ad is the query object with additional obs columns.
Method columns copied into query.obs:
- celltypist_predicted_label: raw per-cell CellTypist label before majority voting.
- celltypist_majority_voting: CellTypist label after majority voting.
- celltypist_confidence: probability for the majority-voting label when available; otherwise the row maximum.
- <scanvi_obs_col>: scANVI predicted label. Default: scanvi_label.
- <scanvi_score_col>: maximum scANVI class probability. Default: scanvi_confidence.
- panhumanpy_full_hierarchical_label: PanHumanPy full hierarchy label.
- panhumanpy_level_zero_label: PanHumanPy broadest label.
- panhumanpy_final_level_label: PanHumanPy final selected label.
- panhumanpy_confidence: PanHumanPy final-level softmax probability.
- panhumanpy_azimuth_broad, panhumanpy_azimuth_medium, panhumanpy_azimuth_fine: refined PanHumanPy labels when returned.
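
The celltypist_confidence rule can be illustrated for a single cell as follows. This sketch assumes the per-cell probabilities are available as a label-to-probability mapping; the function name and the example values are made up for illustration.

```python
from typing import Mapping, Optional


def celltypist_confidence(
    probabilities: Mapping[str, float], majority_label: Optional[str]
) -> float:
    """Probability of the majority-voting label, else the row maximum."""
    if majority_label is not None and majority_label in probabilities:
        return probabilities[majority_label]
    # The majority-voting label has no probability column: fall back to the maximum.
    return max(probabilities.values())


row = {"T cell": 0.55, "NK cell": 0.40, "B cell": 0.05}
print(celltypist_confidence(row, "NK cell"))     # 0.4
print(celltypist_confidence(row, "Unassigned"))  # 0.55
```

Note that the first case reports 0.4 even though the row maximum is 0.55, because majority voting changed the label for that cell.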
Comparison columns created in query.obs:
- celltypist_scanvi_agree: per-cell boolean; true only when both methods have non-missing labels and celltypist_majority_voting equals the scANVI label.
- celltypist_scanvi_cluster_agreement_fraction: cluster-level fraction copied onto each cell in the cluster; the fraction of cells where celltypist_scanvi_agree is true.
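
The two comparison columns can be sketched with plain lists, using None for a missing label. The function name is hypothetical, and the denominator used here (all cells in the cluster, including those with a missing label) is one reading of the definition above.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def agreement_columns(
    clusters: Sequence[str],
    celltypist: Sequence[Optional[str]],
    scanvi: Sequence[Optional[str]],
) -> Tuple[List[bool], List[float]]:
    """Per-cell agreement flags and the per-cluster fraction copied to each cell."""
    # A cell agrees only if both labels are present and equal.
    agree = [
        a is not None and b is not None and a == b
        for a, b in zip(celltypist, scanvi)
    ]
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for cluster, ok in zip(clusters, agree):
        totals[cluster] = totals.get(cluster, 0) + 1
        hits[cluster] = hits.get(cluster, 0) + int(ok)
    # Broadcast each cluster's fraction back onto its cells.
    fraction = [hits[c] / totals[c] for c in clusters]
    return agree, fraction


agree, fraction = agreement_columns(
    clusters=["0", "0", "0", "1"],
    celltypist=["T", "T", None, "B"],
    scanvi=["T", "NK", "T", "B"],
)
print(agree)  # [True, False, False, True]
```
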
The cluster summary is grouped by --query_cluster_col and includes:
- <query_cluster_col>: cluster identifier.
- n_cells: number of cells in the cluster.
- celltypist_majority_top_label: most frequent CellTypist majority-voting label.
- celltypist_majority_top_fraction: fraction of non-missing CellTypist labels assigned to that top label.
- scanvi_top_label: most frequent scANVI label.
- scanvi_top_fraction: fraction of non-missing scANVI labels assigned to that top label.
- panhumanpy_top_label: most frequent PanHumanPy final-level label, reported independently.
- panhumanpy_top_fraction: fraction of non-missing PanHumanPy labels assigned to that top label.
- celltypist_scanvi_cluster_modal_match: whether the CellTypist and scANVI modal labels match in the cluster.
- celltypist_scanvi_cluster_agreement_fraction: fraction of cells in the cluster where CellTypist and scANVI agree.
celltypist_scanvi_agree and celltypist_scanvi_cluster_modal_match are intentionally different summaries. A cluster can have matching modal labels while many individual cells disagree. The modal-match boolean is therefore kept in the cluster summary, while the more interpretable agreement fraction is copied to query.obs.
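
A small worked example makes the difference concrete. The helper top_label and the five-cell cluster below are invented for illustration; they are not taken from the pipeline or from real data.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def top_label(labels: Sequence[Optional[str]]) -> Tuple[Optional[str], float]:
    """Most frequent non-missing label and its share of non-missing labels."""
    present = [label for label in labels if label is not None]
    if not present:
        return None, float("nan")
    label, count = Counter(present).most_common(1)[0]
    return label, count / len(present)


# One cluster of five cells, labeled by both methods.
celltypist = ["T", "T", "T", "NK", "NK"]
scanvi = ["NK", "NK", "T", "T", "T"]

ct_top, ct_fraction = top_label(celltypist)  # ('T', 0.6)
sc_top, sc_fraction = top_label(scanvi)      # ('T', 0.6)
modal_match = ct_top == sc_top
agreement_fraction = sum(a == b for a, b in zip(celltypist, scanvi)) / len(celltypist)

print(modal_match, agreement_fraction)  # True 0.2
```

Both methods call the cluster "T" at 60%, so the modal labels match, yet only one cell in five receives the same label from both methods.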
Intermediate files remain in Nextflow work directories, including:
- validation_manifest.json
- shared_genes.txt
- celltypist_model.pkl
- celltypist_model_metadata.json
- celltypist_predictions.csv
- scanvi_model/
- scanvi_model_metadata.json
- scanvi_predictions.csv
- panhumanpy_predictions.csv
Nextflow timeline, report, and trace files are written under autoannotate-reports-<project_tag>/.
To create local reference and query inputs, run:
python scripts/prepare_scanpy_dataset.py --outdir data/demo
By default this downloads the full PBMC68K object into data/demo/raw/, writes an 80/20 reference/query split, and records split metadata:
- data/demo/reference.h5ad
- data/demo/query.h5ad
- data/demo/dataset_split_metadata.json
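
For intuition, an 80/20 split over cell identifiers could be produced as sketched below. This is an assumption-laden illustration: the helper script may split differently (for example with another seed or with stratification by label), and split_cells is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def split_cells(
    cell_ids: Sequence[str], reference_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle cell IDs reproducibly and split them into reference and query sets."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so global random state is untouched
    n_reference = int(round(len(ids) * reference_fraction))
    return ids[:n_reference], ids[n_reference:]


cells = [f"cell_{i}" for i in range(10)]
reference_ids, query_ids = split_cells(cells)
print(len(reference_ids), len(query_ids))  # 8 2
```
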
Use the generated files for a training-mode demo run:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad data/demo/reference.h5ad \
--query_h5ad data/demo/query.h5ad \
--project_tag demo \
--ref_label_col bulk_labels \
--query_cluster_col louvain \
--outdir results/demo
The helper also supports other sources:
python scripts/prepare_scanpy_dataset.py --dataset pbmc68k_reduced --outdir data/pbmc68k_reduced_demo
python scripts/prepare_scanpy_dataset.py --dataset pbmc3k --outdir data/pbmc3k_demo
python scripts/prepare_scanpy_dataset.py --input-h5ad /path/to/input.h5ad --outdir data/custom_demo
--dataset pbmc68k is the default and downloads the larger Figshare-hosted PBMC68K object into the selected output directory before splitting it. --dataset pbmc68k_reduced and --dataset pbmc3k use datasets provided directly by Scanpy.
Train CellTypist and scANVI from the reference:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--query_cluster_col leiden
Reuse a pretrained CellTypist model while training scANVI:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--celltypist_model /path/to/celltypist_model.pkl
Reuse pretrained CellTypist and scANVI artifacts:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--celltypist_model /path/to/celltypist_model.pkl \
--scanvi_model_dir /path/to/scanvi_model_dir
Initialize scANVI training from a pretrained scVI reference model:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--scvi_reference_model_dir /path/to/scvi_model_dir
Allow the merge step to compute missing Leiden clusters:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--compute_missing_clusters true
Notes:
- Input .h5ad files should already be normalized/preprocessed appropriately for the chosen annotation methods.
- Gene alignment for CellTypist and scANVI uses the reference-query intersection from validation.
- PanHumanPy receives the original query feature space, optionally using --panhumanpy_feature_names_col for symbols.
- CellTypist, scANVI, and PanHumanPy outputs are kept separate.
- PanHumanPy is not compared with CellTypist or scANVI because it uses a different reference.