lightweight workflow for soft-masking genomes
The Hiller Lab at the Senckenberg Research Institute
CAGATGATGATGATGATGATGATGATGAGCTT
█████████████████████░░░░░░░░░░░
CAGATGATGATGATGATGATgatgatgagctt
└────────unique────┘└──repeat──┘
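To make the diagram concrete, the sketch below counts what fraction of a FASTA file is soft-masked (lowercase). The function name softmasked_fraction is hypothetical and not part of the pipeline, and it assumes that a, c, g, t and n are the only masked characters.

```shell
# Hypothetical helper (not part of the pipeline): print the fraction of
# bases in a FASTA file that are soft-masked, i.e. written in lowercase.
# ASSUMPTION: masked bases are limited to a, c, g, t and n.
softmasked_fraction() {
  awk '
    /^>/ { next }                      # skip header lines
    {
      total  += length($0)             # all bases on this line
      masked += gsub(/[acgtn]/, "")    # gsub returns how many lowercase bases it removed
    }
    END {
      if (total > 0) printf "%.3f\n", masked / total
      else           print "0.000"
    }
  ' "$1"
}
```

For the 32-base sequence in the diagram, 12 bases are lowercase, so the function prints 0.375.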
Important
- Masking: This pipeline soft-masks genomes; it does not hard-mask them. Its output is compatible with our whole-genome alignment chain pipeline, make_lastz_chains.
- Scaffold names: The input genome is renamed according to specific formatting rules.
- Inputs accepted: .fasta, .2bit, or .gz.
- Custom library: A repeat library can be provided under repeat_library in params.json.
- Container image: We offer a pre-built container image for the whole pipeline as well as for individual modules. By default the pipeline runs with ghcr.io/hillerlab/softmask:latest. Additional images can be found at containers and Nextflow modules at core.
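Soft-masking keeps the original bases, so a hard-masked copy can always be derived from the pipeline's output when a downstream tool needs one, while the reverse is not possible. A minimal sketch of that conversion follows; hardmask_fasta is a hypothetical helper, not something the pipeline ships, and it assumes masked bases are limited to a, c, g, t and n.

```shell
# Hypothetical helper (not part of the pipeline): turn a soft-masked FASTA
# into a hard-masked one by replacing lowercase bases with N.
# ASSUMPTION: masked bases are limited to a, c, g, t and n.
hardmask_fasta() {
  # Header lines (starting with ">") are left untouched;
  # only sequence lines are rewritten.
  sed '/^>/!s/[acgtn]/N/g' "$1"
}
```

Usage: hardmask_fasta genome.softmasked.fasta > genome.hardmasked.fasta (both file names are placeholders).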
Note
Requirements: Nextflow ≥ 25.04.6, Docker or Apptainer, Java.
git clone https://github.com/hillerlab/softmask.git
cd softmask

Edit params.json (set genome, assembly_prefix), then:
# Docker
nextflow run main.nf -params-file params.json -profile docker
# Apptainer / Singularity
nextflow run main.nf -params-file params.json -profile apptainer

Smoke test:

nextflow run main.nf -profile test,apptainer

Note
You can also specify these options directly in params.json.
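As a rough illustration, a parameters file could be written like this. Only the key names genome, assembly_prefix and repeat_library come from this README; every value is a hypothetical placeholder, and treating repeat_library as a file path is an assumption. The repository's own params.json is the authoritative template.

```shell
# Write an example parameters file. The key names come from this README;
# every value below is a hypothetical placeholder.
# ASSUMPTION: repeat_library takes a path to a custom repeat library.
cat > params.example.json <<'EOF'
{
  "genome": "/path/to/genome.fasta",
  "assembly_prefix": "myAssembly",
  "repeat_library": "/path/to/custom_repeats.fa"
}
EOF
```

Once the paths are filled in, the file can be passed to Nextflow with -params-file params.example.json.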
A helper shell script is provided to run the pipeline on a SLURM cluster. See details below.
Edit the path variables at the top of assets/hpc/do_softmask.sh (cache dir, container image, manifest path), then submit:
sbatch --array=1-<N> do_softmask.sh

Each array task spawns one Nextflow head job that submits all compute as child SLURM jobs. REPEATMASKER runs as SLURM job arrays. Partition routing, array sizes, and resource tiers are documented inline in nextflow.config; edit them there to match your cluster.
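The value of <N> in the sbatch command should match the number of genomes to process. The sketch below derives it from a manifest file, assuming one genome per non-empty line. That format, the manifest.txt path and the count_manifest_entries helper are all assumptions; check assets/hpc/do_softmask.sh for what the script actually expects.

```shell
# ASSUMPTION: the manifest lists one genome per non-empty line.
# Hypothetical helper: count the non-blank lines in a manifest file.
count_manifest_entries() {
  grep -c '[^[:space:]]' "$1"
}

manifest="manifest.txt"   # hypothetical path
if [ -f "$manifest" ]; then
  n=$(count_manifest_entries "$manifest")
  # Print the submit command instead of running it, so it can be reviewed first.
  echo "sbatch --array=1-${n} do_softmask.sh"
fi
```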
results/
├── 01_renamed/ *fasta
├── 02_database/ {assembly}.* [ repeatmodeler database ]
├── 03_model/ *.fa
├── 04_mask/ *.masked/*out/*tbl
├── 05_final/ *.{2bit,fasta,gz}/*.repeats.tsv
└── pipeline_info/ timeline, trace, DAG
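After a run, a quick check that every stage left its output directory can catch a silently incomplete run. This is a minimal sketch: check_results is a hypothetical helper, and the directory names are taken from the tree above.

```shell
# Hypothetical helper: verify that each expected output directory exists.
# Prints one "missing: <dir>" line per absent directory and returns 1 if
# any directory is absent, 0 otherwise.
check_results() {
  results_dir=$1
  missing=0
  for d in 01_renamed 02_database 03_model 04_mask 05_final pipeline_info; do
    if [ ! -d "$results_dir/$d" ]; then
      echo "missing: $d"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: check_results results && echo "all stages present".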
| File | What it controls |
|---|---|
| params.json | Genome paths, alignment settings, checkpoints (edited per run) |
| nextflow.config | Compute resources, profiles, container, SLURM (rarely edited) |
