hillerlab/softmask

softmask

lightweight workflow for soft-masking genomes
The Hiller Lab at the Senckenberg Research Institute

masking . pipeline . us


CAGATGATGATGATGATGATGATGATGAGCTT
█████████████████████░░░░░░░░░░░
CAGATGATGATGATGATGATgatgatgagctt
└────────unique────┘└──repeat──┘

Important

  • Masking: This pipeline soft-masks genomes (repeats are converted to lowercase); it does not hard-mask them (repeats replaced with N). This framework is compatible with our whole-genome alignment chain pipeline, make_lastz_chains.
  • Scaffold names: The scaffolds of the input genome are renamed. See the repository for the specific formatting rules.
  • Inputs accepted: .fasta, .2bit, or .gz.
  • Custom library: A custom repeat library can be provided under repeat_library in params.json.
  • Container image: We offer a pre-built container image for the whole pipeline as well as for individual modules. By default the pipeline runs with ghcr.io/hillerlab/softmask:latest. Additional images can be found at containers, and Nextflow modules at core.
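To make the soft-masking versus hard-masking distinction concrete, here is a minimal sketch that is not part of the pipeline. It takes a toy soft-masked FASTA (the sequence from the diagram above) and derives a hard-masked version with sed. The function name hardmask and the record name chr1 are hypothetical and used only for illustration.

```shell
# hardmask: read FASTA on stdin (or from files) and replace every
# lowercase (soft-masked) base on sequence lines with N.
# Header lines (starting with '>') are left untouched.
hardmask() {
  sed '/^>/!s/[a-z]/N/g' "$@"
}

# Toy soft-masked record: 20 unique (uppercase) + 12 repeat (lowercase) bases.
printf '>chr1\nCAGATGATGATGATGATGATgatgatgagctt\n' | hardmask
```

The soft-masked form keeps the repeat sequence recoverable (uppercase it again to unmask), whereas the hard-masked form discards it.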

Usage

Note

Requirements: Nextflow ≥ 25.04.6, Docker or Apptainer, Java.

git clone https://github.com/hillerlab/softmask.git
cd softmask

Edit params.json (set genome, assembly_prefix), then:

# Docker
nextflow run main.nf -params-file params.json -profile docker

# Apptainer / Singularity
nextflow run main.nf -params-file params.json -profile apptainer

Smoke test:

nextflow run main.nf -profile test,apptainer

Note

You can also specify these options directly in params.json.
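As a rough illustration of the parameter file, the sketch below writes an example params file using only the three keys named in this README (genome, assembly_prefix, repeat_library). All values are hypothetical placeholders, the file name params.example.json is chosen here to avoid overwriting the shipped params.json, and the real file may contain further keys not shown.

```shell
# Write an example parameter file. Every value is a placeholder;
# repeat_library is assumed to be optional (only needed for a custom library).
cat > params.example.json <<'EOF'
{
  "genome": "/path/to/genome.fasta",
  "assembly_prefix": "myAssembly",
  "repeat_library": "/path/to/custom_repeats.fa"
}
EOF

# Then point Nextflow at it (shown as a comment, not executed here):
# nextflow run main.nf -params-file params.example.json -profile docker
```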

A helper shell script is provided to run the pipeline on a SLURM cluster. See details below.

Edit the path variables at the top of assets/hpc/do_softmask.sh (cache dir, container image, manifest path), then submit:

sbatch --array=1-<N> do_softmask.sh

Each array task spawns one Nextflow head job that submits all compute as child SLURM jobs.

REPEATMASKER tasks run as SLURM job arrays. Partition routing, array sizes, and resource tiers are documented inline in nextflow.config; edit them there to match your cluster.
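One way to pick <N> for the array submission is sketched below. It assumes, and this is an assumption rather than something documented here, that the manifest lists one genome per non-empty line with no header. The function name count_genomes and the manifest contents are hypothetical, and the sbatch command is only printed, not submitted.

```shell
# count_genomes: number of non-empty lines in a manifest file.
# Assumption: one genome per line, no header row.
count_genomes() {
  grep -c . "$1"
}

# Toy manifest with three hypothetical entries.
manifest=$(mktemp)
printf 'genomeA.fasta\ngenomeB.2bit\n\ngenomeC.fasta.gz\n' > "$manifest"

n=$(count_genomes "$manifest")

# Dry run: print the submission command instead of calling sbatch.
echo "sbatch --array=1-${n} do_softmask.sh"
```

Check the comments in assets/hpc/do_softmask.sh for the actual manifest layout before relying on this count.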


Output

results/
├── 01_renamed/      *fasta
├── 02_database/     {assembly}.* [ repeatmodeler database ]
├── 03_model/        *.fa 
├── 04_mask/         *.masked/*out/*tbl
├── 05_final/        *.{2bit,fasta,gz}/*.repeats.tsv
└── pipeline_info/    timeline, trace, DAG
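A quick sanity check on a final FASTA is to measure what fraction of bases ended up soft-masked. The sketch below is not part of the pipeline; the function name softmask_fraction is hypothetical, and it counts any lowercase letter on a sequence line as masked.

```shell
# softmask_fraction: print the fraction of lowercase (soft-masked) bases
# among all bases in a FASTA read from stdin or from the given files.
softmask_fraction() {
  awk '!/^>/ {
         total  += length($0)            # all bases on this sequence line
         masked += gsub(/[a-z]/, "")     # gsub returns the number of lowercase bases
       }
       END {
         if (total > 0) printf "%.4f\n", masked / total
         else           print "0.0000"
       }' "$@"
}

# Toy record from the diagram at the top: 12 of 32 bases are lowercase.
printf '>chr1\nCAGATGATGATGATGATGATgatgatgagctt\n' | softmask_fraction
```

For a gzipped output, something like `gzip -dc results/05_final/<assembly>.fasta.gz | softmask_fraction` should work, where the file name is a placeholder for whatever the run produced.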

Where to edit

File              What                                               Edit how often
params.json       Genome paths, alignment settings, checkpoints      Per run
nextflow.config   Compute resources, profiles, container, SLURM      Rarely
