ProjecticumDataScience/leptospira_complete

Bash Scripts for Leptospira Genome Assembly

This repository contains a collection of bash scripts used for the genome assembly and annotation of Leptospira using three different strategies: long‑read assembly, paired‑end short‑read assembly, and hybrid assembly.

Repository Structure

  • README.md = Main documentation file
  • leptospira_complete.Rproj = RStudio project file
  • .gitignore = Git ignore rules
  • followup_recommendations = Text file with recommendations for follow-up research
  • scripts/ = Script folder
    • scripts_genome_annotation/ = Genome annotation script folder (scripts are grouped by tool)
    • conda_install_paired.sh = Installation script for the paired-end pipeline tools (do not run the whole script at once)
    • end2end_paired_reads.sh = Paired-end assembly pipeline
    • hybrid.sh = Hybrid assembly pipeline
    • longreads_end2end.sh = Long-read assembly pipeline
    • software_installation_nanopore.sh = Installation script for the Nanopore pipeline tools (do not run the whole script at once)
  • Speclist/ = Speclist folder
    • conda_specs_genome_annotation/ = Speclists for the packages used in the genome annotation pipeline
    • conda_specs_illumina/ = Speclists for the packages used in the paired-end pipeline
    • conda_specs_nanopore/ = Speclists for the packages used in the long-read pipeline

Installation & Dependencies

Before running the assembly scripts, make sure the following tools are installed:

  • Flye
  • Unicycler
  • FastQC, MultiQC
  • Fastp
  • Rasusa
  • KMC, GenomeScope2
  • CheckM2, GUNC, QUAST, Barrnap
  • GTDB-Tk
  • MOB-suite
  • NanoPlot, Chopper

Before running the genome annotation scripts, make sure the following tools are installed:

  • Prokka
  • AMRFinderPlus
  • PanViTa
  • RGI
  • CRISPRcasFinder
  • VIBRANT
  • FastMLST
  • EggNOG-mapper
  • COG classifier
  • dbCAN

You can install most assembly tools with the provided scripts conda_install_paired.sh and software_installation_nanopore.sh. Installation commands for the genome annotation tools are in the _setup scripts; for FastMLST and dbCAN they are in comments at the top of the corresponding run script.

The specifications of all packages used in the conda environments can be found in the Speclist folder.
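
If you want to rebuild an environment from a speclist, the following is a minimal sketch. It assumes the speclists are plain-text files ending in .txt in a format that `conda create --file` accepts; neither the extension nor the format is confirmed here. The function name `print_conda_create_cmds` is hypothetical and only prints the commands, so you can inspect them before running anything.

```shell
#!/usr/bin/env bash
# Dry run: print one `conda create` command per speclist in a folder.
# Assumption: speclists end in .txt and are readable by `conda create --file`.
print_conda_create_cmds() {
    local dir="$1" spec name
    for spec in "$dir"/*.txt; do
        [ -e "$spec" ] || continue          # glob matched nothing
        name="$(basename "$spec" .txt)"     # environment name = file name without .txt
        echo "conda create --name $name --file $spec"
    done
}
```

For example, `print_conda_create_cmds Speclist/conda_specs_nanopore` would list the commands for the long-read environments.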

Usage Guidelines

Long-Reads Genome Assembly

To use the long‑read assembly script, place the raw Nanopore read data in a folder named 1_reads. You can run the script using the following command:

bash longreads_end2end.sh --nano-raw
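
Before starting a long run it can help to confirm that the input folder is in place. The sketch below is not part of the repository; the function name `check_reads_dir` is hypothetical, and it assumes the reads are FASTQ files ending in .fastq, .fq, .fastq.gz or .fq.gz.

```shell
#!/usr/bin/env bash
# Pre-flight check: the reads folder must exist and contain FASTQ files.
# Assumption: reads end in .fastq, .fq, .fastq.gz or .fq.gz.
check_reads_dir() {
    local dir="${1:-1_reads}"
    if [ ! -d "$dir" ]; then
        echo "Missing folder: $dir" >&2
        return 1
    fi
    local count
    count=$(find "$dir" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "No FASTQ files found in $dir" >&2
        return 1
    fi
    echo "Found $count FASTQ file(s) in $dir"
}
```

A possible use is `check_reads_dir 1_reads && bash longreads_end2end.sh --nano-raw`, so the pipeline only starts when reads are present.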

Short-reads Genome Assembly

To use the short-read assembly script, place the raw paired-end read data in the 1_reads folder. Then run the script with the following command:

bash end2end_paired_reads.sh
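
Paired-end assembly needs both mates of every sample. The sketch below reports forward read files that have no reverse partner. The naming pattern `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` is an assumption and may differ from what the pipeline expects; adjust the suffixes to your data. The function name `find_unpaired_reads` is hypothetical.

```shell
#!/usr/bin/env bash
# Report R1 files without a matching R2 file.
# Assumption: mates are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
find_unpaired_reads() {
    local dir="${1:-1_reads}"
    local r1 r2 missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # expected mate file
        if [ ! -e "$r2" ]; then
            echo "No mate for: $r1" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeeds only when every R1 file has a mate (also when no R1 files exist).
    [ "$missing" -eq 0 ]
}
```

Running `find_unpaired_reads 1_reads` before the pipeline prints any unmatched files to standard error and returns a non-zero status if it finds one.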

Hybrid Genome Assembly

For the hybrid assembly, make sure the 3_fastp folder from the Illumina analysis and the 3_chopper folder from the Nanopore analysis are in the same directory. These two folders, rather than the 1_reads folder, are the input of the script. Then run the hybrid script with the following command:

bash hybrid.sh
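
The folder requirement above can be checked with a small helper. This is a sketch, not part of the repository; the function name `check_hybrid_inputs` is hypothetical, and it only tests that both folders exist, not what they contain.

```shell
#!/usr/bin/env bash
# Check that both hybrid input folders sit side by side in one directory.
check_hybrid_inputs() {
    local base="${1:-.}"
    local dir status=0
    for dir in 3_fastp 3_chopper; do
        if [ ! -d "$base/$dir" ]; then
            echo "Missing input folder: $base/$dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_hybrid_inputs . && bash hybrid.sh` starts the hybrid pipeline only when both folders are present in the current directory.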

Genome Annotation

Use the scripts in the scripts_genome_annotation folder only after the long-read, short-read or hybrid assembly has finished. They only work if the 10_assemblies_for_analysis folder is in the same folder as the annotation scripts. Run 11_GenomeAnnotation.sh first; it takes the .fsa assembly files in the 10_assemblies_for_analysis directory as input. All other annotation scripts can then be run individually, and use the Prokka files in 11_genome_annotation as input.

bash 11_GenomeAnnotation_setup.sh
bash 11_GenomeAnnotation.sh
bash 12_GenomeAnnotationVirulenceAndResistanceGenes_setup.sh
bash 12_GenomeAnnotation_FA_COGclassifier.sh
bash 12_GenomeAnnotation_FA_dbCAN.sh
bash 12_GenomeAnnotation_MGE_CRISPRcasFinder.sh
bash 12_GenomeAnnotation_MGE_VIBRANT.sh
bash 12_GenomeAnnotation_MGE_setup.sh
bash 12_GenomeAnnotation_Moltyping_FastMLST.sh
bash 12_GenomeAnnotation_VaRG_AMRFinderPlus.sh
bash 12_GenomeAnnotation_VaRG_PanViTa.sh
bash 12_GenomeAnnotation_VaRG_RGI.sh
bash 12_GenomeAnnotation_eggnog.sh
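
The only ordering constraint described above is that 11_GenomeAnnotation.sh runs before the 12_* scripts. The sketch below prints one valid order for the scripts found in a folder. It skips the _setup scripts, on the assumption that they only install tools and are run separately. The function name `annotation_run_order` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the annotation scripts in a valid run order: the Prokka step first,
# then every non-setup 12_GenomeAnnotation_* script found in the folder.
annotation_run_order() {
    local dir="${1:-.}"
    local script
    echo "11_GenomeAnnotation.sh"
    for script in "$dir"/12_GenomeAnnotation_*.sh; do
        [ -e "$script" ] || continue      # glob matched nothing
        case "$script" in
            *_setup.sh) continue ;;       # assumed to install tools only
        esac
        basename "$script"
    done
}
```

Inside the scripts_genome_annotation folder, `annotation_run_order .` lists the order; piping it into `while read -r s; do bash "$s"; done` would run the scripts one after another, provided the tools for each step are already installed.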

Data Organization

After running each script, manually move its output into the correct folder before starting the next script.

Workflows

The Short Paired-End Reads Genome Assembly Workflow

  1. Sequencing reads directory and files
    • Check local read files
  2. Raw reads quality assessment
    • FastQC
    • MultiQC
  3. Raw reads trimming, estimation of genome size and downsampling
    • Fastp
    • Estimation of genome size (KMC and GenomeScope) and downsampling (Rasusa)
  4. Trimmed reads quality assessment
    • FastQC
    • MultiQC
  5. De novo assembly
    • Unicycler
  6. Organizing de novo assembly files
  7. Assembly quality assessment
    • CheckM2
    • GUNC
    • QUAST
    • Barrnap
    • Calculation of vertical sequencing coverage
  8. Taxonomic assignment
    • GTDB-Tk
    • TYGS (online)
  9. Plasmid identification
    • MOB-suite
  10. Assignment of contigs to molecules
    • MOB-suite and an in-house script

The Long-Reads Genome Assembly Workflow

  1. Sequencing reads directory and files
    • Check local read files
  2. Raw reads quality assessment
    • NanoPlot
  3. Raw reads trimming and estimation of genome size
    • Chopper
    • Estimation of genome size (KMC and GenomeScope)
  4. Trimmed reads quality assessment
    • NanoPlot
  5. De novo assembly
    • Flye
  6. Organization of de novo assembly files
  7. Assembly quality assessment
    • CheckM2
    • GUNC
    • QUAST
    • Barrnap
    • Calculation of vertical sequencing coverage
  8. Taxonomic assignment
    • GTDB-Tk
  9. Plasmid identification
    • MOB-suite
  10. Assignment of contigs to molecules
    • MOB-suite and an in-house script

The Hybrid Genome Assembly Workflow

  1. Adding the chopper and fastp files
    • Adding the input files
  2. De novo assembly
    • Unicycler
  3. Organization of de novo assembly files
  4. Assembly quality assessment
    • CheckM2
    • GUNC
    • QUAST
    • Barrnap
    • Calculation of vertical sequencing coverage
  5. Taxonomic assignment
    • GTDB-Tk
  6. Plasmid identification
    • MOB-suite
  7. Assignment of contigs to molecules
    • MOB-suite and an in-house script
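
Each assembly workflow above includes a calculation of vertical sequencing coverage. The repository's own calculation is not shown here, so the following is only a sketch of one common definition: total read bases divided by total assembly bases. The function names are hypothetical, and the FASTQ helper assumes an uncompressed file with exactly four lines per record.

```shell
#!/usr/bin/env bash
# Sum the sequence lengths in an uncompressed FASTQ file (4 lines per record).
fastq_total_bases() {
    awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }' "$1"
}

# Sum the sequence lengths in a FASTA file, skipping header lines.
fasta_total_bases() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Mean coverage = read bases / assembly bases, printed with one decimal.
mean_coverage() {
    local read_bases="$1" assembly_bases="$2"
    awk -v r="$read_bases" -v a="$assembly_bases" \
        'BEGIN { if (a == 0) exit 1; printf "%.1f\n", r / a }'
}
```

For gzipped reads, decompress first, for example with `gzip -dc reads.fastq.gz > reads.fastq`. A call such as `mean_coverage "$(fastq_total_bases reads.fastq)" "$(fasta_total_bases assembly.fasta)"` then gives the estimate; the file names here are placeholders.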

The Genome Annotation Workflow

  1. Genome Annotation
    • Prokka
  2. Annotation of virulence and resistance genes (VaRG)
    • AMRFinderPlus
    • PanViTa
    • RGI
  3. Annotation of mobile genetic elements (MGE)
    • CRISPRcasFinder
    • VIBRANT
  4. Molecular typing (Moltyping)
    • FastMLST
  5. Functional annotation (FA)
    • EggNOG-mapper
    • COG classifier
    • dbCAN

About

Repository with the combined scripts from both project groups for the Leptospira projecticum, Nov 2025 to Jan 2026.
