Bash Scripts for Leptospira Genome Assembly

This repository contains a collection of bash scripts used for the genome assembly and annotation of Leptospira using three different strategies: long‑read assembly, paired‑end short‑read assembly, and hybrid assembly.

Repository Structure

README.md = Main documentation file
leptospira_complete.Rproj = RStudio project file
.gitignore = Git ignore rules
followup_recommendations = Text file with recommendations for follow-up research
scripts/ = Script folder
- scripts_genome_annotation/ = Genome annotation script folder (scripts are grouped by tool)
- conda_install_paired.sh = Installation script for the paired-end pipeline (don't run the whole script)
- end2end_paired_reads.sh = Paired-end assembly pipeline
- hybrid.sh = Hybrid assembly pipeline
- longreads_end2end.sh = Long-read assembly pipeline
- software_installation_nanopore.sh = Nanopore installation script (don't run the whole script)
Speclist/ = Speclist folder
- conda_specs_genome_annotation/ = Speclists for the packages used in the genome annotation pipeline
- conda_specs_illumina/ = Speclists for the packages used in the paired-end pipeline
- conda_specs_nanopore/ = Speclists for the packages used in the long-read pipeline

Installation & Dependencies

Before running the assembly scripts, make sure the following tools are installed:

Flye
Unicycler
FastQC, MultiQC
Fastp
Rasusa
KMC, GenomeScope2
CheckM2, GUNC, QUAST, Barrnap
GTDB-Tk
MOB-suite
NanoPlot, Chopper

Before running the genome annotation scripts, make sure the following tools are installed:

Prokka
AMRFinderPlus
PanViTa
RGI
CRISPRcasFinder
VIBRANT
FastMLST
EggNOG-mapper
COG classifier
dbCAN

You can install most assembly tools using the provided scripts conda_install_paired.sh and software_installation_nanopore.sh. Run their commands step by step rather than executing the whole script. Installation steps for the genome annotation tools are in the corresponding _setup scripts; for FastMLST and dbCAN they are given in comments at the top of the respective run script.

The specifications of all packages used in the conda environments can be found in the Speclist folder.
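Before starting a pipeline, it can help to confirm that the required tools are actually on your PATH. The function below is a minimal sketch and is not part of this repository: check_tools is a hypothetical helper, and the executable names you pass to it are your own assumption, since they depend on how each tool was installed and on which conda environment is active.

```shell
# Report which of the given commands cannot be found on PATH.
# Returns 0 if every command is found, 1 if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, inside the activated environment for the paired-end pipeline you might run check_tools fastqc multiqc fastp and install whatever is reported as missing.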

Usage Guidelines

Long-Reads Genome Assembly

To use the long‑read assembly script, place the raw Nanopore read data in a folder named 1_reads. You can run the script using the following command:

bash longreads_end2end.sh --nano-raw
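A quick check that 1_reads exists and contains read files can save a failed run. This is a sketch under assumptions: count_fastq is a hypothetical helper, and the file extensions it looks for (.fastq, .fq, optionally gzipped) are a guess at how your reads are named, not something the pipeline scripts are known to require.

```shell
# Count FASTQ files (plain or gzipped) directly inside a reads folder.
# The extensions listed here are assumptions; adjust them to your data.
count_fastq() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' '
}

reads_dir="1_reads"
if [ -d "$reads_dir" ] && [ "$(count_fastq "$reads_dir")" -gt 0 ]; then
    echo "Found $(count_fastq "$reads_dir") FASTQ file(s) in $reads_dir"
else
    echo "No FASTQ files found in $reads_dir" >&2
fi
```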

Short-Reads Genome Assembly

To use the short-read assembly script, place the paired-end read data in the 1_reads folder. You can then run the script using the following command:

bash end2end_paired_reads.sh

Hybrid Genome Assembly

For the hybrid assembly, make sure that the 3_fastp folder from the Illumina analysis and the 3_chopper folder from the Nanopore analysis are in the same directory. These two folders, rather than the 1_reads folder, are the input of the script. You can then run the hybrid script using the following command:

bash hybrid.sh
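The folder requirement above can be verified before launching the hybrid run. The following is a minimal sketch, not part of the repository; check_hybrid_inputs is a hypothetical helper that only tests for the presence of the two folders and does not inspect their contents.

```shell
# Return 0 only if both expected input folders exist in the given directory
# (defaults to the current directory); report any missing folder on stderr.
check_hybrid_inputs() {
    local base="${1:-.}" dir status=0
    for dir in 3_fastp 3_chopper; do
        if [ ! -d "$base/$dir" ]; then
            echo "missing folder: $base/$dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

One way to use it is check_hybrid_inputs . && bash hybrid.sh, so the pipeline only starts when both folders are in place.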

Genome Annotation

Use the scripts in the scripts_genome_annotation folder only after the long-read, short-read, or hybrid assembly has finished completely. These scripts only work if the 10_assemblies_for_analysis folder is in the same folder as the annotation scripts. Run 11_GenomeAnnotation.sh first; it takes the .fsa assembly files in the 10_assemblies_for_analysis directory as input. All other annotation scripts can be run individually afterwards and use the Prokka files in 11_genome_annotation as input.

bash 11_GenomeAnnotation_setup.sh
bash 11_GenomeAnnotation.sh
bash 12_GenomeAnnotationVirulenceAndResistanceGenes_setup.sh
bash 12_GenomeAnnotation_VaRG_AMRFinderPlus.sh
bash 12_GenomeAnnotation_VaRG_PanViTa.sh
bash 12_GenomeAnnotation_VaRG_RGI.sh
bash 12_GenomeAnnotation_MGE_setup.sh
bash 12_GenomeAnnotation_MGE_CRISPRcasFinder.sh
bash 12_GenomeAnnotation_MGE_VIBRANT.sh
bash 12_GenomeAnnotation_Moltyping_FastMLST.sh
bash 12_GenomeAnnotation_eggnog.sh
bash 12_GenomeAnnotation_FA_COGclassifier.sh
bash 12_GenomeAnnotation_FA_dbCAN.sh
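Because 11_GenomeAnnotation.sh expects .fsa files in 10_assemblies_for_analysis, it can be useful to list what it will find before starting. This is a sketch only; list_assemblies is a hypothetical helper and assumes the assemblies sit directly in that folder rather than in subfolders.

```shell
# Print the .fsa assembly files found directly inside an assemblies folder,
# one per line, sorted. Prints nothing if the folder does not exist.
list_assemblies() {
    local dir="${1:-10_assemblies_for_analysis}"
    [ -d "$dir" ] || return 0
    find "$dir" -maxdepth 1 -type f -name '*.fsa' | sort
}

assemblies=$(list_assemblies)
if [ -n "$assemblies" ]; then
    echo "Assemblies ready for annotation:"
    echo "$assemblies"
else
    echo "No .fsa files in 10_assemblies_for_analysis; run an assembly pipeline first." >&2
fi
```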

Data Organization

After running each script, make sure you manually move its output into the correct folder.

Workflows

The Short Paired-End Reads Genome Assembly Workflow

Sequencing reads directory and files
- Check local read files
Raw reads quality assessment
- FastQC
- MultiQC
Raw reads trimming, estimation of genome size and downsampling
- Fastp
- Estimation of genome size (KMC and GenomeScope) and downsampling (Rasusa)
Trimmed reads quality assessment
- FastQC
- MultiQC
De novo assembly
- Unicycler
Organizing de novo assembly files
Assembly quality assessment
- CheckM2
- GUNC
- QUAST
- Barrnap
- Calculation of vertical sequencing coverage
Taxonomic assignment
- GTDB-Tk
- TYGS (online)
Plasmid identification
- MOB-suite
Assignment of contigs to molecules
- MOB-suite and an in-house script
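The "calculation of vertical sequencing coverage" step can be illustrated with a small sketch. It assumes the common definition of coverage as total sequenced bases divided by assembly length; the pipeline scripts may compute it differently. The helper names fastq_bases, fasta_length, and coverage are hypothetical, and the FASTQ input is assumed to be uncompressed with four lines per record.

```shell
# Total number of bases in an uncompressed FASTQ file
# (the sequence is the 2nd line of each 4-line record).
fastq_bases() {
    awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }' "$1"
}

# Total length of all sequences in a FASTA file (header lines start with ">").
fasta_length() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Coverage = sequenced bases / assembly length, printed with one decimal.
# Prints "NA" when the assembly length is zero.
coverage() {
    awk -v bases="$1" -v len="$2" \
        'BEGIN { if (len > 0) printf "%.1f\n", bases / len; else print "NA" }'
}
```

With a hypothetical reads file and assembly, the three helpers combine as coverage "$(fastq_bases reads.fastq)" "$(fasta_length assembly.fsa)". For gzipped reads, decompress to a temporary file first.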

The Long-Reads Genome Assembly Workflow

Sequencing reads directory and files
- Check local read files
Raw reads quality assessment
- NanoPlot
Raw reads trimming and estimation of genome size
- Chopper
- Estimation of genome size (KMC and GenomeScope)
Trimmed reads quality assessment
- NanoPlot
De novo assembly
- Flye
Organization of de novo assembly files
Assembly quality assessment
- CheckM2
- GUNC
- QUAST
- Barrnap
- Calculation of vertical sequencing coverage
Taxonomic assignment
- GTDB-Tk
Plasmid identification
- MOB-suite
Assignment of contigs to molecules
- MOB-suite and an in-house script

The Hybrid Genome Assembly Workflow

Adding the chopper and fastp files
- Adding the input files
De novo assembly
- Unicycler
Organization of de novo assembly files
Assembly quality assessment
- CheckM2
- GUNC
- QUAST
- Barrnap
- Calculation of vertical sequencing coverage
Taxonomic assignment
- GTDB-Tk
Plasmid identification
- MOB-suite
Assignment of contigs to molecules
- MOB-suite and an in-house script

The Genome Annotation Workflow

Genome Annotation
- Prokka
Annotation of virulence and resistance genes (VaRG)
- AMRFinderPlus
- PanViTa
- RGI
Annotation of mobile genetic elements (MGE)
- CRISPRcasFinder
- VIBRANT
Annotation of molecular typing (Moltyping)
- FastMLST
Functional annotation (FA)
- EggNOG-mapper
- COG classifier
- dbCAN
