Fitness Mapper: A Semi-Automated Pipeline for DMS Heatmap Generation and Tolerance Analysis

Pipeline to process data from Deep Mutational Scanning experiments.

If you use these scripts, please cite Giacometti et al., biorXiv, 2026: "Interactions of outer membrane lipoproteins E. coli PqiC and P. aeruginosa PA3214 with their MCE protein binding partners, PqiB and PA3213".

This is an semi-automation and adaptation of the pipeline and codes used in DOI: 10.1016/j.jbc.2023.104744 and DOI 10.1038/s41586-025-09990-0.

This project is provided "as is" without warranty of any kind. Use of this software is at your own risk.

Data preparation

Clone this github repository on your machine and create a folder named "raw". The raw folder must contain the following files:

sequence.txt: amino acid sequence of the protein (can end with a star for the last codon). Must be on a single line and should not include the primer.
sequence_<LibID>.txt: single line amino acid sequence for each library - Note: sequence.txt must be an exact concatenation of all of the sequence_<LibID>.txt files. Must be a single line
<LibID>.fa: nucleic acid sequence for each library. This should include the sequence for the primers used when amplifying the DNA sequence, if applicable.
start_codons_<LibID>.txt: first two start codon for each library. Must be on 2 lines, 1 for each 3-letter codon. Primers should not be included in the amino acid sequence.
parameters.sh: bash-like file to declare different variables:
- nTask: max number of CPUs which can be used
- libraries: the names of the libaries. Must match the <LibID> names used in the different filenames
- top_strands_Fwd, top_strands_Rev, bottom_strands_Rev, bottom_strands_Fwd: top and bottom strands, Fwd and Rev for each of the 4 libraries. The order in each list must match the order in the libraries variable
- qThresh: threshold to use on the QMAP score (can be modified after STEP) Example:

#!/bin/bash
nTask=10
declare -a libraries=( LibA LibB LibC LibD )
declare -a top_strands_Fwd=( gtttaactttaagaaggagatatacat cgtattgacgccaacgttatgcaa aacgacggtgacgttgtcacatctt ggttgcaggaatcgtgtttatcgcc )
declare -a top_strands_Rev=( ggcatctggcaaatgaccaaacag aatgtcgaagaactgtgggagcgat agtattctggtcccgtttactaaag tttcaggagaaccgacg )
declare -a bottom_strands_Rev=( ctgtttggtcatttgccagatgcc atcgctcccacagttcttcgacatt ctttagtaaacgggaccagaatact cgtcggttctcctgaaa )
declare -a bottom_strands_Fwd=( atgtatatctccttcttaaagttaaac ttgcataacgttggcgtcaatacg aagatgtgacaacgtcaccgtcgtt ggcgataaacacgattcctgcaacc )
qThreshold=42

Two sub-folders names Selected and Unselected with the fastq file generated by the DMS experiments. File names must contain the matching <LibID> tags above. This tag must come before the R1/R2 read ID.

Note: Use plain text and avoid Rich Text to avoid hidden formatting characters.

Processing

The processing is split into 13 steps, grouped into 5 runs. The runs are bash scripts, some of them calling python scripts. Some have been written to be run on SLURM cluster and are called via sbatch. They would need to be modified if runs are not meant to be run on a cluster.

RUN01 - Pre-processing steps

The first group of steps is run via this command:

./RUN01.sh

It completes the steps described below.

STEP 01:

Prepare the output folders. It should create two folders named Selected and Unselected and copy files needed inside each of them, and run bowtie2-build to build a Bowtie 2 index from the fasta files.

Log file for this step and all subsequent steps are saved in the (Un)selected/log sub-folders.

STEP 02:

Align the sequencing reads from the FASTQ files to the indexed reference genome, and save the results to a SAM file. The code used here assumes a SLURM-managed cluster is used and is submitting 1 job per library per output folder. It uses the bowtie2 -p $1 -x $2 $3 function.

Action after RUN01:

Make sure all jobs submitted to the SLURM queue are done before moving to the next step (it should take about a couple of minutes per job on 10 CPUs), and check *sam files were created in the two output folders. Then, run this command:

./RUN01_analysis.sh

and check the plots saved in the STEP02_bowtie2_plots folder. You should have 4 plots for each library summarizing the output of that run.

Single mate alignment	Concordant alignment	Discordant alignment	Overall aligmment

RUN02 - Reads filtering and quantify confidence in sequencing reads

Run this command:

./RUN02.sh

It completes the steps described below.

STEP 03:

Convert SAM to BAM files (for space and speed efficiency) using the samtools view -Sb tool.

STEP 04:

Filter BAM files to keep only the reads properly paired and save to a new BAM file using the samtools view -b -f 2 tool.

STEP 05:

Calculate the frequency distribution of mapping quality scores, parsing the outputs of the samtools view tool.

Action after RUN02:

Each of the previous steps will save corresponding files (labeled with the ID of the step it comes from) and associated log files. To analyze those outputs, make sure python/3.12.2 is loaded and run this command:

python py_RUN02_STEP05_analysis.py

For each library, plots summarize the read alignment (for Selected and Unselected) in sub-folder STEP04_paired_reads, and another plot in sub-folder STEP05_mapq_plots shows the MAPQ score distributions.
Based on this later plot, adjust if needed the qThresh variable in the raw/parameters.sh file; this value is used in the next step to threshold based on that MAPQ score.

Selected reads alignment	Unselected reads alignment	MAPQ scores

RUN 03 - Data filtering and cleaning

Run this command:

./RUN03.sh

It completes the steps described below.

STEP 06:

Filter out all alignment reads that have a mapping quality (MAPQ) below the specified threshold qThresh using the samtools view -b -q tool.

STEP 07:

Extract the sequence and quality information for Read 1 and Read 2, and write to FASTQ format using the bedtools bamtofastq tool.

STEP 08:

Assemble read 1 and 2 into longer, contiguous sequences by identifying and merging their overlapping regions using pandaseq -B -f $1 -r $2 tool.

STEP 09:

Remove primer sequences from sequencing reads (adapters were already removed from the data received). The code used here assumes a SLURM cluster-managed is used and is submitting 1 job per library per output folder. It uses the cutadapt tool.

Action after RUN03:

Make sure all jobs submitted to the SLURM queue are done before moving to the next step (it may take up to 5 minutes per job on 10 CPUs).

Load a python/3.12.2 environment and run

python py_RUN03_STEP09_analysis.py

Plots are saved in the STEP09_cutadapt_plots subfolder. For each library and Selected/Unselected case, 4 plots are generated showing the distribution of error counts after trimming of the primer segments as well as the expected random matches.

Distribution of error counts (Primer 3, Selected)	Expected random matches (Selected)

RUN 04 - Identify mutations

Run this command:

./RUN04.sh

It completes the steps described below.

STEP 10:

Remove sequences with residue N (unknown / ambiguous nucleotide).

STEP 11:

Count single amino acid variants.

Action after RUN04:

Check the plot saved in the STEP11_adapter_removed folder.

Proportion of sequences filtered by STEP 11	Length distribution of input sequences	Distribution of mutation counts per sequence

Also, a new folder named merged_results with files to use for the final analysis is generated.

RUN 05 - Final normalizations and analysis

This last step will generate the final heatmap and tolerance scores using the files in the merged_results directory. If there are different DMS experiments that need to be averaged, the diflog* files from all outputs will need to be copied or symlinked in a single merged_results folder.

Run this command:

./RUN05.sh

It completes the steps described below.

STEP 12:

Calculate log-transformed fitness scores for amino acid variants by normalizing raw counts by total reads, computing enrichment ratios between selected and unselected conditions, and then subtracting the wild-type fitness score at each codon position.

STEP13:

Compute the final tolerance scores as such:

Analyze input files and identify missing data (NaN, where the raw selected count was 0; Inf when the raw unselected count was 0; Err when both of them were 0 for any given potential mutation)
if several DMS experiments or libraries need to be combined, average their score (if for any given position only 1 experiment has missing data, the data from the other experiments will be averaged)
Compute the modified z-score to identify outliers:
- Using $df\_div$ as the original matrix (sequence in x, mutations in y), and $MedValue$ as the median value of that matrix, compute the MAD (Median Absolute Deviation) using: $$df\_avg = abs( df\_div - MedValue)$$
- Using $MedValueAbs$ as the median value of this $df\_avg$ output, and a factor scale of $z\_Score\_Factor = 0.6745$, compute the modified z-score using: $$df\_zscore = z\_Score\_Factor * (df\_div - MedValue) / MedValueAbs$$
Compute final tolerance scores based on the number and types of amino acid substitutions that are tolerated (using a modified version of the Zvelebil similarity score, Zvelebil et al. 1987, Dewachter et al. 2023):
- For each residue in the sequence and each possible mutation associated, assign a score using the Adjustment Matrix. Mutations expected to have large impact have higher scores.
- Compute for each residue in the sequence the maximum potential score $Tmax$ (sum of associated scores from the Adjustment Matrix, Prediction of functionally important residues
- Compute for each residue in the sequence the real score $Treal$ (sum of associated scores from the Adjustment Matrix for mutations where the z-score is above a certain threshold; we investigate 4 thresholds, -4, -3.5, -3.0 and -2.5
- The final fitness score is the ratio between the score of all mutations with a statistically significant z-score, and the maximum score achievable, $Treal / Tmax$

Outputs generated are saved in the STEP13_final_normalization folder and include:

Raw Fitness scores distribution	Comparison of Tolerance Scores for different thresholds	Tolerance Score distributions histograms

Final normalized tolerance for any given threshold (3.5 in this example)

DMS heatmap
X-axis: residue positions in the WT; Y-axis: all possible amino acid substitutions. Each square represents the average fitness cost of an individual mutation relative to the WT sequence. Mutations that decrease fitness relative to the WT are shown in shades of magenta, while mutations that increase fitness are in shades of green. At the bottom, the final tolerance scores for the different thresholds on the z-score are shown in shades from red(0), yellow to gray (1).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fitness Mapper: A Semi-Automated Pipeline for DMS Heatmap Generation and Tolerance Analysis

Data preparation

Processing

RUN01 - Pre-processing steps

STEP 01:

STEP 02:

Action after RUN01:

RUN02 - Reads filtering and quantify confidence in sequencing reads

STEP 03:

STEP 04:

STEP 05:

Action after RUN02:

RUN 03 - Data filtering and cleaning

STEP 06:

STEP 07:

STEP 08:

STEP 09:

Action after RUN03:

RUN 04 - Identify mutations

STEP 10:

STEP 11:

Action after RUN04:

RUN 05 - Final normalizations and analysis

STEP 12:

STEP13:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
STEP02_bowtie2_plots		STEP02_bowtie2_plots
STEP04_paired_reads		STEP04_paired_reads
STEP05_mapq_plots		STEP05_mapq_plots
STEP09_cutadapt_plots		STEP09_cutadapt_plots
STEP11_adapter_removed		STEP11_adapter_removed
STEP13_final_normalization		STEP13_final_normalization
LICENSE		LICENSE
README.md		README.md
RUN01.sh		RUN01.sh
RUN01_analysis.sh		RUN01_analysis.sh
RUN02.sh		RUN02.sh
RUN03.sh		RUN03.sh
RUN04.sh		RUN04.sh
RUN05.sh		RUN05.sh
py_RUN01_STEP02_analysis.py		py_RUN01_STEP02_analysis.py
py_RUN02_STEP05_analysis.py		py_RUN02_STEP05_analysis.py
py_RUN03_STEP09_analysis.py		py_RUN03_STEP09_analysis.py
py_RUN04_STEP10_removeN.py		py_RUN04_STEP10_removeN.py
py_RUN04_STEP11_uniq.py		py_RUN04_STEP11_uniq.py
py_RUN05_STEP12_logfitness.py		py_RUN05_STEP12_logfitness.py
py_RUN05_STEP13_final.py		py_RUN05_STEP13_final.py
sb_RUN01_STEP02_base.sh		sb_RUN01_STEP02_base.sh
sb_RUN03_STEP09_base.sh		sb_RUN03_STEP09_base.sh

Folders and files

Latest commit

History

Repository files navigation

Fitness Mapper: A Semi-Automated Pipeline for DMS Heatmap Generation and Tolerance Analysis

Data preparation

Processing

RUN01 - Pre-processing steps

STEP 01:

STEP 02:

Action after RUN01:

RUN02 - Reads filtering and quantify confidence in sequencing reads

STEP 03:

STEP 04:

STEP 05:

Action after RUN02:

RUN 03 - Data filtering and cleaning

STEP 06:

STEP 07:

STEP 08:

STEP 09:

Action after RUN03:

RUN 04 - Identify mutations

STEP 10:

STEP 11:

Action after RUN04:

RUN 05 - Final normalizations and analysis

STEP 12:

STEP13:

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages