Pipeline to process data from Deep Mutational Scanning experiments.
If you use these scripts, please cite Giacometti et al., biorXiv, 2026: "Interactions of outer membrane lipoproteins E. coli PqiC and P. aeruginosa PA3214 with their MCE protein binding partners, PqiB and PA3213".
This is an semi-automation and adaptation of the pipeline and codes used in DOI: 10.1016/j.jbc.2023.104744 and DOI 10.1038/s41586-025-09990-0.
This project is provided "as is" without warranty of any kind. Use of this software is at your own risk.
Clone this github repository on your machine and create a folder named "raw". The raw folder must contain the following files:
sequence.txt: amino acid sequence of the protein (can end with a star for the last codon). Must be on a single line and should not include the primer.sequence_<LibID>.txt: single line amino acid sequence for each library - Note:sequence.txtmust be an exact concatenation of all of thesequence_<LibID>.txtfiles. Must be a single line<LibID>.fa: nucleic acid sequence for each library. This should include the sequence for the primers used when amplifying the DNA sequence, if applicable.start_codons_<LibID>.txt: first two start codon for each library. Must be on 2 lines, 1 for each 3-letter codon. Primers should not be included in the amino acid sequence.parameters.sh: bash-like file to declare different variables:nTask: max number of CPUs which can be usedlibraries: the names of the libaries. Must match the<LibID>names used in the different filenamestop_strands_Fwd,top_strands_Rev,bottom_strands_Rev,bottom_strands_Fwd: top and bottom strands, Fwd and Rev for each of the 4 libraries. The order in each list must match the order in thelibrariesvariableqThresh: threshold to use on the QMAP score (can be modified after STEP) Example:
#!/bin/bash
nTask=10
declare -a libraries=( LibA LibB LibC LibD )
declare -a top_strands_Fwd=( gtttaactttaagaaggagatatacat cgtattgacgccaacgttatgcaa aacgacggtgacgttgtcacatctt ggttgcaggaatcgtgtttatcgcc )
declare -a top_strands_Rev=( ggcatctggcaaatgaccaaacag aatgtcgaagaactgtgggagcgat agtattctggtcccgtttactaaag tttcaggagaaccgacg )
declare -a bottom_strands_Rev=( ctgtttggtcatttgccagatgcc atcgctcccacagttcttcgacatt ctttagtaaacgggaccagaatact cgtcggttctcctgaaa )
declare -a bottom_strands_Fwd=( atgtatatctccttcttaaagttaaac ttgcataacgttggcgtcaatacg aagatgtgacaacgtcaccgtcgtt ggcgataaacacgattcctgcaacc )
qThreshold=42
- Two sub-folders names
SelectedandUnselectedwith thefastqfile generated by the DMS experiments. File names must contain the matching<LibID>tags above. This tag must come before theR1/R2read ID.
Note: Use plain text and avoid Rich Text to avoid hidden formatting characters.
The processing is split into 13 steps, grouped into 5 runs. The runs are bash scripts, some of them calling python scripts. Some have been written to be run on SLURM cluster and are called via sbatch. They would need to be modified if runs are not meant to be run on a cluster.
The first group of steps is run via this command:
./RUN01.shIt completes the steps described below.
Prepare the output folders. It should create two folders named Selected and Unselected and copy files needed inside each of them, and run bowtie2-build to build a Bowtie 2 index from the fasta files.
Log file for this step and all subsequent steps are saved in the (Un)selected/log sub-folders.
Align the sequencing reads from the FASTQ files to the indexed reference genome, and save the results to a SAM file. The code used here assumes a SLURM-managed cluster is used and is submitting 1 job per library per output folder. It uses the bowtie2 -p $1 -x $2 $3 function.
Make sure all jobs submitted to the SLURM queue are done before moving to the next step (it should take about a couple of minutes per job on 10 CPUs), and check *sam files were created in the two output folders.
Then, run this command:
./RUN01_analysis.shand check the plots saved in the STEP02_bowtie2_plots folder. You should have 4 plots for each library summarizing the output of that run.
| Single mate alignment | Concordant alignment | Discordant alignment | Overall aligmment |
|---|---|---|---|
![]() |
![]() |
![]() |
![]() |
Run this command:
./RUN02.shIt completes the steps described below.
Convert SAM to BAM files (for space and speed efficiency) using the samtools view -Sb tool.
Filter BAM files to keep only the reads properly paired and save to a new BAM file using the samtools view -b -f 2 tool.
Calculate the frequency distribution of mapping quality scores, parsing the outputs of the samtools view tool.
Each of the previous steps will save corresponding files (labeled with the ID of the step it comes from) and associated log files. To analyze those outputs, make sure python/3.12.2 is loaded and run this command:
python py_RUN02_STEP05_analysis.pyFor each library, plots summarize the read alignment (for Selected and Unselected) in sub-folder STEP04_paired_reads, and another plot in sub-folder STEP05_mapq_plots shows the MAPQ score distributions.
Based on this later plot, adjust if needed the qThresh variable in the raw/parameters.sh file; this value is used in the next step to threshold based on that MAPQ score.
| Selected reads alignment | Unselected reads alignment | MAPQ scores |
|---|---|---|
![]() |
![]() |
![]() |
Run this command:
./RUN03.shIt completes the steps described below.
Filter out all alignment reads that have a mapping quality (MAPQ) below the specified threshold qThresh using the samtools view -b -q tool.
Extract the sequence and quality information for Read 1 and Read 2, and write to FASTQ format using the bedtools bamtofastq tool.
Assemble read 1 and 2 into longer, contiguous sequences by identifying and merging their overlapping regions using pandaseq -B -f $1 -r $2 tool.
Remove primer sequences from sequencing reads (adapters were already removed from the data received). The code used here assumes a SLURM cluster-managed is used and is submitting 1 job per library per output folder. It uses the cutadapt tool.
Make sure all jobs submitted to the SLURM queue are done before moving to the next step (it may take up to 5 minutes per job on 10 CPUs).
Load a python/3.12.2 environment and run
python py_RUN03_STEP09_analysis.pyPlots are saved in the STEP09_cutadapt_plots subfolder. For each library and Selected/Unselected case, 4 plots are generated showing the distribution of error counts after trimming of the primer segments as well as the expected random matches.
| Distribution of error counts (Primer 3, Selected) | Expected random matches (Selected) |
|---|---|
![]() |
![]() |
Run this command:
./RUN04.shIt completes the steps described below.
Remove sequences with residue N (unknown / ambiguous nucleotide).
Count single amino acid variants.
Check the plot saved in the STEP11_adapter_removed folder.
| Proportion of sequences filtered by STEP 11 | Length distribution of input sequences | Distribution of mutation counts per sequence |
|---|---|---|
![]() |
![]() |
![]() |
Also, a new folder named merged_results with files to use for the final analysis is generated.
This last step will generate the final heatmap and tolerance scores using the files in the merged_results directory. If there are different DMS experiments that need to be averaged, the diflog* files from all outputs will need to be copied or symlinked in a single merged_results folder.
Run this command:
./RUN05.shIt completes the steps described below.
Calculate log-transformed fitness scores for amino acid variants by normalizing raw counts by total reads, computing enrichment ratios between selected and unselected conditions, and then subtracting the wild-type fitness score at each codon position.
Compute the final tolerance scores as such:
- Analyze input files and identify missing data (NaN, where the raw selected count was 0; Inf when the raw unselected count was 0; Err when both of them were 0 for any given potential mutation)
- if several DMS experiments or libraries need to be combined, average their score (if for any given position only 1 experiment has missing data, the data from the other experiments will be averaged)
- Compute the modified z-score to identify outliers:
- Using
$df\_div$ as the original matrix (sequence in x, mutations in y), and$MedValue$ as the median value of that matrix, compute the MAD (Median Absolute Deviation) using:$$df\_avg = abs( df\_div - MedValue)$$ - Using
$MedValueAbs$ as the median value of this$df\_avg$ output, and a factor scale of$z\_Score\_Factor = 0.6745$ , compute the modified z-score using:$$df\_zscore = z\_Score\_Factor * (df\_div - MedValue) / MedValueAbs$$
- Using
- Compute final tolerance scores based on the number and types of amino acid substitutions that are tolerated (using a modified version of the Zvelebil similarity score, Zvelebil et al. 1987, Dewachter et al. 2023):
- For each residue in the sequence and each possible mutation associated, assign a score using the Adjustment Matrix. Mutations expected to have large impact have higher scores.
- Compute for each residue in the sequence the maximum potential score
$Tmax$ (sum of associated scores from the Adjustment Matrix, Prediction of functionally important residues - Compute for each residue in the sequence the real score
$Treal$ (sum of associated scores from the Adjustment Matrix for mutations where the z-score is above a certain threshold; we investigate 4 thresholds, -4, -3.5, -3.0 and -2.5 - The final fitness score is the ratio between the score of all mutations with a statistically significant z-score, and the maximum score achievable,
$Treal / Tmax$
Outputs generated are saved in the STEP13_final_normalization folder and include:
| Raw Fitness scores distribution | Comparison of Tolerance Scores for different thresholds | Tolerance Score distributions histograms |
|---|---|---|
![]() |
![]() |
![]() |
| Final normalized tolerance for any given threshold (3.5 in this example) |
|---|
![]() |


















