Author: Daniel Marrama
PEPMatch is a high-performance Python tool designed to find short peptide sequences within a reference proteome or other large protein sets. It is optimized for speed and flexibility, supporting exact matches, searches with a defined number of residue substitutions (mismatches), and a "best match" mode to find the most likely hit.
As a competition to improve tool performance, we created a benchmarking framework with instructions here.
- Versatile Searching: Find exact matches, matches with a specified tolerance for mismatches, or the single best match for each query peptide.
- Discontinuous Epitope Support: Search for non-contiguous residues in the format "R377, Q408, Q432, ...".
- High Performance: Utilizes an efficient k-mer indexing strategy for rapid searching. The backend is powered by a C-based Hamming distance calculation for optimized mismatch detection.
- Optimized Preprocessing: Employs a two-step process. Proteomes are preprocessed once into a format optimized for the search type (SQLite for exact matching, Pickle for mismatching), making subsequent searches extremely fast.
- Parallel Processing: Built-in support for multicore processing to handle large query sets efficiently.
- Flexible I/O: Accepts queries from FASTA files or Python lists and can output results to multiple formats, including CSV, TSV, XLSX, JSON, or directly as a Polars DataFrame.
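A mismatch, as used in the feature list above, is a position where two equal-length sequences carry different residues, so counting mismatches is a Hamming distance calculation. The sketch below is a plain-Python illustration of that idea, not PEPMatch's C backend; the function names `hamming_distance` and `matches_within` are hypothetical, and the brute-force window scan stands in for the k-mer index that PEPMatch uses to avoid scanning every position.

```python
from __future__ import annotations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def matches_within(peptide: str, protein: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (start, mismatches) for every protein window within the tolerance.

    Illustrative brute-force scan; start positions are 0-based.
    """
    n = len(peptide)
    hits = []
    for start in range(len(protein) - n + 1):
        distance = hamming_distance(peptide, protein[start:start + n])
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


# One exact hit at offset 2 and one single-substitution hit at offset 8.
print(matches_within("ACDE", "MKACDEFGACDQ", max_mismatches=1))  # [(2, 0), (8, 1)]
```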
pip install pepmatch
PEPMatch operates using a two-step workflow:
- Preprocessing: First, the target proteome is processed into an indexed format. This step only needs to be performed once per proteome and k-mer size.
  PEPMatch uses SQLite databases for the speed of indexed lookups in exact matching and serialized Python objects (pickle) for the flexibility needed in mismatch searching.
- Matching: The user's query peptides are then searched against the preprocessed proteome.
This design ensures that the time-intensive task of parsing and indexing the proteome is separated from the search itself, allowing for rapid and repeated querying.
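The split between a one-time indexing step and a repeatable lookup step can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration only: the table layout, the in-memory database, and the names `split_kmers`, `build_index`, and `exact_matches` are assumptions made for the example and do not describe PEPMatch's actual schema or internals. It also assumes each query peptide is at least `k` residues long.

```python
from __future__ import annotations

import sqlite3


def split_kmers(sequence: str, k: int):
    """Yield (k-mer, 0-based start position) for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], i


def build_index(proteins: dict[str, str], k: int) -> sqlite3.Connection:
    """Step 1 (done once): store every k-mer with its protein ID and position."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kmers (kmer TEXT, protein_id TEXT, position INTEGER)")
    conn.executemany(
        "INSERT INTO kmers VALUES (?, ?, ?)",
        (
            (kmer, protein_id, position)
            for protein_id, sequence in proteins.items()
            for kmer, position in split_kmers(sequence, k)
        ),
    )
    conn.execute("CREATE INDEX idx_kmer ON kmers (kmer)")
    conn.commit()
    return conn


def exact_matches(conn: sqlite3.Connection, proteins: dict[str, str], peptide: str, k: int):
    """Step 2 (repeated): look up the peptide's first k-mer, then verify the rest."""
    rows = conn.execute(
        "SELECT protein_id, position FROM kmers WHERE kmer = ? "
        "ORDER BY protein_id, position",
        (peptide[:k],),
    )
    return [
        (protein_id, position)
        for protein_id, position in rows
        if proteins[protein_id][position:position + len(peptide)] == peptide
    ]


proteins = {"P1": "MKTAYIAKQR", "P2": "AYIAKQRLLS"}
index = build_index(proteins, k=5)                     # pay the indexing cost once
print(exact_matches(index, proteins, "AYIAKQ", k=5))   # [('P1', 3), ('P2', 0)]
```

Because the index is keyed on the k-mer, each query touches only the rows that share its first k-mer instead of scanning every protein, which is why the same preprocessed data can serve many searches cheaply.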
The tool provides two CLI commands: pepmatch-preprocess and pepmatch-match.
The pepmatch-preprocess command builds the necessary database from your proteome FASTA file.
- For exact matching (0 mismatches), use the sql format.
- For mismatch matching, use the pickle format.
# Preprocess for an exact match search using 5-mers
pepmatch-preprocess -p human.fasta -k 5 -f sql
# Preprocess for a mismatch search using 3-mers
pepmatch-preprocess -p human.fasta -k 3 -f pickle
- -p, --proteome (Required): Path to the proteome FASTA file.
- -k, --kmer_size (Required): The k-mer size to use for indexing.
- -f, --preprocess_format (Required): The format for the preprocessed database (sql or pickle).
- -n: A custom name for the proteome.
- -P: Path to the directory to save preprocessed files.
- -g: Path to a gene priority proteome file (a UniProt-specific file with one protein per gene, used to prioritize matches later).
The pepmatch-match command runs the search against a preprocessed proteome.
# Find exact matches (-m 0) using the preprocessed 5-mer database
pepmatch-match -q peptides.fasta -p human.fasta -m 0 -k 5
# Find matches with up to 3 mismatches (-m 3) using the 3-mer database
pepmatch-match -q neoepitopes.fasta -p human.fasta -m 3 -k 3
- -q, --query (Required): Path to the query peptide FASTA file.
- -p, --proteome_file (Required): Path to the original proteome FASTA file.
- -m: Maximum number of mismatches allowed (e.g., 0 for exact).
- -k: The k-mer size to use (must match the preprocessed file).
- -P: Path to the directory containing preprocessed files.
- -b: Enable "best match" mode.
- -f: Output format (csv, tsv, xlsx, json). Defaults to csv.
- -o: Name of the output file, without the file extension (e.g. omit .csv).
- -v: Disable sequence versioning (e.g. for protein ID P05067.1, the ".1" suffix is removed).
- -n: Number of parallel processing jobs (CPU cores) to use.
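The two commands have to agree with each other: the k-mer size passed to pepmatch-match must be the one used for preprocessing, and the preprocess format depends on whether mismatches are allowed. A small helper can keep the pair consistent when the CLI is driven from a script. The sketch below only assembles argument lists from the flags documented above; the helper name `pepmatch_commands` is hypothetical and is not part of PEPMatch.

```python
from __future__ import annotations

import shlex


def pepmatch_commands(
    proteome: str, query: str, max_mismatches: int, k: int
) -> tuple[list[str], list[str]]:
    """Build a matching pair of preprocess and match command lines."""
    # sql for exact matching (0 mismatches), pickle for mismatch matching.
    preprocess_format = "sql" if max_mismatches == 0 else "pickle"
    preprocess = [
        "pepmatch-preprocess", "-p", proteome, "-k", str(k), "-f", preprocess_format,
    ]
    match = [
        "pepmatch-match", "-q", query, "-p", proteome,
        "-m", str(max_mismatches), "-k", str(k),  # same k as the preprocess step
    ]
    return preprocess, match


for command in pepmatch_commands("human.fasta", "neoepitopes.fasta", max_mismatches=3, k=3):
    print(shlex.join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` once PEPMatch is installed; the example prints the commands rather than executing them.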
For more control and integration into other workflows, PEPMatch provides a simple Python API.
from pepmatch import Preprocessor, Matcher
# Preprocess the proteome into a SQLite DB for exact matching
Preprocessor('proteomes/human.fasta').sql_proteome(k=5)
# Initialize the Matcher for an exact search (0 mismatches)
matcher = Matcher(
    query='queries/mhc-ligands-test.fasta',
    proteome_file='proteomes/human.fasta',
    max_mismatches=0,
    k=5
)
# Run the search and get results
results_df = matcher.match()
from pepmatch import Preprocessor, Matcher
# Preprocess the proteome into pickle files for mismatching
Preprocessor('proteomes/human.fasta').pickle_proteome(k=3)
# Initialize the Matcher to allow up to 3 mismatches
matcher = Matcher(
    query='queries/neoepitopes-test.fasta',
    proteome_file='proteomes/human.fasta',
    max_mismatches=3,
    k=3
)
results_df = matcher.match()
The best_match mode automatically finds the optimal match for each peptide, trying different k-mer sizes and mismatch thresholds. No manual preprocessing is required.
from pepmatch import Matcher
matcher = Matcher(
    query='queries/milk-peptides-test.fasta',
    proteome_file='proteomes/human.fasta',
    best_match=True
)
results_df = matcher.match()
Use the ParallelMatcher class to run searches on multiple CPU cores. The n_jobs parameter specifies the number of cores to use.
from pepmatch import Preprocessor, ParallelMatcher
# Preprocessing is the same
Preprocessor('proteomes/betacoronaviruses.fasta').pickle_proteome(k=3)
# Use ParallelMatcher to search with 4 jobs
parallel_matcher = ParallelMatcher(
    query='queries/coronavirus-test.fasta',
    proteome_file='proteomes/betacoronaviruses.fasta',
    max_mismatches=3,
    k=3,
    n_jobs=4
)
results_df = parallel_matcher.match()
PEPMatch can search for epitopes defined by non-contiguous residues and their positions. Simply provide a query list where each item is a string in the format "A1, B10, C15".
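To make the query format concrete, the sketch below parses such a string into (residue, position) pairs and counts how many of them disagree with a given protein sequence. It illustrates the format only and is not PEPMatch's implementation; the names `parse_discontinuous` and `count_mismatches` are hypothetical, and positions are assumed to be 1-based, as in conventional protein numbering.

```python
from __future__ import annotations

import re

# One token is a single uppercase residue letter followed by its position.
_RESIDUE = re.compile(r"^([A-Z])(\d+)$")


def parse_discontinuous(epitope: str) -> list[tuple[str, int]]:
    """Turn "R377, Q408" into [("R", 377), ("Q", 408)]."""
    residues = []
    for token in epitope.split(","):
        token = token.strip()
        found = _RESIDUE.match(token)
        if found is None:
            raise ValueError(f"not a residue/position token: {token!r}")
        residues.append((found.group(1), int(found.group(2))))
    return residues


def count_mismatches(residues: list[tuple[str, int]], protein: str) -> int:
    """Count residues that differ from the protein at their 1-based positions.

    A position beyond the end of the protein is counted as a mismatch.
    """
    mismatches = 0
    for residue, position in residues:
        if position > len(protein) or protein[position - 1] != residue:
            mismatches += 1
    return mismatches


print(parse_discontinuous("R377, Q408, Q432, H433, F436"))
print(count_mismatches(parse_discontinuous("A1, C3, E5"), "ABXDE"))  # 1
```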
from pepmatch import Matcher
# A list of discontinuous epitopes to find
discontinuous_query = [
    "R377, Q408, Q432, H433, F436",
    "S2760, V2763, E2773, D2805, T2819"
]
matcher = Matcher(
    query=discontinuous_query,
    proteome_file='proteomes/sars-cov-2.fasta',
    max_mismatches=1  # Allow 1 mismatch among the specified residues
)
results_df = matcher.match()
You can specify the output format using the output_format parameter in the Matcher or ParallelMatcher.
- dataframe (default for API): Returns a Polars DataFrame.
- csv (default for CLI): Saves results to a CSV file.
- tsv: Saves results to a TSV file.
- xlsx: Saves results to an Excel file.
- json: Saves results to a JSON file.
To receive a DataFrame from the API, you can either omit the output_format parameter or set it explicitly:
# The match() method will return a Polars DataFrame
df = Matcher(
    'queries/neoepitopes-test.fasta',
    'proteomes/human.fasta',
    max_mismatches=3,
    k=3,
    output_format='dataframe'  # Explicitly request a DataFrame
).match()
print(df.head())
If you use PEPMatch in your research, please cite the following paper:
Marrama D, Chronister WD, Westernberg L, et al. PEPMatch: a tool to identify short peptide sequence matches in large sets of proteins. BMC Bioinformatics. 2023;24(1):485. Published 2023 Dec 18. doi:10.1186/s12859-023-05606-4
