Implement new private_api/taxa/unique_peptides endpoint #99

Closed
pverscha wants to merge 2 commits into develop from worktree-feature-unique-peptides

@pverscha pverscha commented Jun 5, 2026

Add /private_api/taxa/unique_peptides endpoint

Summary

  • Adds a new endpoint at /private_api/taxa/unique_peptides (also served at /private_api/taxa/unique_peptides.json) that accepts GET and POST requests and computes taxon-unique peptides for a given strain or species.
  • Adds get_proteins_for_taxon() to the database crate: an exact term-query on taxon_id with search_after pagination to correctly handle taxa with more than 10,000 proteins.
  • Adds fancy-regex dependency to support lookahead assertions in user-supplied cleavage patterns (the standard regex crate does not support lookaheads).
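
As a rough illustration of the search_after pagination mentioned above: OpenSearch caps from/size paging at 10,000 hits by default, so a taxon with more proteins has to be read page by page, each request continuing after the sort key of the last hit seen. The sketch below uses only the standard library; the Hit type, fetch_all, and the fetch_page closure are hypothetical stand-ins for the real OpenSearch request and are not the database crate's actual API.

```rust
/// One hit from a sorted search. `sort_key` is the value the next request
/// passes as `search_after`. (Hypothetical type, not the crate's own.)
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub sort_key: u64,
    pub accession: String,
}

/// Collects every hit by repeatedly requesting the page that follows the
/// last sort key seen. `fetch_page(search_after, page_size)` stands in for
/// the real OpenSearch query (assumption: results are sorted by `sort_key`).
pub fn fetch_all<F>(page_size: usize, mut fetch_page: F) -> Vec<Hit>
where
    F: FnMut(Option<u64>, usize) -> Vec<Hit>,
{
    let mut all = Vec::new();
    let mut search_after: Option<u64> = None;
    loop {
        let page = fetch_page(search_after, page_size);
        // An empty or short page means the result set is exhausted.
        let exhausted = page.is_empty() || page.len() < page_size;
        if let Some(last) = page.last() {
            search_after = Some(last.sort_key);
        }
        all.extend(page);
        if exhausted {
            break;
        }
    }
    all
}
```

Because the loop only stops on an empty or short page, a result set whose size is an exact multiple of the page size costs one extra (empty) request.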

Endpoint contract

Request parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| taxon_id | u32 | yes | | NCBI taxon ID; must be at species or strain rank |
| cleavage_regex | string | no | `[KR](?!P)` | Regex matching cleavage sites; split occurs after each match (standard tryptic convention) |
| min_length | usize | no | 5 | Minimum peptide length in amino acids |
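
A minimal sketch of how the optional parameters might fall back to the defaults listed above. The UniquePeptidesParams struct and its with_defaults constructor are hypothetical names chosen for illustration; the real handler may deserialize the request differently.

```rust
/// Defaults taken from the request-parameter table.
pub const DEFAULT_CLEAVAGE_REGEX: &str = "[KR](?!P)";
pub const DEFAULT_MIN_LENGTH: usize = 5;

/// Hypothetical container for the resolved request parameters.
#[derive(Debug, Clone, PartialEq)]
pub struct UniquePeptidesParams {
    pub taxon_id: u32,
    pub cleavage_regex: String,
    pub min_length: usize,
}

impl UniquePeptidesParams {
    /// Fills in the documented default for every optional parameter the
    /// caller left out. `taxon_id` is required, so it has no default.
    pub fn with_defaults(
        taxon_id: u32,
        cleavage_regex: Option<String>,
        min_length: Option<usize>,
    ) -> Self {
        Self {
            taxon_id,
            cleavage_regex: cleavage_regex
                .unwrap_or_else(|| DEFAULT_CLEAVAGE_REGEX.to_string()),
            min_length: min_length.unwrap_or(DEFAULT_MIN_LENGTH),
        }
    }
}
```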

Response

{
  "unique_peptides": ["AAFEDLQSLQDK", "NLFVAKNLR"],
  "total_peptides": 4821,
  "total_unique_peptides": 312
}

total_peptides is the count of deduplicated peptides after digestion and length filtering. total_unique_peptides equals unique_peptides.length.

Error cases:

  • 400 — invalid cleavage_regex
  • 400 — taxon_id not found in the taxon store
  • 400 — taxon_id is not at species or strain rank (message includes the actual rank)
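
The two taxon-related error cases could be modelled roughly as follows. This is a sketch under assumptions: the Rank and RequestError enums, the HashMap used as a taxon store, and the message wording are all hypothetical, and the regex check is left out because it depends on the fancy-regex crate.

```rust
use std::collections::HashMap;
use std::fmt;

/// A small subset of taxonomic ranks, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Rank {
    Genus,
    Species,
    Strain,
}

impl fmt::Display for Rank {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Rank::Genus => "genus",
            Rank::Species => "species",
            Rank::Strain => "strain",
        };
        write!(f, "{}", name)
    }
}

/// Hypothetical error type; both variants map to HTTP 400.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    TaxonNotFound(u32),
    UnsupportedRank { taxon_id: u32, rank: Rank },
}

impl RequestError {
    pub fn status(&self) -> u16 {
        400
    }

    /// The rank error names the actual rank, as the error list requires.
    pub fn message(&self) -> String {
        match self {
            RequestError::TaxonNotFound(id) => format!("taxon {} not found", id),
            RequestError::UnsupportedRank { taxon_id, rank } => format!(
                "taxon {} has rank {}; expected species or strain",
                taxon_id, rank
            ),
        }
    }
}

/// Looks the taxon up and accepts it only at species or strain rank.
pub fn validate_taxon(
    store: &HashMap<u32, Rank>,
    taxon_id: u32,
) -> Result<Rank, RequestError> {
    match store.get(&taxon_id) {
        None => Err(RequestError::TaxonNotFound(taxon_id)),
        Some(&rank @ (Rank::Species | Rank::Strain)) => Ok(rank),
        Some(&rank) => Err(RequestError::UnsupportedRank { taxon_id, rank }),
    }
}
```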

Algorithm:

  1. Validate regex and taxon rank.
  2. Retrieve all proteins for the taxon from OpenSearch (get_proteins_for_taxon).
  3. Cleave each protein sequence at regex match boundaries (matched characters stay with the preceding fragment).
  4. Filter by min_length and deduplicate.
  5. Search the suffix array index for each candidate peptide (tryptic=false, equate_il=false).
  6. A peptide is unique if every matching protein in the index belongs to the target taxon and the result cutoff was not reached.
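
Steps 3, 4 and 6 can be sketched with the standard library alone. To stay dependency-free, the example hardcodes the default rule [KR](?!P) (cut after K or R unless the next residue is P) instead of evaluating a user-supplied pattern with fancy-regex. The SearchResult struct is a hypothetical stand-in for whatever the suffix array index returns.

```rust
use std::collections::BTreeSet;

/// Step 3: splits after every K or R that is not followed by P, so the
/// matched residue stays with the preceding fragment.
pub fn cleave_tryptic(sequence: &str) -> Vec<&str> {
    let bytes = sequence.as_bytes();
    let mut fragments = Vec::new();
    let mut start = 0;
    for (i, &b) in bytes.iter().enumerate() {
        let is_site = matches!(b, b'K' | b'R');
        let next_is_proline = bytes.get(i + 1) == Some(&b'P');
        if is_site && !next_is_proline {
            fragments.push(&sequence[start..=i]);
            start = i + 1;
        }
    }
    // Keep the trailing fragment when the sequence does not end on a site.
    if start < bytes.len() {
        fragments.push(&sequence[start..]);
    }
    fragments
}

/// Step 4: digests every protein, drops fragments shorter than
/// `min_length`, and deduplicates. Byte length equals residue count here
/// because protein sequences are assumed to be ASCII.
pub fn candidate_peptides(proteins: &[&str], min_length: usize) -> BTreeSet<String> {
    proteins
        .iter()
        .copied()
        .flat_map(cleave_tryptic)
        .filter(|peptide| peptide.len() >= min_length)
        .map(str::to_owned)
        .collect()
}

/// Hypothetical shape of an index search result for one peptide.
pub struct SearchResult {
    /// Taxon ID of every protein the peptide occurs in.
    pub taxon_ids: Vec<u32>,
    /// True when the index stopped collecting matches early.
    pub cutoff_reached: bool,
}

/// Step 6: unique only if the match list is complete and every matching
/// protein belongs to the target taxon.
pub fn is_unique(result: &SearchResult, target_taxon: u32) -> bool {
    !result.cutoff_reached && result.taxon_ids.iter().all(|&t| t == target_taxon)
}
```

Treating a reached cutoff as "not unique" is the conservative choice: once the match list is truncated, a protein from another taxon may be among the matches that were dropped.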

@pverscha pverscha added the enhancement New feature or request label Jun 5, 2026
@pverscha pverscha changed the title Implement new private_api endpoint that computes the unique peptides … Implement new private_api/taxa/unique_peptides endpoint Jun 5, 2026
@pverscha pverscha changed the title Implement new private_api/taxa/unique_peptides endpoint Implement new private_api/taxa/unique_peptides endpoint Jun 5, 2026
@pverscha

Merged with #101

@pverscha pverscha closed this Jun 10, 2026