EnzyTransfer is a collection of Python scripts designed to standardize heterogeneous enzyme kinetics datasets (including SABIO-RK, BRENDA, Rhea, SKiD, and RetroBioCat) into a unified EnzymeML-style YAML/JSON format.
The core philosophy of EnzyTransfer is to decouple data sources from data structure:
- Raw Data: Stored in `data/`, organized by source folders.
- Processing: Source-specific pipelines reside in `data_sources/`.
- Schema: All converters reference a shared schema (e.g., `schemas/enzymeml-v2-extended.yaml`) to produce a compatible EnzymeML-like structure.
- Result: Downstream tools can treat all data sources uniformly, regardless of their origin.
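For orientation, here is a minimal sketch of what one standardized record could look like, built with `pyyaml`. The field names and values below are illustrative assumptions; the authoritative structure is whatever `schemas/enzymeml-v2-extended.yaml` defines.

```python
import yaml

# Hypothetical EnzymeML-style record; real field names and nesting are
# dictated by schemas/enzymeml-v2-extended.yaml, not by this sketch.
record = {
    "proteins": [{"name": "alcohol dehydrogenase", "uniprot_id": "P00330"}],
    "small_molecules": [{"name": "ethanol"}],
    "reactions": [{"name": "ethanol oxidation"}],
    "parameters": [{"name": "kcat", "value": 340.0, "unit": "1/s"}],
}
print(yaml.safe_dump(record, sort_keys=False))
```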
- Multi-source Standardization: Converters for SABIO-RK, BRENDA, Rhea, SKiD, and RetroBioCat.
- Schema-driven Design: Uses a central extended EnzymeML schema for proteins, small molecules, reactions, kinetic parameters, and measurements.
- Mutation & Sequence Enrichment: Tools to fill protein sequences, parse mutation strings, fetch UniProt sequences, and classify wildtype vs. mutant records.
- Reference Utilities: Includes DOI-to-PMID converters and RetroBioCat YAML filtering tools.
- Merge & Deduplicate: Parallel sequence-based comparison and merging across different data sources.
Clone the repository and set up a Python environment:
```bash
git clone https://github.com/Xukai-YE/enzytransfer.git
cd enzytransfer

# optional but recommended
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# install core dependencies
pip install pandas pyyaml openpyxl requests

# optional: only if you use the LLM header standardizer
pip install openai
```
```
├── data/                          # Raw input data (local, usually ignored by git)
│   ├── Sabio_rk/
│   ├── BRENDA/
│   ├── Rhea/
│   ├── SKID/
│   └── RetroBioCat/
├── data_sources/                  # Source-specific and post-processing scripts
│   ├── BRENDA/
│   │   ├── step0_llm.py           # Universal LLM-based header standardizer
│   │   └── step1_join.py          # BRENDA → EnzymeML (unified format)
│   ├── sabio_rk/
│   │   ├── step0_header.py        # (Optional) SABIO header helper
│   │   └── step1_extract.py       # SABIO-RK → EnzymeML
│   ├── Rhea/
│   │   └── step1_join_modified.py # Rhea → EnzymeML
│   ├── SKID/
│   │   └── step1_skid.py          # SKiD → EnzymeML
│   ├── RetroBioCat/
│   │   ├── step0_pub.py           # Filter YAMLs by target PMIDs
│   │   └── step1_test.py          # RetroBioCat → EnzymeML
│   ├── doi_pub/
│   │   └── step0.py               # DOI → PMID converter
│   ├── mutation/                  # Mutation handling pipeline
│   │   ├── step0.py               # Fill sequences & classify WT/Mutant
│   │   ├── step1.py               # UniProt fetch + mutation retry
│   │   ├── step1_2.py             # Pattern-based mutation retry
│   │   └── step2_multi.py         # Resolve records with multiple UniProt IDs
│   └── merge/
│       └── merge_sequences.py     # Parallel sequence comparison & merge
├── schemas/
│   └── enzymeml-v2-extended.yaml  # Central extended EnzymeML schema
├── output/                        # Standardized EnzymeML YAML/JSON (local)
├── enzymeml_utils.py              # Shared helper module (must be in PYTHONPATH)
└── requirements.txt               # Python dependencies
```
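Note on `enzymeml_utils.py`: the converters import this shared module, so the repository root must be on `PYTHONPATH` (for example via `export PYTHONPATH=$PWD` from the repo root). A minimal in-code alternative, shown purely as an illustration:

```python
import sys

# Make the repository root importable so `import enzymeml_utils` works;
# adjust the path to wherever you cloned the repository.
sys.path.insert(0, "/path/to/enzytransfer")

import enzymeml_utils  # shared helpers used by the step1_* converters
```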
A typical end-to-end pipeline for SABIO-RK:
1. Prepare raw data

   Place your original tables under `data/`, for example:

   ```
   data/Sabio_rk/raw_kcat.csv
   schemas/enzymeml-v2-extended.yaml
   ```

2. (Optional) Normalize headers with an LLM

   ```bash
   python data_sources/BRENDA/step0_llm.py \
     --input data/Sabio_rk/raw_kcat.csv \
     --schema schemas/enzymeml-v2-extended.yaml \
     --output data/Sabio_rk/final_standardized.csv \
     --api-key sk-... \
     -v
   ```

3. Convert SABIO-RK to unified EnzymeML

   ```bash
   python data_sources/sabio_rk/step1_extract.py \
     --input data/Sabio_rk/final_standardized.csv \
     --schema schemas/enzymeml-v2-extended.yaml \
     --output-dir output/sabio_rk \
     --format yaml
   ```

4. (Optional) Run mutation and merge tools

   ```bash
   # Normalize sequences and classify wildtype vs mutant
   python data_sources/mutation/step0.py \
     --input output/sabio_rk \
     --report-dir mutation_reports \
     --ok-dir mutation_exports/ok \
     --issues-dir mutation_exports/issues

   # Retry failed mutations using UniProt sequences
   python data_sources/mutation/step1.py \
     --input mutation_exports/issues \
     --cache-dir uniprot_cache

   # Merge multiple sources, e.g. SABIO-RK + BRENDA
   python data_sources/merge/merge_sequences.py \
     output/sabio_rk \
     output/brenda \
     -o output/merged \
     -j 8
   ```
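After the conversion step you can sanity-check the result by loading a few of the generated files with `pyyaml` (a sketch; adjust the glob if the converter writes `.yml`, and note that the top-level keys depend on the schema):

```python
import glob

import yaml

# Print a one-line summary of each standardized SABIO-RK record.
for path in sorted(glob.glob("output/sabio_rk/*.yaml"))[:10]:
    with open(path) as fh:
        doc = yaml.safe_load(fh)
    print(path, "->", sorted(doc))
```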
Below are minimal working examples for each main script. For full options and explanations, run `python path/to/script.py --help` and/or read the docstring at the top of each script.
Script: `data_sources/BRENDA/step0_llm.py`
Purpose: map arbitrary CSV/Excel column names to schema-defined field names using a large language model.

```bash
python data_sources/BRENDA/step0_llm.py \
  --input data/Sabio_rk/raw_kcat.csv \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output data/Sabio_rk/final_standardized.csv \
  --api-key sk-... \
  -v
```

Key arguments:

- `--input` (required): input file (`.csv`/`.xlsx`/`.parquet`/`.json`/`.tsv`).
- `--schema` (optional): LinkML-style schema YAML (recommended).
- `--output`: output file (default: `<input>_standardized.csv`).
- `--api-key`: OpenAI API key (or set `OPENAI_API_KEY`).
- `--model`: OpenAI model (default: `gpt-4o`).
- `--temperature`: LLM temperature (default: `0.0`).
- `--max-tokens`: max LLM response tokens (default: `2500`).
- `--encoding`: force input encoding (auto-detected if not set).
- `--sep`: force CSV separator (auto-detected if not set).
- `--verbose` / `-v`: verbose output.
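Conceptually, the step infers and applies a header mapping like the pandas sketch below; both sides of the example mapping are made up here, since the real mapping is what the LLM derives from your headers and the schema:

```python
import pandas as pd

# Hypothetical result of the LLM step: arbitrary source headers renamed to
# schema field names (this mapping is illustrative only).
header_map = {
    "Enzyme Name": "protein_name",
    "kcat [1/s]": "kcat_value",
}
df = pd.read_csv("data/Sabio_rk/raw_kcat.csv")
df.rename(columns=header_map).to_csv(
    "data/Sabio_rk/final_standardized.csv", index=False
)
```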
Script: `data_sources/sabio_rk/step1_extract.py`
Purpose: convert SABIO-RK data into the same EnzymeML-style format as the BRENDA exporter.

```bash
python data_sources/sabio_rk/step1_extract.py \
  --input data/Sabio_rk/final_standardized.csv \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output-dir output/sabio_rk \
  --format yaml \
  --group-by entry
```

Key arguments:

- `--input` (required): input CSV/TSV/Excel file.
- `--schema` (required): schema YAML file.
- `--output-dir` (required): directory for EnzymeML output.
- `--format`: `yaml`, `json`, or `both` (default: `yaml`).
- `--group-by`:
  - `entry`: one file per SABIO entry.
  - `protein_substrate`: merge rows with the same protein+substrate.
- `--limit`: limit the number of entries to process (default: `0`, meaning no limit).
Script: `data_sources/BRENDA/step1_join.py`
Purpose: convert BRENDA CSV exports into unified EnzymeML.

```bash
python data_sources/BRENDA/step1_join.py \
  --brenda-dir data/BRENDA/ \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output-dir output/brenda \
  --format yaml \
  --group-by pair
```

Key arguments:

- `--brenda-dir` (required): path to a BRENDA ZIP file or folder with BRENDA CSVs.
- `--schema` (required): schema YAML file.
- `--output-dir` (required): directory for EnzymeML output.
- `--format`: `yaml`, `json`, or `both` (default: `json`).
- `--group-by`: grouping strategy: `pair`, `pair-ref`, or `tn-row`.
- `--limit`: maximum number of groups to process (default: `0`, meaning no limit).
The script includes several BRENDA-specific normalization rules (for example, kcat units, activator/inhibitor placeholders, reference filtering). See the script docstring for details.
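For example, kcat unit normalization amounts to rescaling values to a canonical unit, as in this sketch (the conversion factors and unit spellings here are assumptions; the script's docstring is authoritative):

```python
# Illustrative kcat unit normalization to s^-1; the actual rule set in
# step1_join.py may cover more spellings and edge cases.
KCAT_FACTORS = {"s^(-1)": 1.0, "1/s": 1.0, "min^(-1)": 1 / 60, "1/min": 1 / 60}

def normalize_kcat(value: float, unit: str) -> tuple[float, str]:
    return value * KCAT_FACTORS[unit], "1/s"
```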
Script: `data_sources/Rhea/step1_join_modified.py`
Purpose: convert Rhea TSV files into the unified EnzymeML structure, compatible with the SABIO-RK and BRENDA outputs.

```bash
python data_sources/Rhea/step1_join_modified.py \
  --input data/Rhea/ \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output-dir output/rhea \
  --format yaml \
  --limit 20000
```

Key arguments:

- `--input` (required): directory containing Rhea TSV files, or a single TSV file (its parent directory will be used).
- `--schema` (required): EnzymeML v2 extended schema YAML file.
- `--output-dir` (required): directory for EnzymeML files.
- `--format`: `yaml`, `json`, or `both` (default: `yaml`).
- `--rhea-ids` / `-r`: specific Rhea IDs to process (if omitted, all IDs present in the TSV file(s) are used).
- `--limit` / `-l`: maximum number of reactions to process (default: `20000`; the reaction list is truncated to this length).
Script: `data_sources/SKID/step1_skid.py`
Purpose: convert the SKiD main dataset into EnzymeML, grouped by (UniProt_ID, Substrate, Mutation).

```bash
python data_sources/SKID/step1_skid.py \
  --input data/SKID/Main_dataset_v1.xlsx \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output-dir output/skid \
  --format yaml \
  --output-mode both
```

Key arguments:

- `--input` (required): SKiD Excel file (e.g. `Main_dataset_v1.xlsx`).
- `--schema` (required): schema YAML file.
- `--output-dir` (required): directory for EnzymeML output.
- `--format`: `yaml`, `json`, or `both` (default: `yaml`).
- `--output-mode`:
  - `both`: WT and mutant (default).
  - `wt_only`
  - `mutant_only`
- `--limit`: limit the number of entries to process (default: `0`, meaning no limit).

The script:

- Groups rows by `(UniProt_ID, Substrate, Mutation)` (see the sketch below).
- Loads the `Unique_substrates` sheet to enrich substrate information (InChI, InChIKey, external IDs).
- Attaches external identifiers and reference metadata.
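The grouping step corresponds to an ordinary pandas `groupby`, roughly like this (column names assumed to match the SKiD sheet):

```python
import pandas as pd

# Each (UniProt_ID, Substrate, Mutation) group becomes one EnzymeML document.
df = pd.read_excel("data/SKID/Main_dataset_v1.xlsx")
for (uniprot_id, substrate, mutation), rows in df.groupby(
    ["UniProt_ID", "Substrate", "Mutation"], dropna=False
):
    print(uniprot_id, substrate, mutation, len(rows))
```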
Script: `data_sources/RetroBioCat/step1_test.py`
Purpose: convert RetroBioCat activity tables into the unified EnzymeML format.

```bash
python data_sources/RetroBioCat/step1_test.py \
  --input data/RetroBioCat/trial_activity_data.xlsx \
  --schema schemas/enzymeml-v2-extended.yaml \
  --output-dir output/retrobiocat \
  --format yaml \
  --group-by entry
```

Key arguments:

- `--input` (required): RetroBioCat Excel file.
- `--schema` (required): schema YAML file.
- `--output-dir` (required): directory for EnzymeML output.
- `--format`: `yaml`, `json`, or `both` (default: `yaml`).
- `--group-by`: `entry` or `enzyme_substrate`.
- `--limit`: maximum number of entries to process (default: `0`, meaning no limit).
Script: `data_sources/RetroBioCat/step0_pub.py`
Purpose: select YAML files that contain any PMID from a built-in target list.

```bash
python data_sources/RetroBioCat/step0_pub.py \
  --input-dir output/retrobiocat \
  --out-dir output/retrobiocat_filtered \
  --move
```

Key arguments:

- `--input-dir` (required): directory containing `.yml`/`.yaml` files.
- `--out-dir` (required): output directory for filtered files and reports.
- `--move`: if set, move files instead of copying.

The script:

- Recursively scans YAMLs under `--input-dir`.
- Extracts PMIDs from multiple fields / URL patterns.
- Copies/moves matched YAMLs to `by_target_pmids/` under `--out-dir`.
- Writes `pmid_matches.csv` and `target_pmids_not_found.csv`.
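PMID extraction from URL-like fields boils down to pattern matching; a simplified sketch of the idea (the script's actual field list and patterns may be broader):

```python
import re

# Matches both pubmed.ncbi.nlm.nih.gov/<id> and legacy .../pubmed/<id> URLs.
PMID_RE = re.compile(r"pubmed(?:\.ncbi\.nlm\.nih\.gov)?/(\d{1,8})")

def extract_pmids(text: str) -> set[str]:
    return set(PMID_RE.findall(text))

assert extract_pmids("https://pubmed.ncbi.nlm.nih.gov/12345678/") == {"12345678"}
```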
Script: `data_sources/doi_pub/step0.py`
Purpose: read an Excel file, map DOIs to PMIDs via NCBI, and write a new Excel with an added PMID column.

Usage pattern:

1. Open the script and locate the `if __name__ == "__main__":` block.

2. Set the parameters there, for example:

   ```python
   input_file = "../../data/RetroBioCat/trial_activity_data.xlsx"
   output_file = "../../data/RetroBioCat/trial_activity_data_with_pmid.xlsx"
   your_email = "your_email@example.com"
   # the column that contains DOIs in your sheet:
   doi_column = "html_doi"
   ```

3. Run:

   ```bash
   python data_sources/doi_pub/step0.py
   ```
The script prints progress and statistics and writes the updated Excel file.
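The DOI-to-PMID lookup itself can be done against NCBI E-utilities, roughly as below (a sketch of the general approach; the script may use a different NCBI endpoint, so check its docstring):

```python
import requests

def doi_to_pmid(doi: str, email: str) -> str | None:
    """Look up the PMID for a DOI via NCBI E-utilities (esearch)."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "pubmed", "term": f"{doi}[DOI]", "retmode": "json", "email": email},
        timeout=30,
    )
    resp.raise_for_status()
    ids = resp.json()["esearchresult"]["idlist"]
    return ids[0] if ids else None
```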
Mutation-related scripts live in `data_sources/mutation/` and operate on EnzymeML YAML files.

Script: `data_sources/mutation/step0.py`
Purpose: write protein sequences into YAMLs, set `variant_type`, and classify records as wildtype or mutant. Also generates CSV reports.

```bash
python data_sources/mutation/step0.py \
  --input output/sabio_rk \
  --report-dir mutation_reports \
  --ok-dir mutation_exports/ok \
  --issues-dir mutation_exports/issues
```

Key arguments:

- `--input` / `-i` (required): YAML file or directory (recursively scanned).
- `--report-dir`: directory for CSV reports (default: `./mutation_reports`).
- `--ok-dir`: directory for “OK” YAML exports (default: `../../output/mutation_exports/ok`).
- `--issues-dir`: directory for “Issue” YAML exports (default: `../../output/mutation_exports/issues`).

The script normalizes `report_dir`, `ok_dir`, and `issues_dir` to absolute paths before processing.
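Under the hood, validating a point mutation like `A123V` reduces to checking and substituting one residue; a simplified sketch (the pipeline's real parser handles more mutation formats):

```python
import re

MUT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")  # e.g. "A123V"

def apply_mutation(seq: str, mutation: str) -> str:
    """Apply a single point mutation, verifying the wildtype residue first."""
    wt, pos, new = MUT_RE.match(mutation).groups()
    idx = int(pos) - 1  # mutation positions are 1-based
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + new + seq[idx + 1:]
```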
Script: `data_sources/mutation/step1.py`
Purpose: for records with issues, fetch UniProt sequences, retry applying mutations, and update YAMLs.

```bash
python data_sources/mutation/step1.py \
  --input mutation_exports/issues \
  --output mutation_step1_output \
  --report mutation_step1_report.csv \
  --checkpoint mutation_step1_checkpoint.json \
  --cache-dir uniprot_cache
```

Key arguments:

- `--input` / `-i` (required): directory containing issue YAML files.
- `--output` / `-o`: output directory for corrected YAMLs (default: `./mutation_exports/retry_success`).
- `--report` / `-r`: path for retry report CSV (default: `./mutation_reports/retry_report.csv`).
- `--checkpoint` / `-c`: checkpoint JSON file (default: `./mutation_reports/.checkpoint.json`).
- `--cache-dir`: directory for UniProt sequence cache (default: `./uniprot_cache`).
- `--rate-limit`: minimum seconds between UniProt API requests (default: `1.0`).
- `--resume`: resume from last checkpoint.
- `--clear-checkpoint`: clear checkpoint and start fresh.

All paths are normalized to absolute paths before use.
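Fetching a UniProt sequence with a local cache can look like the following (a sketch in the spirit of `--cache-dir`; the script's own implementation may differ):

```python
from pathlib import Path

import requests

def fetch_uniprot_sequence(accession: str, cache_dir: str = "uniprot_cache") -> str:
    """Return the protein sequence for a UniProt accession, caching the FASTA."""
    cache = Path(cache_dir) / f"{accession}.fasta"
    if not cache.exists():
        resp = requests.get(
            f"https://rest.uniprot.org/uniprotkb/{accession}.fasta", timeout=30
        )
        resp.raise_for_status()
        cache.parent.mkdir(parents=True, exist_ok=True)
        cache.write_text(resp.text)
    return "".join(cache.read_text().splitlines()[1:])  # drop the FASTA header
```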
Script: `data_sources/mutation/step1_2.py`
Purpose: use regex / pattern-based alignment to retry remaining failed mutations, with checkpoint support.

```bash
python data_sources/mutation/step1_2.py \
  --input mutation_exports/issues \
  --output mutation_step1_2_output \
  --report mutation_step1_2_report.csv
```

Key arguments:

- `--input` / `-i` (required): directory containing issue YAML files.
- `--output` / `-o`: output directory for corrected YAMLs (default: `./mutation_exports/retry_success`).
- `--report` / `-r`: path for retry report CSV (default: `./mutation_reports/retry_report.csv`).
- `--checkpoint` / `-c`: checkpoint JSON file (default: `./mutation_reports/.checkpoint.json`).
- `--resume`: resume from last checkpoint.
- `--clear-checkpoint`: clear checkpoint and start fresh.
Script: `data_sources/mutation/step2_multi.py`
Purpose: handle records where `uniprotid` includes multiple IDs (separated by `;`, `,`, `|`, etc.), test mutations against each candidate, and mark records as resolved or unresolved.

```bash
python data_sources/mutation/step2_multi.py \
  --input mutation_exports/issues \
  --resolved mutation_step2_resolved \
  --unresolved mutation_step2_unresolved \
  --report mutation_step2_report.csv \
  --cache-dir uniprot_cache
```

Key arguments:

- `--input` / `-i` (required): directory containing YAML files with multiple UniProt IDs.
- `--resolved`: output directory for resolved files (default: `./mutation_exports/multi_id_resolved`).
- `--unresolved`: output directory for unresolved files (default: `./mutation_exports/multi_id_unresolved`).
- `--report` / `-r`: path for report CSV (default: `./mutation_reports/multi_id_report.csv`).
- `--checkpoint` / `-c`: checkpoint JSON file (default: `./mutation_reports/.checkpoint_multi_id.json`).
- `--cache-dir`: directory for UniProt sequence cache (default: `./uniprot_cache`).
- `--rate-limit`: minimum seconds between UniProt API requests (default: `1.0`).
- `--resume`: resume from last checkpoint.
- `--clear-checkpoint`: clear checkpoint and start fresh.

All paths are normalized to absolute paths before use.
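The resolution logic is conceptually: split the ID field on the common separators, then keep the first accession whose sequence accepts the mutation. A sketch (the helper names in the comments are illustrative):

```python
import re

def candidate_ids(field: str) -> list[str]:
    """Split a multi-ID field like 'P12345; Q67890|O11111' into accessions."""
    return [t for t in re.split(r"[;,|\s]+", field) if t]

# for acc in candidate_ids(record["uniprotid"]):
#     seq = fetch_uniprot_sequence(acc)              # see the step1 sketch above
#     if mutation_applies(seq, record["mutation"]):  # hypothetical helper
#         mark_resolved(record, acc)                 # hypothetical helper
```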
Script: `data_sources/merge/merge_sequences.py`
Purpose: scan one or more directories of EnzymeML YAML files, compare sequences/metadata, and merge overlapping entries. Uses multiprocessing for speed.

```bash
python data_sources/merge/merge_sequences.py \
  output/sabio_rk \
  output/brenda \
  output/rhea \
  -o output/merged \
  -j 8
```

Key arguments:

- `directories` (positional, required): one or more directories to scan (at least two, unless you pass `--allow-single-dir`).
- `-o`, `--output-dir`: merged output directory (default: `merged_output`).
- `-j`, `--jobs`: number of worker processes (default: number of CPU cores).
- `--allow-single-dir`: allow using a single directory (for testing / internal duplicate checks).
- `-v`, `--verbose`: verbose output.
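At its core, the duplicate test is a comparison keyed on the normalized protein sequence; a minimal sketch of that idea (the real script also weighs metadata and fans the comparisons out over `-j` worker processes):

```python
from hashlib import sha1

def sequence_key(seq: str) -> str:
    """Entries whose normalized sequences hash identically are merge candidates."""
    return sha1(seq.strip().upper().encode()).hexdigest()
```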
- Adding a new data source
  1. Create `data_sources/NEW_SOURCE/`.
  2. Implement `step1_new_source.py` that:
     - Reads your raw tables under `data/NEW_SOURCE/`.
     - Uses the shared schema (`schemas/enzymeml-v2-extended.yaml`).
     - Constructs EnzymeML-style YAML/JSON consistent with the existing `step1_*` scripts (see the skeleton sketch below).
- Changing the schema
  - Update `schemas/enzymeml-v2-extended.yaml` (or your schema file).
  - Adjust the `enzymeml_utils` helpers and any affected `step1_*` scripts.
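A hypothetical skeleton for such a converter, just to show the overall shape (every name and field below is illustrative, not part of the existing codebase):

```python
# data_sources/NEW_SOURCE/step1_new_source.py -- hypothetical skeleton
import argparse
from pathlib import Path

import pandas as pd
import yaml

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True)
    ap.add_argument("--schema", required=True)
    ap.add_argument("--output-dir", required=True)
    args = ap.parse_args()

    schema = yaml.safe_load(Path(args.schema).read_text())  # guides field mapping
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for i, row in pd.read_csv(args.input).iterrows():
        record = {
            "proteins": [],        # fill from `row`, guided by `schema`
            "small_molecules": [],
            "reactions": [],
            "parameters": [],
        }
        (out_dir / f"entry_{i}.yaml").write_text(
            yaml.safe_dump(record, sort_keys=False)
        )

if __name__ == "__main__":
    main()
```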