data/external/
- enzy/TheData_unpruned.parquet
- from EnzyExtract
- brenda/brenda_2025_1.json
- BRENDA 2025.1
- from https://www.brenda-enzymes.org/download.php
- SabioRK
EnzyExtract 1 (see scripts/alarms/alarm_hallucination.py)
- scientific notation
- rows: 20890, PMIDs: 3802
- The kcat/Km method: parse kcat, Km, and kcat/Km. If the reported kcat/Km does not match kcat divided by Km, then something went wrong with the scientific notation.
- rows: 8692, PMIDs: 1511
- hallucination > 0.5
- at > 0.5, rows: 5720, PMIDs: 806
- repetition
- rows: 9630, PMIDs: 784
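A minimal sketch of the kcat/Km consistency check described above. This is not the code in scripts/alarms; the function name and the tolerance `tol_log10` are hypothetical, and the three values are assumed to already be in mutually consistent units.

```python
import math


def kcat_km_mismatch(kcat, km, kcat_km, tol_log10=0.5):
    """Return True when the reported kcat/Km disagrees with kcat / km.

    Assumptions (not from the original scripts):
    - all three values are already in consistent units,
    - tol_log10 is an illustrative tolerance, in orders of magnitude.
    """
    if min(kcat, km, kcat_km) <= 0:
        # Non-positive values cannot be compared on a log scale; flag them.
        return True
    expected = kcat / km
    return abs(math.log10(kcat_km / expected)) > tol_log10


# A reported kcat/Km of 1e-4 when kcat / km is 1e4 suggests a flipped exponent.
print(kcat_km_mismatch(10.0, 1e-3, 1e-4))
```

Comparing on a log scale keeps ordinary rounding differences from raising the alarm while still catching exponent errors, which are off by orders of magnitude.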
EnzyExtract 2
- too many sigfigs (see scripts/alarms/step2_alarm_sigfig.py)
- 111 PMIDs, most of which look correct on manual inspection
- out of distribution: use the intersection of BRENDA+EnzyExtract as "super-reliable". Consider points that are out of distribution.
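One possible way to count significant figures for the sigfig alarm above, assuming the extracted value is still available as text. This is a sketch, not the logic in step2_alarm_sigfig.py; `count_sigfigs`, `too_many_sigfigs`, and the cutoff `max_sigfigs` are hypothetical.

```python
from decimal import Decimal


def count_sigfigs(text):
    """Count significant figures in a numeric string such as '0.00450' or '1.20e5'.

    Decimal keeps trailing zeros and drops leading zeros, so the length of its
    digit tuple is the sigfig count. Trailing zeros of a plain integer ('1200')
    are counted as significant here, which is a simplifying assumption.
    Raises decimal.InvalidOperation for non-numeric text.
    """
    return len(Decimal(text.strip()).as_tuple().digits)


def too_many_sigfigs(text, max_sigfigs=5):
    """Flag values reported with more digits than max_sigfigs (hypothetical cutoff)."""
    return count_sigfigs(text) > max_sigfigs


print(count_sigfigs("0.00450"), too_many_sigfigs("12.3456789"))
```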
BRENDA
- redo the BRENDA/EnzyExtract correlation plot, but color points by the alarms above
- BRENDA/EnzyExtract correlation plot (see scripts/alarms/alarm_correlation.py)
- at kcat_diff > 1.1, rows: 4510, PMIDs: 1592
- out of distribution: use the intersection of BRENDA+EnzyExtract as "super-reliable". Consider points that are out of distribution.
- Use grand_biblio to add DOIs to BRENDA entries
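The kcat_diff > 1.1 alarm above could be computed roughly as follows. This sketch assumes kcat_diff is a symmetric fold difference between the EnzyExtract and BRENDA values; the actual definition in alarm_correlation.py may differ, and the function names and tuple layout are hypothetical.

```python
def fold_difference(a, b):
    """Symmetric fold difference between two positive values (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("values must be positive")
    return max(a / b, b / a)


def flag_disagreements(pairs, threshold=1.1):
    """Return (flagged rows, flagged PMIDs).

    pairs: iterable of (pmid, kcat_enzyextract, kcat_brenda) tuples.
    This layout is an assumption made for illustration.
    """
    flagged = [p for p in pairs if fold_difference(p[1], p[2]) > threshold]
    pmids = {p[0] for p in flagged}
    return flagged, pmids


rows = [("111", 10.0, 10.5), ("222", 3.0, 30.0), ("222", 5.0, 1.0)]
flagged, pmids = flag_disagreements(rows)
print(len(flagged), sorted(pmids))
```

Reporting rows and distinct PMIDs separately mirrors how the counts are listed in these notes.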
EnzyExtract
- scientific notation, 3802 PMIDs
- scientific notation with kcat, km, and kcat/km: 1511 PMIDs
- can give the LLM a "calculation" tool
- Verify that the kcat and Km values match what is provided in the image.
- Use the calculation tool to ensure that the purported kcat/Km indeed equals kcat divided by Km.
- If not, try flipping the signs of all exponents (for instance, 4 x 10^5 to 4 x 10^-5) and try again.
- hallucination threshold > 0.5: 806 PMIDs
- repetition threshold > 0.5: 784 PMIDs
- too many sigfigs: 111 PMIDs, most of which look correct on manual inspection
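The "flip the signs of all exponents and try again" step above can be sketched as below. This only illustrates the arithmetic that a calculation tool would perform; the (mantissa, exponent) representation, the function names, and the tolerance are all assumptions.

```python
import math


def agrees(kcat, km, kcat_km, tol_log10=0.5):
    """True when kcat_km is within tol_log10 orders of magnitude of kcat / km."""
    return abs(math.log10(kcat_km) - math.log10(kcat / km)) <= tol_log10


def check_with_flipped_exponents(kcat, km, kcat_km):
    """Each argument is a (mantissa, exponent) pair, e.g. (4.0, 5) for 4 x 10^5.

    Mantissas are assumed positive. Returns 'as_extracted', 'flipped',
    or 'unresolved'.
    """
    def value(pair, sign=1):
        mantissa, exponent = pair
        return mantissa * 10.0 ** (sign * exponent)

    if agrees(value(kcat), value(km), value(kcat_km)):
        return "as_extracted"
    # Retry with the sign of every exponent flipped (4 x 10^5 -> 4 x 10^-5).
    if agrees(value(kcat, -1), value(km, -1), value(kcat_km, -1)):
        return "flipped"
    return "unresolved"


# kcat = 2, Km read as 4 x 10^5, kcat/Km read as 5 x 10^-4
print(check_with_flipped_exponents((2.0, 0), (4.0, 5), (5.0, -4)))
```

Rows that come back as "unresolved" would still need manual inspection.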
BRENDA
- BRENDA/EnzyExtract correlation plot: 1592 PMIDs where kcat differs by more than 1.1-fold
- out of distribution: use the intersection of BRENDA+EnzyExtract as "super-reliable". Consider points that are out of distribution.
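One simple way to make "out of distribution" concrete is a z-score on log10 values, with the BRENDA+EnzyExtract intersection as the reference set. The notes do not specify a method, so the z-score approach, the cutoff, and all names below are assumptions.

```python
import math
import statistics


def fit_reference(reliable_values):
    """Mean and sample standard deviation of log10 values from the reliable set."""
    logs = [math.log10(v) for v in reliable_values]
    return statistics.mean(logs), statistics.stdev(logs)


def is_out_of_distribution(value, mean, std, z_cutoff=3.0):
    """Flag a value whose log10 lies more than z_cutoff deviations from the mean.

    z_cutoff is a hypothetical threshold. Non-positive values are always flagged.
    """
    if value <= 0:
        return True
    log_value = math.log10(value)
    if std == 0:
        return log_value != mean
    return abs(log_value - mean) / std > z_cutoff


mean, std = fit_reference([1.0, 10.0, 100.0])
print(is_out_of_distribution(1e6, mean, std))
```

In practice the reference statistics would probably be fitted per parameter (kcat, Km) and perhaps per enzyme class instead of over the whole table.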
Flags
- abbreviated substrate