This project joins structural data with kinetics data to build a new database. However, each EC number maps to many PDB IDs, so a Cartesian join on EC number alone inflates the dataset with pairings that carry no real meaning.
To reduce these meaningless records, we propose several approaches:
- In BRENDA, some kinetics entries under a given EC number are annotated with a specific UniProtKB accession. By pulling the mapping table from UniProtKB, we obtain a reference table linking PDB ID, UniProtKB accession, and EC number. Although PDB IDs and UniProtKB accessions have a many-to-many relationship, using the UniProtKB accession as an intermediate key to join PDB (keyed by PDB ID) with BRENDA (keyed by EC number) significantly reduces data redundancy.
- For kinetics entries that lack a UniProtKB accession, we can develop algorithms that match kinetics to structures using fuzzy search (or regular-expression search, etc.) on fields such as organism and substrate.
- Alternatively, we can find the sequence of the protein used for the kinetics measurement in the original paper and join structural data to kinetics data by sequence similarity. This takes three steps: first, find the original paper using its PubMed ID; second, locate the sequence in the paper and store it locally; third, run a sequence similarity search and assign each kinetics value one unique structure. This approach can match every kinetics value accurately to one structure, but it takes a lot of time and effort.
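
To illustrate why the first approach reduces redundancy, here is a minimal sketch comparing a join on EC number alone with a join through the UniProtKB accession. All records, accessions (`U1`, `U2`), PDB IDs (`PDB1` to `PDB3`), field names, and function names are made-up placeholders, not real BRENDA or UniProtKB data, and the real tables would be far larger.

```python
from collections import defaultdict

# Placeholder kinetics rows (assumed shape; not real BRENDA records).
kinetics = [
    {"ec": "1.1.1.1", "uniprot": "U1", "km_mM": 0.5},
    {"ec": "1.1.1.1", "uniprot": "U2", "km_mM": 1.2},
]

# Placeholder mapping rows: (PDB ID, UniProtKB accession, EC number).
mapping = [
    ("PDB1", "U1", "1.1.1.1"),
    ("PDB2", "U1", "1.1.1.1"),
    ("PDB3", "U2", "1.1.1.1"),
]


def join_by_ec(kinetics, mapping):
    """Pair every kinetics row with every PDB ID sharing its EC number."""
    pdb_by_ec = defaultdict(set)
    for pdb_id, _accession, ec in mapping:
        pdb_by_ec[ec].add(pdb_id)
    return [
        (row["ec"], row["uniprot"], pdb_id)
        for row in kinetics
        for pdb_id in sorted(pdb_by_ec.get(row["ec"], ()))
    ]


def join_by_uniprot(kinetics, mapping):
    """Pair kinetics rows only with PDB IDs mapped to the same accession."""
    pdb_by_key = defaultdict(set)
    for pdb_id, accession, ec in mapping:
        pdb_by_key[(ec, accession)].add(pdb_id)
    return [
        (row["ec"], row["uniprot"], pdb_id)
        for row in kinetics
        if row.get("uniprot")  # rows without an accession need another method
        for pdb_id in sorted(pdb_by_key.get((row["ec"], row["uniprot"]), ()))
    ]


ec_rows = join_by_ec(kinetics, mapping)
uniprot_rows = join_by_uniprot(kinetics, mapping)
print(len(ec_rows), len(uniprot_rows))
```

In this toy example the EC-only join pairs each kinetics row with all three structures, while the accession-based join keeps only the structures that belong to the same protein. The relationship is still many-to-many, so one kinetics row can match more than one structure.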
The three approaches above are listed by feasibility, from easiest to hardest. Ultimately, we want to match as many kinetics data points to structures as possible, and we welcome alternative approaches to this problem.
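
As one possible starting point for the fuzzy-search approach, the sketch below matches an organism name from a kinetics entry against the organism names attached to candidate structures, using Python's standard `difflib` module. The function name `match_organism`, the 0.8 cutoff, and the example names are assumptions chosen for illustration; a real matcher would likely also compare substrate and other fields.

```python
import difflib


def match_organism(kinetics_organism, structure_organisms, cutoff=0.8):
    """Return the closest structure organism name, or None if nothing is close.

    Names are lowercased and stripped before comparison; the cutoff of 0.8
    is an assumed value that would need tuning on real data.
    """
    normalized = {name.lower().strip(): name for name in structure_organisms}
    hits = difflib.get_close_matches(
        kinetics_organism.lower().strip(),
        list(normalized),
        n=1,
        cutoff=cutoff,
    )
    return normalized[hits[0]] if hits else None


candidates = ["Escherichia coli", "Homo sapiens"]
print(match_organism("Escherichia colli", candidates))
print(match_organism("Bacillus subtilis", candidates))
```

A similarity cutoff keeps unrelated organisms from being paired, at the cost of missing names that differ more heavily (for example, strain suffixes or synonyms), so the threshold is a trade-off rather than a fixed rule.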