Sixth to eighth week work

## The sixth to eighth weeks

1. I found a bug in the `get_enzyme_type()` method: in some PDB files the enzyme type consists of several words separated by single spaces. So I now use the date as the separator to clip the enzyme type from the first line.

2. The second bug is in the `get_general_table()` method: some PDB files have no missing-residues section. I added the extra condition `if isinstance(Missing, str) or Missing is None:` to handle files without missing residues.

3. In the data frame returned by the `get_atom_hetatm_table()` method, I added an extra column `name`, which holds the original atom symbol, next to the full atomic name column `Name`. At the same time, I switched from `join()` to `merge()` for combining the two pandas data frames.

4. The file now shows the atomic table, which can be merged with the atom and hetatm table.

5. **We found that the average time to process 100 random PDB files is 248.84 s. Using the formula from the email you sent me, I calculated that we need around 5 days to extract the general tables and the atom and hetatm tables for all 170k PDB files.**

6. I found that a couple of PDB files are `None`, so I added control flow that raises a warning when a PDB file is `None`. After searching the RCSB PDB website ([https://www.rcsb.org/structure/4U20](https://www.rcsb.org/structure/4U20)), we found that this entry does have a corresponding file, but because of its huge size it is not available in PDB format like the other files. It is only provided in PDBx format, which our current class cannot handle.

7. I also found that some PDB files are available in PDB format but lack some of the desired fields. I added control flow to handle this: when a field we want is missing from the PDB file, we raise a warning telling users that the field is missing in that file.

8. I found another bug in the `get_name()` method, which occurs when the name in the PDB file is recorded across multiple lines. I added code to handle this case (`if ";" in res:`). We can test it with `pdb("100d").get_name()`, which returns `DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP*)-R(*G)-3')`.

9. *We want to write a check function or process, but I have my own opinion.* I found this to be very hard: there are a lot of PDB files, and their formats are not unified, with no standard format across them. A function or process that checks every entry extracted by these methods would therefore still be biased and inaccurate. So I suggest that we fix bugs as we find them while using these tables.
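To illustrate item 1, here is a minimal sketch of clipping the enzyme type at the date on the `HEADER` line. It is not the actual code of `get_enzyme_type()`: the function name `classification_from_header` and the sample line are made up for illustration, and the sketch assumes the date has the usual `DD-MMM-YY` form.

```python
import re
from typing import Optional

# Deposition dates on a HEADER line look like 05-DEC-94 (DD-MMM-YY).
DATE_PATTERN = re.compile(r"\d{2}-[A-Z]{3}-\d{2}")


def classification_from_header(header_line: str) -> Optional[str]:
    """Return the text between the HEADER keyword and the date, or None."""
    if not header_line.startswith("HEADER"):
        return None
    body = header_line[len("HEADER"):]
    match = DATE_PATTERN.search(body)
    if match is None:
        return None
    # Everything before the date is the (possibly multi-word) enzyme type.
    return body[:match.start()].strip()


# Made-up sample line, not copied from a real entry.
sample = "HEADER    DNA BINDING PROTEIN                     01-JAN-00   1ABC"
print(classification_from_header(sample))
```

Splitting at the date rather than on whitespace keeps multi-word types such as the one in the sample line in one piece.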
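For items 3 and 4, the difference between `join()` and `merge()` is that `join()` aligns on the index by default, while `merge()` matches rows on shared columns. The sketch below shows a column-based merge on made-up data; the column `serial`, the lookup table, and its contents are assumptions for illustration and do not reproduce the real output of `get_atom_hetatm_table()`.

```python
import pandas as pd

# Made-up rows standing in for parsed ATOM/HETATM records.
atoms = pd.DataFrame({"serial": [1, 2, 3, 4], "name": ["N", "C", "O", "ZN"]})

# Made-up lookup table: atom symbol -> full atomic name.
full_names = pd.DataFrame(
    {"name": ["N", "C", "O"], "Name": ["Nitrogen", "Carbon", "Oxygen"]}
)

# merge() matches on the shared `name` column; how="left" keeps every atom row
# and leaves `Name` as NaN where the lookup table has no entry.
table = atoms.merge(full_names, on="name", how="left")
print(table)
```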
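The estimate in item 5 can be checked with a short calculation. The formula from the email is not reproduced here, so this sketch simply assumes that the run time scales linearly with the number of files and that the files are processed one after another.

```python
# Figures taken from item 5.
SECONDS_PER_BATCH = 248.84  # average time for a batch of 100 random files
FILES_PER_BATCH = 100
TOTAL_FILES = 170_000       # "170k" PDB files

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: linear scaling, no parallelism.
seconds_per_file = SECONDS_PER_BATCH / FILES_PER_BATCH
total_seconds = seconds_per_file * TOTAL_FILES
total_days = total_seconds / SECONDS_PER_DAY

print(f"{seconds_per_file:.4f} s per file, about {total_days:.1f} days in total")
```

Under these assumptions the result is about 4.9 days, which agrees with the figure of around 5 days.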
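The warnings described in items 6 and 7 could look roughly like the following. This is only a sketch of the pattern, not the class's real code: the helper `get_field`, the dictionary representation of a parsed file, and the field names are hypothetical.

```python
import warnings
from typing import Optional


def get_field(record: Optional[dict], field: str, pdb_id: str) -> Optional[str]:
    """Return one field of a parsed PDB record, warning instead of failing."""
    if record is None:
        # Item 6: no PDB-format file could be loaded for this entry.
        warnings.warn(f"PDB entry {pdb_id} is None (no PDB-format file)", UserWarning)
        return None
    value = record.get(field)
    if value is None:
        # Item 7: the file exists but lacks the requested field.
        warnings.warn(f"Field '{field}' is missing in PDB entry {pdb_id}", UserWarning)
    return value


# Made-up records for demonstration.
print(get_field({"name": "EXAMPLE PROTEIN"}, "name", "xxxx"))
print(get_field({"name": "EXAMPLE PROTEIN"}, "enzyme_type", "xxxx"))
print(get_field(None, "name", "yyyy"))
```

Raising warnings rather than exceptions lets a long batch run continue past problematic entries while still recording which files need attention.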
