The sixth - eighth week
I found a bug in the `get_enzyme_type()` method: in some PDB files the enzyme type consists of several words separated by single spaces. I now use the date as the separator to clip the enzyme type from the first line.
The second bug was in the `get_general_table()` method: some PDB files have no missing-residues section, so I added the extra condition `if isinstance(Missing, str) or Missing is None:` to handle files without missing residues.
I added one extra column, `name`, which holds the atom's original symbol, next to the full atomic name column `Name` in the data frame returned by the `get_atom_hetatm_table()` method. At the same time, I changed the function from `join()` to `merge()` where I combine the two pandas data frames.
Show the atomic table in the file, which can be merged with the atom and hetatm table.
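As a rough illustration of the switch from `join()` to `merge()`, the sketch below combines an atom table with an atomic lookup table on the `name` column. The rows, the `serial` column, and the full names are made up for the example; only the column names `name` and `Name` come from the notes above, and this is not the actual body of `get_atom_hetatm_table()`.

```python
import pandas as pd

# Hypothetical atom/hetatm rows; `name` holds the short atom symbol.
atoms = pd.DataFrame({"serial": [1, 2, 3], "name": ["N", "C", "O"]})

# Hypothetical atomic table mapping each symbol to its full name.
atomic_table = pd.DataFrame(
    {"name": ["C", "N", "O"], "Name": ["Carbon", "Nitrogen", "Oxygen"]}
)

# merge() matches rows on the shared `name` column, whereas join()
# aligns on the index by default. how="left" keeps every atom row.
merged = atoms.merge(atomic_table, on="name", how="left")
print(merged)
```

With `how="left"`, an atom whose symbol is absent from the atomic table keeps its row and gets `NaN` in `Name`, which makes gaps in the lookup table easy to spot.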
We found that processing 100 random PDB files takes 248.84 s on average. Using the formula from the email you sent me, I calculated that we need around 5 days to extract the general tables and the atom and hetatm tables of all 170k PDB files.
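The estimate can be reproduced with a simple linear extrapolation. This assumes the formula from the email is just per-file time multiplied by the number of files, run sequentially on one machine; the email itself is not quoted here, so treat this as a sanity check only.

```python
SECONDS_PER_BATCH = 248.84   # measured time for one batch of random files
BATCH_SIZE = 100             # files per batch
TOTAL_FILES = 170_000        # approximate number of PDB files
SECONDS_PER_DAY = 86_400

# Assumption: run time scales linearly with the number of files.
seconds_per_file = SECONDS_PER_BATCH / BATCH_SIZE
total_seconds = seconds_per_file * TOTAL_FILES
total_days = total_seconds / SECONDS_PER_DAY

print(f"{seconds_per_file:.4f} s per file, about {total_days:.1f} days in total")
```

This gives roughly 2.49 s per file and just under 5 days overall, which matches the figure above.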
I found that a couple of PDB files come back as `None`, so I wrote some control flow to raise warnings when a PDB file is `None`. After searching the RCSB PDB website (https://www.rcsb.org/structure/4U20), we found that this entry does have a corresponding file, but because of its huge size it is not available in PDB format like the other files. It is only provided in PDBx format, which our current class cannot handle.
I also found that some PDB files are available in PDB format but lack some of the desired fields. I wrote some control flow to handle this: when a field we want is missing from the PDB file, we raise a warning to tell users that the field is missing in this file.
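A minimal sketch of this kind of control flow is shown below. The helper `get_field`, its arguments, and the dictionary representation of a parsed file are all hypothetical and only illustrate the pattern of warning instead of failing; the real class may be organised differently.

```python
import warnings


def get_field(parsed, field, pdb_id):
    """Return parsed[field], or warn and return None when data is absent.

    `parsed` is a hypothetical dict of fields extracted from one PDB file,
    or None when no PDB-format file could be retrieved.
    """
    if parsed is None:
        warnings.warn(f"{pdb_id}: no PDB-format file available")
        return None
    if field not in parsed:
        warnings.warn(f"{pdb_id}: field '{field}' is missing in this file")
        return None
    return parsed[field]


# Usage with made-up data: one missing file and one missing field.
print(get_field(None, "enzyme_type", "4U20"))
print(get_field({"name": "EXAMPLE"}, "enzyme_type", "XXXX"))
```

Returning `None` after the warning lets a batch run over many files continue, and the warnings can be collected afterwards to list the affected entries.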
I found another bug in the `get_name()` method, which occurs when the name in the PDB file is recorded across multiple lines. I added code to handle this case (`if ";" in res:`). We can test it with `pdb("100d").get_name()`, which returns `DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP*)-R(*G)-3')`.
We want to write a check function or process, but I have my own opinion on this. I found that it is very hard to do, because there are a lot of PDB files and their formats are not unified; there is no standard format across these files. A function or process that checks each entry extracted by these methods would therefore still be biased and inaccurate. So I suggest that we fix bugs as we find them while using these tables.
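To illustrate the multi-line name case mentioned for `get_name()`, here is a minimal sketch that joins `COMPND` continuation lines and reads the `MOLECULE` value up to its terminating `;`. The function `extract_molecule_name` and the sample lines are hypothetical, and this is not the actual `get_name()` code; it only assumes the usual PDB layout in which the record text starts at column 11.

```python
def extract_molecule_name(compnd_lines):
    """Return the MOLECULE value from COMPND lines, or None if absent."""
    # Drop the record name and continuation number (columns 1-10),
    # then join the remaining text so wrapped values become one string.
    text = " ".join(line[10:].strip() for line in compnd_lines)
    # Each "KEY: value" token ends with a semicolon.
    for token in text.split(";"):
        key, _, value = token.partition(":")
        if key.strip() == "MOLECULE":
            return value.strip()
    return None


# Made-up COMPND block whose molecule name wraps onto a second line.
sample = [
    "COMPND    MOL_ID: 1;",
    "COMPND   2 MOLECULE: EXAMPLE PROTEIN WITH A VERY LONG",
    "COMPND   3 NAME;",
    "COMPND   4 CHAIN: A;",
]
print(extract_molecule_name(sample))
```

Joining with a single space is a simplification: a name that is broken in the middle of a word across two lines would gain an extra space, so the real method may need a different rule.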