Tests with BAM files based on real experimental data have shown that large files are parsed very inefficiently by the method currently employed.
At present, sequences are processed in a fixed order, which makes determining the offset within the contact matrix trivial (it is a running sum updated on each pass of the outer loop). Unfortunately, the pysam method fetch(seq_name) appears to require substantial up-front IO that scales with BAM file size, resulting in a significant delay on every invocation. With many reference sequences (for example, from a fragmented WGS assembly), this becomes a large penalty.
Therefore, we need a method that determines the contact matrix offset from the predetermined sequence order and the fields available in the BAM file, so that the file can be read without a separate fetch per sequence. This should not pose a problem.
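
One possible shape for such a method is sketched below. It is not the existing implementation. The helper names (build_offsets, matrix_index, iter_contacts) and the bin_size parameter are hypothetical. The sketch also assumes that a contact is taken from a read and its mate, and that only read 1 of each pair is counted. Offsets are precomputed once from the sequence order and the reference lengths in the BAM header, and the file is then read in a single pass with fetch(until_eof=True) instead of one fetch(seq_name) call per sequence.

```python
from itertools import accumulate


def build_offsets(seq_order, seq_lengths, bin_size=1):
    """Map each sequence name to its first row/column in the contact matrix.

    seq_order   -- sequence names in the predetermined matrix order
    seq_lengths -- dict of sequence name -> length in bp (e.g. from the BAM header)
    bin_size    -- hypothetical binning parameter; 1 means one cell per base
    """
    # Number of bins each sequence occupies (ceiling division).
    sizes = [-(-seq_lengths[name] // bin_size) for name in seq_order]
    # Each sequence starts where the previous ones end.
    starts = [0] + list(accumulate(sizes))[:-1]
    return dict(zip(seq_order, starts))


def matrix_index(offsets, seq_name, pos, bin_size=1):
    """Convert a (sequence, 0-based position) pair to a matrix index."""
    return offsets[seq_name] + pos // bin_size


def iter_contacts(bam_path, seq_order, bin_size=1):
    """Yield (i, j) matrix indices in one pass over the BAM file.

    Assumption: a contact is a read paired with its mate, and only
    read 1 of each pair is used so that pairs are not counted twice.
    """
    import pysam  # imported here so the helpers above work without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        lengths = dict(zip(bam.references, bam.lengths))
        offsets = build_offsets(seq_order, lengths, bin_size)
        # until_eof=True streams records in file order, with no per-sequence seek.
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.mate_is_unmapped or not read.is_read1:
                continue
            a, b = read.reference_name, read.next_reference_name
            if a not in offsets or b not in offsets:
                continue  # sequence excluded from the matrix
            yield (matrix_index(offsets, a, read.reference_start, bin_size),
                   matrix_index(offsets, b, read.next_reference_start, bin_size))
```

Because the offsets come from a dictionary lookup rather than a sum accumulated in the outer loop, records can arrive in any order, which is what makes the single pass possible.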