Large files and method of parsing BAM #35

Description

@cerebis

BAM files from real experimental data have shown that the parsing method currently employed is very inefficient on large files.

As of now, sequences are processed in a fixed order, which makes determining the offset within the contact matrix trivial (a running sum updated once per outer-loop iteration). Unfortunately, the pysam method fetch(seq_name) appears to incur substantial up-front IO that scales with BAM file size, causing a significant delay on every invocation. With many reference sequences (for example, from a fragmented WGS assembly), that delay is paid once per sequence and becomes a huge penalty.

Therefore, we need a method that determines the contact matrix offset for each alignment from the predetermined sequence order and the fields available in the BAM file, rather than relying on a separate fetch(seq_name) call per sequence. This should not pose a problem.
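
A rough sketch of what this could look like, not the project's actual implementation. The function names (`build_offsets`, `matrix_index`, `count_contacts`), the fixed `bin_size` binning, and the read-pair filtering are all assumptions made for illustration. The offsets are precomputed once from the sequence order, and the BAM file is then read in a single sequential pass with `fetch(until_eof=True)`, looking up each alignment's offset by reference name.

```python
from collections import Counter
from itertools import accumulate
from math import ceil


def build_offsets(seq_order, seq_lengths, bin_size):
    """Map each sequence name to its first bin index in the contact matrix.

    seq_order is the predetermined order of sequences; seq_lengths maps
    name -> length in bp. Fixed-size binning is an assumption of this sketch.
    """
    n_bins = [ceil(seq_lengths[name] / bin_size) for name in seq_order]
    # Each sequence starts where the previous ones end (running sum).
    starts = [0] + list(accumulate(n_bins))[:-1]
    return dict(zip(seq_order, starts)), sum(n_bins)


def matrix_index(offsets, seq_name, pos, bin_size):
    """Matrix row/column for a 0-based position on the named sequence."""
    return offsets[seq_name] + pos // bin_size


def count_contacts(bam_path, seq_order, bin_size):
    """Single sequential pass over the BAM file (hypothetical helper)."""
    import pysam  # imported here so the helpers above work without pysam

    contacts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        lengths = dict(zip(bam.references, bam.lengths))
        offsets, _ = build_offsets(seq_order, lengths, bin_size)
        # until_eof=True iterates in file order without per-sequence seeks.
        for read in bam.fetch(until_eof=True):
            # Assumed filter: count each mapped pair once, via read 1.
            if (read.is_unmapped or not read.is_paired
                    or read.mate_is_unmapped or not read.is_read1):
                continue
            i = matrix_index(offsets, read.reference_name,
                             read.reference_start, bin_size)
            j = matrix_index(offsets, read.next_reference_name,
                             read.next_reference_start, bin_size)
            contacts[(min(i, j), max(i, j))] += 1
    return contacts
```

Because the offset lookup is a dictionary access keyed on the reference name, the order in which alignments appear in the file no longer matters, only the order given in `seq_order`.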
