In AlignedContig.cpp (constructor, lines 69-76):
for (size_t j = 0; j < c.Length(); ++j) {
if (c.Type() == 'M' || c.Type() == 'I')
++aligned_coverage[p];
if (c.ConsumesQuery() || c.Type() == 'H')
++p;
}
aligned_coverage is sized to m_seq.length() (line 42), which is set from the longest Sequence() across all alignments in bav (lines 25-34). Since Sequence() returns
l_qseq bases (H-clips excluded per SAM spec), the primary alignment — which has no H-clips — always provides the full contig length L.
ConsumesQuery() returns false for H (htslib BAM_CIGAR_TYPE truth table, sam.h:102: BAM_CHARD_CLIP 0 0), but the explicit || c.Type() == 'H' forces p to advance by the
hard-clip length.
Initial concern: this could push p past the end of the vector, causing OOB writes.
But after deeper analysis, I now think this is intentional and correct. Here is the reasoning:
- The coverage loop runs for all non-secondary alignments (!SecondaryFlag(), line 68), which includes supplementary alignments (flag 0x800 is not caught by
SecondaryFlag() which checks 0x100).
- For supplementary alignments, BWA-MEM uses H-clips where the primary uses S-clips for the same contig bases. Without || c.Type() == 'H', supplementary coverage
would be written to wrong positions
- OOB cannot occur because per SAM spec, for any alignment of a read of length L: query-consuming (M+I+S+=/X) + H = L. Since m_seq.length() = L (from the primary
which has full SEQ), the final p after processing any alignment's CIGAR is exactly L. The last array access is at p = L-1 (the ++p executes after the access), which is
within bounds.
- Removing || c.Type() == 'H' (as I initially considered) would actually introduce a new bug — supplementary alignment coverage would be mapped to position 0 instead
of the correct contig offset.
Could you confirm this is indeed the intended behavior?
In
AlignedContig.cpp(constructor, lines 69-76):aligned_coverage is sized to m_seq.length() (line 42), which is set from the longest Sequence() across all alignments in bav (lines 25-34). Since Sequence() returns
l_qseq bases (H-clips excluded per SAM spec), the primary alignment — which has no H-clips — always provides the full contig length L.
ConsumesQuery() returns false for H (htslib BAM_CIGAR_TYPE truth table, sam.h:102: BAM_CHARD_CLIP 0 0), but the explicit || c.Type() == 'H' forces p to advance by the
hard-clip length.
Initial concern: this could push p past the end of the vector, causing OOB writes.
But after deeper analysis, I now think this is intentional and correct. Here is the reasoning:
SecondaryFlag() which checks 0x100).
would be written to wrong positions
which has full SEQ), the final p after processing any alignment's CIGAR is exactly L. The last array access is at p = L-1 (the ++p executes after the access), which is
within bounds.
of the correct contig offset.
Could you confirm this is indeed the intended behavior?