Skip to content

[BUG or INTENTION?] H-clip CIGAR advances query position in aligned_coverage loop #148

Description

@yslee-genome4me

In AlignedContig.cpp (constructor, lines 69-76):

for (size_t j = 0; j < c.Length(); ++j) {
  if (c.Type() == 'M' || c.Type() == 'I')
    ++aligned_coverage[p];
  if (c.ConsumesQuery() || c.Type() == 'H')
    ++p;
}

aligned_coverage is sized to m_seq.length() (line 42), which is set from the longest Sequence() across all alignments in bav (lines 25-34). Since Sequence() returns
l_qseq bases (H-clips excluded per SAM spec), the primary alignment — which has no H-clips — always provides the full contig length L.

ConsumesQuery() returns false for H (htslib BAM_CIGAR_TYPE truth table, sam.h:102: BAM_CHARD_CLIP 0 0), but the explicit || c.Type() == 'H' forces p to advance by the
hard-clip length.

Initial concern: this could push p past the end of the vector, causing OOB writes.

But after deeper analysis, I now think this is intentional and correct. Here is the reasoning:

  1. The coverage loop runs for all non-secondary alignments (!SecondaryFlag(), line 68), which includes supplementary alignments (flag 0x800 is not caught by
    SecondaryFlag() which checks 0x100).
  2. For supplementary alignments, BWA-MEM uses H-clips where the primary uses S-clips for the same contig bases. Without || c.Type() == 'H', supplementary coverage
    would be written to wrong positions
  3. OOB cannot occur because per SAM spec, for any alignment of a read of length L: query-consuming (M+I+S+=/X) + H = L. Since m_seq.length() = L (from the primary
    which has full SEQ), the final p after processing any alignment's CIGAR is exactly L. The last array access is at p = L-1 (the ++p executes after the access), which is
    within bounds.
  4. Removing || c.Type() == 'H' (as I initially considered) would actually introduce a new bug — supplementary alignment coverage would be mapped to position 0 instead
    of the correct contig offset.

Could you confirm this is indeed the intended behavior?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions