jackieduckie/patformm

patformm

A tool for converting methylation information from BAM files with MM tags (Biomodal) or XM:Z tags (Illumina DRAGEN) to PAT format for methylation analysis.

Description

patformm processes BAM files containing methylation information in MM tags (e.g., MM:Z:C+C.,1,23;) or XM:Z tags (Illumina DRAGEN format) and converts them to PAT format through a two-step process:

  1. parse_mm_tags: Extracts methylation information from BAM files
  2. calculate_cpos: Processes methylation patterns and converts to CpG indices

Core Components

parse_mm_tags

Extracts essential information from BAM files:

  • Chromosome
  • Start position
  • CIGAR string
  • MM tag (Biomodal) or XM:Z tag (Illumina DRAGEN)
  • Sequence
  • Strand information (reverse/forward)

# Parse MM tags (default)
patformm parse_mm_tags --threads 8 -o output.bed input.bam

# Parse XM:Z tags (Illumina DRAGEN)
patformm parse_mm_tags --tag-type XM --threads 8 -o output.bed input.bam

calculate_cpos

Processes the extracted information to generate PAT format:

  • Maps read positions to reference positions using CIGAR strings
  • Identifies methylated C positions from MM tags or XM:Z tags
  • Converts to CpG indices using a reference CpG map
  • Handles both forward and reverse strands
  • Supports chunked processing for memory efficiency
  • Only processes CpG contexts (Z/z in XM:Z tags); CHG/CHH/unknown contexts are ignored

patformm calculate_cpos --threads 8 --chunk-size 1000000 -o output.pat input.bed
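
The first step in the list above, mapping read positions to reference positions, can be sketched as a walk over the CIGAR string. This is a minimal illustration, not patformm's actual code: the function name `read_to_ref_positions` is hypothetical, and the coordinate base of `start` is assumed to be whatever the intermediate BED file stores.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def read_to_ref_positions(start, cigar):
    """Return one entry per read base: the reference coordinate it
    aligns to, or None for inserted and soft-clipped bases.
    Hypothetical helper, not part of patformm."""
    ref_pos = start
    mapping = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":      # consumes both read and reference
            mapping.extend(range(ref_pos, ref_pos + length))
            ref_pos += length
        elif op in "IS":     # consumes read only
            mapping.extend([None] * length)
        elif op in "DN":     # consumes reference only
            ref_pos += length
        # H and P consume neither read nor reference
    return mapping
```

For example, `read_to_ref_positions(100, "2S3M1I2M2D1M")` gives `[None, None, 100, 101, 102, None, 103, 104, 107]`: the deletion advances the reference by two without adding a read base.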

Implementation Details

MM Tag Processing

  • Parses MM tags in format MM:Z:C+C,<positions>
  • Tracks methylated C positions in reads
  • Handles both forward (C) and reverse (G) strand methylation
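
As a rough sketch of the parsing described above, the function below turns an MM tag value into read offsets. It assumes the tag follows the SAM optional-fields convention, where each number is the count of candidate bases to skip before the next modified one, and where reverse-strand reads are counted over G from the 3' end of the stored sequence. The README does not confirm that patformm does it this way, and `mm_modified_positions` is a hypothetical name.

```python
def mm_modified_positions(mm_value, seq, reverse=False):
    """Return sorted 0-based offsets into seq flagged by the first
    cytosine entry of an MM tag value such as 'C+C.,1,23;'.
    Hypothetical helper; assumes SAM-spec skip-count semantics."""
    if mm_value.startswith("MM:Z:"):
        mm_value = mm_value[len("MM:Z:"):]
    entry = mm_value.split(";")[0]
    fields = entry.split(",")
    header = fields[0]
    skips = [int(x) for x in fields[1:] if x]
    if not header.startswith("C"):
        raise ValueError("expected a cytosine modification entry")
    if reverse:
        # Assumption: stored SEQ is reverse-complemented, so the original
        # C bases appear as G, counted from the right-hand end.
        candidates = [i for i in range(len(seq) - 1, -1, -1) if seq[i] == "G"]
    else:
        candidates = [i for i, base in enumerate(seq) if base == "C"]
    positions = []
    cursor = 0
    for skip in skips:
        cursor += skip               # skip unmodified candidate bases
        if cursor >= len(candidates):
            break                    # tag refers past the end of the read
        positions.append(candidates[cursor])
        cursor += 1                  # move past the modified base
    return sorted(positions)
```

With `seq = "ACGTCCGC"` (C at offsets 1, 4, 5, 7), the value `C+C.,1,1;` skips one C, marks offset 4, skips one more, and marks offset 7.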

XM:Z Tag Processing (Illumina DRAGEN / 5-Base Methylation)

  • Parses XM:Z tags containing byte-per-base methylation strings
  • Only processes CpG contexts (Z/z characters)
  • Z = methylated CpG, z = unmethylated CpG
  • CHG (X/x), CHH (H/h), and unknown (U/u) contexts are ignored
  • Handles both forward and reverse strands correctly
  • XM:Z string aligns with read sequence (5' to 3' of read)
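
The CpG filtering described above can be sketched in a few lines. The function name `xm_cpg_calls` is hypothetical; the only behavior taken from this README is that Z and z are kept and every other character is ignored.

```python
def xm_cpg_calls(xm):
    """Return (read_offset, call) pairs for CpG sites in an XM string.
    'C' marks a methylated CpG (Z) and 'T' an unmethylated one (z),
    matching the PAT alphabet. Hypothetical helper, not patformm's API."""
    calls = []
    for offset, ch in enumerate(xm):
        if ch == "Z":
            calls.append((offset, "C"))
        elif ch == "z":
            calls.append((offset, "T"))
        # X/x (CHG), H/h (CHH), U/u (unknown) and '.' are skipped
    return calls
```

For the string `..Z.h.z.x.U` this yields `[(2, "C"), (6, "T")]`; the offsets index into the XM string and still need to be mapped to reference coordinates through the CIGAR string.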

Important: Illumina 5-Base Chemistry

  • Illumina 5-base methylation uses enzymatic conversion that is OPPOSITE to bisulfite sequencing
  • Methylated C (5mC) is converted to T during library prep
  • Unmethylated C remains as C (protected from conversion)
  • This is the inverse of bisulfite-seq where unmethylated C → T
  • The implementation properly accounts for this inverted chemistry when validating bases
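
To make the inversion concrete, the sketch below returns the base one would expect to see at a cytosine position under each chemistry. It only covers reads from the C-containing strand, and the function name and `chemistry` labels are illustrative assumptions rather than anything patformm exposes.

```python
def expected_read_base(methylated, chemistry="5base"):
    """Base expected at a reference C in a read from the C-containing
    strand. Illustrative only; names and labels are assumptions."""
    if chemistry == "5base":
        # Illumina 5-base: methylated C reads as T, unmethylated stays C
        return "T" if methylated else "C"
    if chemistry == "bisulfite":
        # Bisulfite: unmethylated C reads as T, methylated stays C
        return "C" if methylated else "T"
    raise ValueError(f"unknown chemistry: {chemistry}")
```

A base check built on this would compare the read base at each called CpG against `expected_read_base(...)` and treat a mismatch as an invalid call.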

CpG Index Mapping

  • Uses a preloaded CpG reference map (CpG.bed.hg38.gz)
  • Maps genomic positions to CpG indices
  • Handles deletions and gaps with '.' notation
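
A minimal sketch of the index lookup follows. It assumes the reference file has three whitespace-separated columns (chromosome, position, CpG index); that layout is an assumption about CpG.bed.hg38.gz, and both function names are hypothetical.

```python
import gzip

def load_cpg_map(path):
    """Load (chrom, position) -> CpG index from a gzipped table.
    Assumes columns: chromosome, position, index."""
    cpg_index = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            chrom, pos, idx = line.split()[:3]
            cpg_index[(chrom, int(pos))] = int(idx)
    return cpg_index

def build_pattern(chrom, calls, cpg_index):
    """calls maps genomic position -> 'C' or 'T'. Returns
    (first_cpg_index, pattern), or None if no call hits a known CpG.
    CpG sites between the first and last call that have no call
    are written as '.'."""
    indexed = {}
    for pos, call in calls.items():
        idx = cpg_index.get((chrom, pos))
        if idx is not None:          # drop positions that are not CpG sites
            indexed[idx] = call
    if not indexed:
        return None
    first, last = min(indexed), max(indexed)
    pattern = "".join(indexed.get(i, ".") for i in range(first, last + 1))
    return first, pattern
```

With CpG sites at chr1 positions 100, 120, 150 and 170 indexed 10 to 13, calls of C at 100, T at 120 and C at 170 give `(10, "CT.C")`. A plain dictionary is used here for clarity; a genome-wide map is large, so a real implementation would likely use a more compact structure.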

Output Format

PAT format with:

chromosome  first_cpg_index  methylation_pattern  count
chr1        1234             CT.C                 5

Where:

  • methylation_pattern: C=methylated, T=unmethylated, .=missing/invalid
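
The count column can be produced by collapsing identical reads. The sketch below assumes tab-separated output ordered by CpG index; the function name `collapse_to_pat` is hypothetical and the ordering is an assumption, not something this README specifies.

```python
from collections import Counter

def collapse_to_pat(records):
    """records: iterable of (chrom, first_cpg_index, pattern) tuples,
    one per read. Returns PAT lines with a count of identical reads.
    Hypothetical helper; sort order is an assumption."""
    counts = Counter(records)
    ordered = sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][2]))
    return [
        f"{chrom}\t{first}\t{pattern}\t{n}"
        for (chrom, first, pattern), n in ordered
    ]
```

Five reads with the record `("chr1", 1234, "CT.C")` collapse into the single line `chr1	1234	CT.C	5`, matching the example row above.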

Performance Features

  • Parallel processing support
  • Chunked file processing
  • Memory-efficient design
  • Temporary file handling in scratch space
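
Chunked processing can be sketched as a generator that hands out fixed-size batches of lines, so only one batch is held in memory at a time. This assumes `--chunk-size` counts input lines, which the README does not state, and `iter_chunks` is a hypothetical name.

```python
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from lines.
    Hypothetical helper; works on any iterable, including open files."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be passed to a worker process and its result written to a temporary file before the final merge.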

Requirements

  • Python 3.8+
  • samtools
  • wgbstools
  • Reference files:
    • CpG.bed.hg38.gz (CpG position index)

Usage Example

# Step 1: Parse MM tags (Biomodal)
patformm parse_mm_tags --threads 8 -o sample.bed sample.bam

# Or parse XM:Z tags (Illumina DRAGEN)
patformm parse_mm_tags --tag-type XM --threads 8 -o sample.bed sample.bam

# Step 2: Calculate CpG positions
patformm calculate_cpos --threads 8 --chunk-size 1000000 -o sample.pat sample.bed

Notes

  • Large BAM files are processed in chunks to manage memory usage
  • Supports multi-threaded processing for improved performance
