A tool for converting methylation information from BAM files with MM tags (Biomodal) or XM:Z tags (Illumina DRAGEN) to PAT format for methylation analysis.
patformm processes BAM files containing methylation information in MM tags (e.g., MM:Z:C+C.,1,23;) or XM:Z tags (Illumina DRAGEN format) and converts them to PAT format through a two-step process:
- parse_mm_tags: Extracts methylation information from BAM files
- calculate_cpos: Processes methylation patterns and converts to CpG indices
Extracts essential information from BAM files:
- Chromosome
- Start position
- CIGAR string
- MM tag (Biomodal) or XM:Z tag (Illumina DRAGEN)
- Sequence
- Strand information (reverse/forward)
# Parse MM tags (default)
patformm parse_mm_tags --threads 8 -o output.bed input.bam
# Parse XM:Z tags (Illumina DRAGEN)
patformm parse_mm_tags --tag-type XM --threads 8 -o output.bed input.bam
Processes the extracted information to generate PAT format:
- Maps read positions to reference positions using CIGAR strings
- Identifies methylated C positions from MM tags or XM:Z tags
- Converts to CpG indices using a reference CpG map
- Handles both forward and reverse strands
- Supports chunked processing for memory efficiency
- Only processes CpG contexts (Z/z in XM:Z tags); CHG/CHH/unknown contexts are ignored
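The CIGAR-based mapping from read positions to reference positions can be sketched as below. This is a minimal illustration, not patformm's actual code: the function name `read_to_ref_positions` and the choice to return `None` for read bases without a reference position are assumptions.

```python
import re
from typing import List, Optional

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def read_to_ref_positions(start: int, cigar: str) -> List[Optional[int]]:
    """Return, for every read base, its reference position or None.

    `start` is the reference position of the first aligned base.
    (Hypothetical helper for illustration only.)
    """
    positions: List[Optional[int]] = []
    ref = start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes both read and reference
            positions.extend(range(ref, ref + n))
            ref += n
        elif op in "IS":     # insertion / soft clip: consumes read only
            positions.extend([None] * n)
        elif op in "DN":     # deletion / skip: consumes reference only
            ref += n
        # H and P consume neither read nor reference
    return positions
```

For example, `read_to_ref_positions(100, "2S3M1D2M1I1M")` leaves the two soft-clipped bases and the inserted base unmapped and skips reference position 103, which is deleted in the read.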
patformm calculate_cpos --threads 8 --chunk-size 1000000 -o output.pat input.bed
- Parses MM tags in format MM:Z:C+C,<positions>
- Tracks methylated C positions in reads
- Handles both forward (C) and reverse (G) strand methylation
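As an illustration of MM tag decoding, the sketch below assumes the delta encoding described in the SAM optional-fields specification, where each number is the count of unmodified C bases to skip before the next modified one. Whether patformm interprets the numbers exactly this way is an assumption, and `decode_mm_positions` is a hypothetical name. The input is the tag value after the `MM:Z:` prefix, and `seq` is the read in its original sequenced orientation.

```python
from typing import List


def decode_mm_positions(seq: str, mm_value: str, base: str = "C") -> List[int]:
    """Return 0-based read offsets of modified bases for the first MM entry.

    Assumes SAM-style skip counts; hypothetical helper for illustration.
    """
    entry = mm_value.split(";")[0]            # e.g. "C+C.,1,23"
    skips = [int(x) for x in entry.split(",")[1:] if x]
    # Offsets of every occurrence of the canonical base in the read.
    base_offsets = [i for i, b in enumerate(seq) if b == base]
    modified: List[int] = []
    idx = -1
    for skip in skips:
        idx += skip + 1                       # skip `skip` bases, land on the next
        if idx >= len(base_offsets):
            break                             # malformed tag: more calls than bases
        modified.append(base_offsets[idx])
    return modified
```

With `seq = "ACGCCTCG"` the C bases sit at offsets 1, 3, 4 and 6, so the value `C+C.,1,1;` selects offsets 3 and 6.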
- Parses XM:Z tags containing byte-per-base methylation strings
- Only processes CpG contexts (Z/z characters)
- Z = methylated CpG, z = unmethylated CpG
- CHG (X/x), CHH (H/h), and unknown (U/u) contexts are ignored
- Handles both forward and reverse strands correctly
- XM:Z string aligns with read sequence (5' to 3' of read)
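The XM:Z rules above amount to a single pass over the string that keeps only `Z` and `z`. A minimal sketch, assuming the calls are reported as read offsets paired with the PAT symbols `C` (methylated) and `T` (unmethylated); the function name is hypothetical:

```python
from typing import List, Tuple


def cpg_calls_from_xm(xm: str) -> List[Tuple[int, str]]:
    """Return (read_offset, state) for each CpG call in an XM:Z string.

    state is 'C' for methylated (Z) and 'T' for unmethylated (z).
    CHG (X/x), CHH (H/h), unknown (U/u) and '.' are skipped.
    """
    calls: List[Tuple[int, str]] = []
    for offset, ch in enumerate(xm):
        if ch == "Z":
            calls.append((offset, "C"))
        elif ch == "z":
            calls.append((offset, "T"))
    return calls
```

The read offsets still need to be translated to reference positions through the CIGAR string before they can be matched against the CpG map.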
Important: Illumina 5-Base Chemistry
- Illumina 5-base methylation uses enzymatic conversion that is OPPOSITE to bisulfite sequencing
- Methylated C (5mC) is converted to T during library prep
- Unmethylated C remains as C (it is not a target of the conversion)
- This is the inverse of bisulfite-seq where unmethylated C → T
- The implementation properly accounts for this inverted chemistry when validating bases
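One way to picture the inverted chemistry is as a lookup of the read base expected at a CpG cytosine. The sketch below is an assumption about how such a check could look, not patformm's validation code. It further assumes that reverse-strand reads are stored in reference orientation, so a C-to-T conversion appears as G-to-A; `expected_read_base` and its `five_base` switch are hypothetical.

```python
def expected_read_base(methylated: bool, reverse: bool, five_base: bool = True) -> str:
    """Return the base expected in the read at a CpG cytosine.

    five_base=True  : 5-base chemistry, methylated C is converted (C -> T).
    five_base=False : bisulfite-style, unmethylated C is converted (C -> T).
    On the reverse strand the same events appear as G (kept) or A (converted).
    """
    converted = methylated if five_base else not methylated
    if reverse:
        return "A" if converted else "G"
    return "T" if converted else "C"
```

A call that disagrees with the expected base (for example a `Z` over a forward-strand `C` under 5-base chemistry) could then be treated as invalid and written as `.`.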
- Uses a preloaded CpG reference map (CpG.bed.hg38.gz)
- Maps genomic positions to CpG indices
- Handles deletions and gaps with '.' notation
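The position-to-index lookup and the `.` gap filling can be sketched with a binary search over sorted CpG positions. This is an illustration under assumptions: the reference map is represented here as two parallel lists for one chromosome, and `build_pattern` is a hypothetical name.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


def build_pattern(cpg_positions: List[int], cpg_indices: List[int],
                  calls: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Return (first_cpg_index, pattern) for one read, or None if no CpG is covered.

    cpg_positions: sorted reference positions of CpGs on one chromosome.
    cpg_indices:   the CpG index of each position (same order).
    calls:         reference position -> 'C' (methylated) or 'T' (unmethylated).
    """
    hits = []
    for pos, state in calls.items():
        i = bisect_left(cpg_positions, pos)
        if i < len(cpg_positions) and cpg_positions[i] == pos:
            hits.append((i, state))           # keep only positions that are CpGs
    if not hits:
        return None
    hits.sort()
    first, last = hits[0][0], hits[-1][0]
    pattern = ["."] * (last - first + 1)      # CpGs without a call stay '.'
    for i, state in hits:
        pattern[i - first] = state
    return cpg_indices[first], "".join(pattern)
```

With CpGs at positions 100, 150, 200 and 260 carrying indices 1234 to 1237, calls at 100 (`C`), 150 (`T`) and 260 (`C`) give `(1234, "CT.C")`: the CpG at 200 has no call and becomes `.`.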
PAT format with:
chromosome first_cpg_index methylation_pattern count
chr1 1234 CT.C 5
Where:
methylation_pattern: C=methylated, T=unmethylated, .=missing/invalid
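The `count` column is the number of reads that share the same chromosome, first CpG index, and pattern. A minimal sketch of that aggregation, assuming tab-separated output and a sort by chromosome name, index, and pattern (both are assumptions; `to_pat_lines` is a hypothetical name):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def to_pat_lines(reads: Iterable[Tuple[str, int, str]]) -> List[str]:
    """Collapse per-read (chrom, first_cpg_index, pattern) tuples into PAT lines."""
    counts = Counter(reads)                   # identical reads are counted once
    return [
        f"{chrom}\t{idx}\t{pattern}\t{n}"
        for (chrom, idx, pattern), n in sorted(counts.items())
    ]
```

Five reads with the tuple `("chr1", 1234, "CT.C")` collapse into a single line with a count of 5.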
- Parallel processing support
- Chunked file processing
- Memory-efficient design
- Temporary file handling in scratch space
- Python 3.8+
- samtools
- wgbstools
- Reference files:
  - CpG.bed.hg38.gz (CpG position index)
# Step 1: Parse MM tags (Biomodal)
patformm parse_mm_tags --threads 8 -o sample.bed sample.bam
# Or parse XM:Z tags (Illumina DRAGEN)
patformm parse_mm_tags --tag-type XM --threads 8 -o sample.bed sample.bam
# Step 2: Calculate CpG positions
patformm calculate_cpos --threads 8 --chunk-size 1000000 -o sample.pat sample.bed
- Large BAM files are processed in chunks to manage memory usage
- Supports multi-threaded processing for improved performance
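The idea behind chunked processing can be sketched with a small generator that yields fixed-size batches of lines, so only one batch is held in memory at a time. Whether `--chunk-size` counts lines, reads, or bytes is not stated here, so treating it as a line count is an assumption, and `iter_chunks` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `chunk_size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return                            # input exhausted
        yield chunk
```

Each yielded chunk could then be handed to a worker, for example through `concurrent.futures.ProcessPoolExecutor.map`, to combine chunking with multi-threaded or multi-process execution.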