Skip to content

deepalivasoya/MHCtyping_PacBio

Repository files navigation

MHC Typing Pipeline for PacBio Data

Our initial research into livestock major histocompatibility complex (MHC) diversity involved using the MiSeq platform to sequence hypervariable regions (exons 2 and 3) of MHCI genes. This high-throughput method allowed us to analyze thousands of animals and identify numerous new alleles. However, these short reads didn't provide the complete gene sequences necessary for full functional analysis, protein expression studies, and proper allele nomenclature.

To overcome this limitation, we're now using long-read sequencing, specifically the PacBio platform, to obtain near full-length MHCI genes. This approach provides greater accuracy and is well-suited for resolving the complex and highly polymorphic nature of MHC genes. Our strategy combines our initial MiSeq data with PacBio sequencing in a hybrid approach: MiSeq data is used to screen large cohorts and identify unique haplotypes, allowing us to then focus the more resource-intensive PacBio sequencing on a smaller, more targeted set of individuals. This method ensures we get the most accurate, full-length sequences while remaining cost-effective.

This repository includes the Snakemake pipeline to analyse PacBio data for MHCI genes.

Prerequisites

Before running the pipeline, you need to have the following software installed:

  • Snakemake: A workflow management system.
  • Python 3: With the following libraries:
    • Biopython
    • pandas
    • numpy
  • Cutadapt: A tool for trimming adapters and primers.
  • BLAST+: NCBI's Basic Local Alignment Search Tool.
  • Clustal Omega (clustalo): A multiple sequence alignment program.

Alternatively, you can use the environment.yml file to create and activate a Conda environment before running the analysis.

Setup and Configuration

Directory Structure

Create a main folder with the project name and organise the sub-folders as following:

  • reads: It should have fasta file of PacBio sequences (after quality check) for each sample.
  • fasta:
    • This directory should contain reference databases for your species (e.g., Bovine.MHCI.fasta) which should be indexed for blast. To index the database, use this command: makeblastdb -in <database.fa> -dbtype nucl
    • It should have one sequence in fasta format that is full coding sequence used for ORF extraction.
  • results: This directory will have all output files generated by the pipeline
  • scripts: All the python scripts that will be used by Snakemake and other steps will be here
  • config.yaml: It should be in your main project folder. Check out the attached example config file. You can modify it according to the data you are analysing.
    • Make sure you don’t use sample names that start with numeric digits. Sample names must start with alphabets. If sample names start with digits, the simple trick is to put an alphabet in front of all samples. That will work.
  • Snakefile: It should be in your main project folder.

Configuration File

The pipeline's behavior is controlled by a YAML configuration file. Several example files (config_cattle.yaml, config_equine.yaml, config_sheep.yaml) are provided.

A typical configuration file includes the following parameters:

  • gene_primer: Forward primer sequence.
  • gene_primer_rev_comp: Reverse complement of the gene primer.
  • smart_primer: SMART primer sequence.
  • umi_seq: Unique molecular identifier sequence.
  • length_sd: A numerical threshold for filtering sequences based on their length.
  • miseq_database: Path to the species-specific MHC database.
  • samples: A list of sample identifiers.
  • reads: A dictionary mapping sample IDs to their corresponding input FASTA files (required for config_cattle.yaml).

You must modify the Snakefile to specify the correct configuration file for your analysis.

Pipeline Workflow

The pipeline is managed by a Snakefile and consists of the following main steps:

1. Primer and Adapter Trimming

  • trim_gene_primer: Uses Cutadapt to trim gene-specific primers and their reverse complements.
  • flip_reads: Identifies and reverse-complements reads that were not trimmed in the correct orientation.
  • trim_smart_primer: Trims the SMART primer sequence.
  • trim_umi: Trims the UMI sequence from the reads.

2. ORF Extraction and Filtering

  • get_full_CDS: Uses BLAST to align reads against a fullCDS database and extracts the full open reading frame (ORF).
  • filter_orf: Filters sequences based on their length, keeping only those within a certain range relative to the mean length.

3. Clustering and Comparison

  • cluster: Groups sequences into clusters.
  • mapping: Aligns the clustered sequences against a species-specific MHC database using BLAST.
  • pairwise: Performs a pairwise comparison of clusters to calculate sequence identity.
  • cluster_comparison: Separates clusters into "parents" and "children" and generates consensus sequences. This rule also creates a shell script (consensus.sh) to perform multiple sequence alignments and BLAST comparisons of child/singleton consensus sequences against parent clusters.

Usage

  1. Set the configuration file: In the Snakefile, change the configfile variable to point to your desired YAML file.
    configfile: "config_sheep.yaml"

  2. Run the pipeline: Navigate to the top-level directory and execute the following command:
    snakemake --cores <number_of_cores>

    The pipeline will generate all output files in a results/ directory.

Example: Running with Cattle Data

If your data corresponds to the config_cattle.yaml file, you would ensure that your Snakefile is configured correctly and your input files are in the fastq/ directory. The pipeline will process the samples listed in the samples and reads sections of the config_cattle.yaml file.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages