Sample Classification tools
This is a tool for classifying a single sample.
These tools classify samples into different overlapping categories by inspecting reads from BAM file alignments at specified targets. The task uses the read scan approach and is able to process unmapped reads. The configuration file specifies one or more loci, each containing one or more targets to be genotyped. A locus essentially consists of one or more regions from which read alignments are extracted for analysis. Targets are nucleotide sequences of any length, and may be concatenated (i.e. the nucleotide sites need not be contiguous; they can be strung together). Each sample is analyzed individually, and the results can then be aggregated. Invoke as follows:
java -Xms512m -Xmx2000m 'org.cggh.bam.sampleClass.SampleClassAnalysis$SingleSample' \
<CONFIG_FILE> <BATCH_ID> <SAMPLE_ID> <BAM_FILE> <REF_FASTA_FILE> <OUT_DIR>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Sample Class configuration file”
- BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
- SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
- BAM_FILE The path of the BAM file for the sample
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- OUT_DIR The path to the folder where the output files will be written to.
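When many samples are processed from a script, the command above can be assembled programmatically. The following is a minimal Python sketch, not part of the tool: the helper name `single_sample_command` and the example file paths are hypothetical, and it assumes the tool's classes are already on the Java classpath.

```python
import shlex

MAIN_CLASS = "org.cggh.bam.sampleClass.SampleClassAnalysis$SingleSample"

def single_sample_command(config, batch_id, sample_id, bam, ref_fasta, out_dir):
    """Return the argument list for a single-sample run (hypothetical helper)."""
    return [
        "java", "-Xms512m", "-Xmx2000m", MAIN_CLASS,
        config, batch_id, sample_id, bam, ref_fasta, out_dir,
    ]

# Example values only; replace them with real paths and identifiers.
cmd = single_sample_command(
    "sampleClass.properties", "MY_BATCH", "PA0001-C",
    "bam/PA0001-C.bam", "ref/genome.fasta", "out",
)

# shlex.join single-quotes the class name because "$" is special to the shell.
print(shlex.join(cmd))
```

If the list is passed directly to `subprocess.run(cmd, check=True)`, no shell is involved and the class name needs no quoting; the single quotes matter only when the command is typed or pasted into a shell.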
This is a tool for classifying multiple samples. Each sample is processed separately in a multithreaded job, and the results are then aggregated, producing the same outputs as the Sample Classification Aggregation Tool (see below). Invoke as follows:
java -Xms512m -Xmx2000m 'org.cggh.bam.sampleClass.SampleClassAnalysis$MultiSample' \
<CONFIG_FILE> <SAMPLE_LIST_FILE> <REF_FASTA_FILE> <OUT_DIR>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Sample Class configuration file”
- SAMPLE_LIST_FILE The path of a tab-separated text file, with one line (record) per sample to be processed. Each record contains three fields:
  - BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
  - SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
  - BAM_FILE The path of the BAM file for the sample
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- OUT_DIR The path to the folder where the output files will be written to.
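The sample list file described above can be generated from a script. The sketch below is a minimal Python example; the function name `write_sample_list` and the BAM paths are hypothetical, and it assumes the file has no header line, since the description mentions only one record per sample.

```python
import csv
import io

def write_sample_list(records, handle):
    """Write (batch_id, sample_id, bam_path) tuples as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for batch_id, sample_id, bam_path in records:
        writer.writerow([batch_id, sample_id, bam_path])

# Example records only; the paths are placeholders.
samples = [
    ("MY_BATCH", "PA0001-C", "/data/bam/PA0001-C.bam"),
    ("MY_BATCH", "PA0002-C", "/data/bam/PA0002-C.bam"),
]

# An in-memory buffer stands in for open("samples.tab", "w", newline="").
buffer = io.StringIO()
write_sample_list(samples, buffer)
print(buffer.getvalue(), end="")
```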
The Sample Classification tools output two tab-separated text files for each sample. These files are placed in an output folder <OUT_DIR>/<SUB>, where <SUB> is a string consisting of the first 4 characters of the sample name. For example, file PA0001-C.classAlleles.tab will be written to folder <OUT_DIR>/PA00. This simple file hashing scheme, which is particularly suited to the MalariaGEN naming scheme, prevents thousands of files from accumulating in a single folder, which may cause indexing problems, especially on Linux.
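Downstream scripts that need to locate a sample's output can reproduce this folder scheme. The following Python sketch does so under the rule stated above (subfolder = first 4 characters of the sample name); the function names are hypothetical, and `PurePosixPath` is used only to keep the printed result the same on every platform.

```python
from pathlib import PurePosixPath

def sample_output_dir(out_dir, sample_name):
    """Return <OUT_DIR>/<SUB>, where <SUB> is the first 4 characters of the name."""
    return PurePosixPath(out_dir) / sample_name[:4]

def class_alleles_path(out_dir, sample_name):
    """Return the expected location of <SAMPLE_NAME>.classAlleles.tab."""
    return sample_output_dir(out_dir, sample_name) / f"{sample_name}.classAlleles.tab"

print(class_alleles_path("results", "PA0001-C"))
# prints: results/PA00/PA0001-C.classAlleles.tab
```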
- <SAMPLE_NAME>.classAlleles.tab contains one line (record) for each distinct class allele at each target. Each record has the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - Locus The name of the target’s locus, as specified in the configuration file
  - Target The name of the target, as specified in the configuration file
  - Allele The name of the target allele (a class or set of classes)
  - Count The number of supporting reads for this allele
  - SampleClass The sample class inferred as a result (e.g. the species)
- <SAMPLE_NAME>.unlistedAlleles.tab contains one line (record) for each allele identified at each locus that was NOT one of the listed class alleles. This file can be informative for discovering new alleles that should be added to the configuration because they are present in numerous samples. Each record has the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - Locus The name of the target’s locus, as specified in the configuration file
  - Target The name of the target, as specified in the configuration file
  - Allele The nucleotide sequence of the target allele (result of concatenation, if this is specified in the configuration file)
  - Count The number of supporting reads for this allele in this sample
  - Proportion The proportion of reads for this allele in this sample
  - Closest The class whose target allele is the closest to the reported unlisted allele (i.e. it has fewest sequence differences)
  - Diff The number of sequence differences between the unlisted allele and the closest class allele
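To illustrate what the Closest and Diff fields express, the Python sketch below finds the listed class allele with the fewest differences from an unlisted allele. It is an illustration, not the tool's implementation: it assumes differences are counted position by position between alleles of equal length, and it breaks ties by keeping the first class found, which the documentation does not specify. The class alleles are those of target species1 in the example configuration; the unlisted allele is made up.

```python
def count_differences(a, b):
    """Count mismatching positions between two equal-length alleles."""
    if len(a) != len(b):
        raise ValueError("alleles must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def closest_class(unlisted, class_alleles):
    """Return (class_label, diff) for the listed allele nearest to `unlisted`."""
    best = None
    for label, alleles in class_alleles.items():
        for allele in alleles:
            diff = count_differences(unlisted, allele)
            # Strict "<" keeps the first class encountered when there is a tie.
            if best is None or diff < best[1]:
                best = (label, diff)
    return best

mito1 = {
    "Pf": ["ATGATTGTCT", "ATGATTGTTT"],
    "Pv": ["TTTATATTAT"],
    "Pm": ["TTGTATTAAT"],
    "Pow": ["ATTTACATAA"],
    "Poc": ["ATTTATATAT"],
    "Pk": ["TTTTTATTAT"],
}

# Hypothetical unlisted allele: one substitution away from the first Pf allele.
print(closest_class("ATGATTGTCA", mito1))
# prints: ('Pf', 1)
```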
The Sample Classification Aggregation Tool uses the files produced by the Sample Classification Tool and aggregates them, producing sampleset-wide outputs. The tool also scans the results for multiple targets, and determines whether there is consensus in assigning a class to the sample.
Invoking the Sample Classification Aggregation tool:
java -Xms512m -Xmx2000m 'org.cggh.bam.sampleClass.SampleClassAnalysis$MergeResults' \
<CONFIG_FILE> <SAMPLE_LIST_FILE> <REF_FASTA_FILE> <OUT_DIR>
Note that the qualified class name HAS to be between single quotes. The parameters are as follows:
- CONFIG_FILE This is a configuration file, defining loci and targets, in Java .properties format. The format is specified in section “Sample Class configuration file”
- SAMPLE_LIST_FILE The path of a tab-separated text file, with one line (record) per sample to be processed (note, only the first column is used here). Each record contains three fields:
  - BATCH_ID The identifier of the batch the sample belongs to; it is just a string (e.g. “34567” or "MY_BATCH") used to organize the output files
  - SAMPLE_ID The identifier of the sample, e.g. “PA0001-C”
  - BAM_FILE The path of the BAM file for the sample
- REF_FASTA_FILE The path to a FASTA file containing the reference sequence used for alignment.
- OUT_DIR The path to the folder where the output files will be written to.
The Sample Classification Aggregation tool outputs several tab-separated text files in output folder <OUT_DIR>:
- AllSamples-AllTargets.classes.tab This is the main output file, which summarizes the class calls for all samples. Each line (record) shows the results for a sample, with the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - SampleClass The class assigned to the sample. If evidence is found for multiple classes, the class labels are reported as a comma-separated list. If a class label is followed by an asterisk, this means that that class is assigned with evidence but not consensus; i.e. there are some targets where the relevant allele was not observed.
  - One column per target, showing the class labels assigned at that locus. If evidence is found for multiple classes, the class labels are reported as a comma-separated list. At some loci, a given allele is assigned to multiple classes (e.g. to P. vivax and P. knowlesi); in this case, the labels are separated by a vertical bar.
- AllSamples-<LOCUS>_<TARGET>.classAlleles.tab There is one of these files produced per target. It shows the counts of reads supporting each class, as observed in each sample at this target. Each record represents one sample, with the following fields:
  - Num A sequential record number (can be ignored)
  - Sample The identifier of the sample, e.g. “PA0001-C”
  - One column per class, containing the number of reads supporting the class in the sample.
  - SampleClass The class assigned to the sample. If evidence is found for multiple classes, the class labels are reported as a comma-separated list. If a class label is followed by an asterisk, this means that that class is assigned with evidence but not consensus; i.e. there are some targets where the relevant allele was not observed.
- AllSamples-<LOCUS>_<TARGET>.unlistedAlleles.tab There is one of these files produced per target. It shows the counts of reads supporting each unlisted allele, as observed in each sample at this target. Each record represents one allele in one sample. This is essentially a concatenation of rows from all the <SAMPLE_NAME>.unlistedAlleles.tab files.
- AllSamples-UnlistedAllelesStats.tab This file summarizes all unlisted alleles found, and can be useful to identify major class markers that are not included in the configuration file. It contains one record per allele at a given target, with the following fields:
  - Num A sequential record number (can be ignored)
  - Locus The name of the target’s locus, as specified in the configuration file
  - Target The name of the target, as specified in the configuration file
  - Allele The nucleotide sequence of the target allele (result of concatenation, if this is specified in the configuration file)
  - SampleCount The number of samples in which reads with this allele were found
  - MaxReadProp The highest proportion of reads containing this allele in any sample
  - MaxReadCount The highest number of reads containing this allele in any sample
  - Closest The class whose target allele is the closest to the reported unlisted allele (i.e. it has fewest sequence differences)
  - Diff The number of sequence differences between the unlisted allele and the closest class allele
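Scripts that post-process AllSamples-AllTargets.classes.tab need to decode the label conventions described above: commas between classes, a trailing asterisk for a class assigned without consensus, and vertical bars where one allele maps to several classes. The Python sketch below is one way to do this. The function names and the example values are hypothetical, and how the tool writes a missing call is not documented here, so the sketch simply treats an empty field as no calls.

```python
def parse_sample_class(field):
    """Split a SampleClass value into (label, has_consensus) pairs."""
    calls = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith("*"):
            # Trailing asterisk: evidence was found, but not at every target.
            calls.append((item[:-1], False))
        else:
            calls.append((item, True))
    return calls

def parse_target_call(field):
    """Split a per-target column into groups of alternative class labels."""
    return [group.strip().split("|") for group in field.split(",") if group.strip()]

# Illustrative values, not taken from real output.
print(parse_sample_class("Pf,Pv*"))
# prints: [('Pf', True), ('Pv', False)]
print(parse_target_call("Pf,Pv|Pk"))
# prints: [['Pf'], ['Pv', 'Pk']]
```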
The configuration file for the Sample Classification tools is a file in Java .properties format (a text-based file of name/value pairs). The following properties are specified:
- sampleClass.genotype.minCallReadCount (integer, default=5) The minimum number of total covering reads required to make a call for a target
- sampleClass.genotype.minAlleleReadCount (integer, default=2) The minimum number of covering reads required to make a call for an allele at a target (e.g. in a het call)
- sampleClass.genotype.minAlleleReadProp (double, default=0.10) The minimum proportion of the total covering reads required to make a call for an allele at a target (e.g. in a het call)
- sampleClass.genotype.minBaseQScore (integer, default=10) The minimum base quality score that must be present at each target position for a given read to be used to call the target's allele
- sampleClass.classes A comma-separated list of class names (e.g. species) that will be used in the analysis
- sampleClass.loci A comma-separated list of locus names that will be used in the analysis
- sampleClass.locus.<LOCUS_NAME>.region The span of the alignment from which mapped reads will be extracted. In sample classification, we can extract reads from more than one region, even on different chromosomes (e.g. if there are alternative alignments for different species). Each region is encoded in the form <chr>:<startPos>-<endPos>; multiple regions are separated by commas.
- sampleClass.locus.<LOCUS_NAME>.analyzeUnmappedReads If set to 'true', the unmapped reads in the BAM will be searched for anchor sequences and remapped. Default (when the property is missing) is 'false'. This has a performance impact, so it should only be set to 'true' at loci thought to be problematic for alignments.
- sampleClass.locus.<LOCUS_NAME>.anchors A comma-separated list of the anchors (regex) that will be used to match the reads. Each anchor is specified in the format <anchorStartPos>@<regex>
- sampleClass.locus.<LOCUS_NAME>.targets A comma-separated list of the targets where genotyped alleles will be matched to the alleles of the classes. Each target is specified in the format <targetName>@<spans>, where <spans> is a sequence of one or more nucleotide position intervals which are concatenated to form the target allele; they are specified as <span1StartPos>-<span1EndPos>&<span2StartPos>-<span2EndPos>&...
- sampleClass.locus.<LOCUS_NAME>.target.<TARGET_NAME>.alleles A comma-separated list of alleles that are matched exactly at the target to assign one or more classes, in the format <classList>@<alleles>, where <classList> is one or more class names separated by vertical bar characters, and <alleles> is one or more alleles (produced by concatenation of target spans) separated by vertical bar characters.
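The three read-count thresholds above interact when a target is called. The Python sketch below shows one plausible reading of them, not the tool's actual logic: it assumes that a target is called only when the total coverage reaches minCallReadCount, that an allele must satisfy both minAlleleReadCount and minAlleleReadProp, and that all comparisons are inclusive. The function name `call_alleles` and the read counts are hypothetical.

```python
def call_alleles(read_counts, min_call_reads=5, min_allele_reads=2, min_allele_prop=0.10):
    """Return the alleles passing the thresholds, or [] if coverage is too low.

    Assumed semantics: inclusive comparisons, both allele thresholds required.
    """
    total = sum(read_counts.values())
    if total == 0 or total < min_call_reads:
        return []
    return sorted(
        allele
        for allele, count in read_counts.items()
        if count >= min_allele_reads and count / total >= min_allele_prop
    )

# Made-up read counts at one target: 47 covering reads in total.
counts = {"ATGATTGTCT": 40, "TTTATATTAT": 6, "TTGTATTAAT": 1}
print(call_alleles(counts))
# prints: ['ATGATTGTCT', 'TTTATATTAT']
```

In this example the third allele is dropped because a single read is below the minimum allele read count, while the second allele is kept because 6 of 47 reads (about 13%) exceeds the 10% minimum proportion.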
The following is an example of a configuration file that will inspect two regions of the mitochondrion to assign up to 6 species to the sample:
sampleClass.classes=Pf,Pv,Pm,Pow,Poc,Pk
sampleClass.loci=mito1,mito2
#
sampleClass.locus.mito1.region=M76611:520-820
sampleClass.locus.mito1.anchors=651@CCTTACGTACTCTAGCT....ACACAA
sampleClass.locus.mito1.targets=species1@668-671&678-683
sampleClass.locus.mito1.target.species1.alleles=Pf@ATGATTGTCT|ATGATTGTTT,Pv@TTTATATTAT,Pm@TTGTATTAAT,Pow@ATTTACATAA,Poc@ATTTATATAT,Pk@TTTTTATTAT
#
sampleClass.locus.mito2.region=M76611:600-900
sampleClass.locus.mito2.anchors=741@GAATAGAA...GAACTCTATAAATAACCA
sampleClass.locus.mito2.targets=species2@728-733&740-740&749-751&770-773
sampleClass.locus.mito2.target.species2.alleles=Pf@GTTCATTTAAGATT|GTTCATTTAAGACT,Pv|Pk@TATTCATAAATACA,Pm@GTTCAATTAGTACT,Pow|Poc@GTTACAATAATATT
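A common mistake when editing such a file is listing an allele whose length does not match the concatenated target spans. The Python sketch below parses the target and allele formats used above and checks the lengths for locus mito2. The function names are hypothetical, and span ends are treated as inclusive, which is inferred from the single-position span 740-740 rather than stated explicitly.

```python
def parse_target(spec):
    """Parse '<targetName>@<start>-<end>&<start>-<end>...' into (name, spans)."""
    name, spans_text = spec.split("@", 1)
    spans = []
    for part in spans_text.split("&"):
        start, end = part.split("-")
        spans.append((int(start), int(end)))
    return name, spans

def target_length(spans):
    """Total number of nucleotides, treating both span ends as inclusive."""
    return sum(end - start + 1 for start, end in spans)

def parse_alleles(value):
    """Parse '<classList>@<alleles>,...' into a list of (classes, alleles)."""
    entries = []
    for entry in value.split(","):
        class_list, alleles = entry.split("@", 1)
        entries.append((class_list.split("|"), alleles.split("|")))
    return entries

name, spans = parse_target("species2@728-733&740-740&749-751&770-773")
entries = parse_alleles(
    "Pf@GTTCATTTAAGATT|GTTCATTTAAGACT,Pv|Pk@TATTCATAAATACA,"
    "Pm@GTTCAATTAGTACT,Pow|Poc@GTTACAATAATATT"
)

expected = target_length(spans)
for classes, alleles in entries:
    for allele in alleles:
        # Every listed allele must be as long as the concatenated spans.
        assert len(allele) == expected, (classes, allele)
print(name, expected)
# prints: species2 14
```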