Smoothq is a sensitive overlap detection program for long, error-prone DNA reads generated by third-generation sequencing technologies such as PacBio SMRT and Oxford Nanopore.
The innovation of Smoothq is to convert q-grams into smooth q-grams via CGK-embedding and resampling, which can capture q-gram pairs with small edit distance. By using smooth q-grams instead of plain q-grams as seeds, Smoothq can sensitively detect overlaps between DNA reads even when the overlap length is small.
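The idea of embedding followed by resampling can be illustrated with a small Python sketch. This is not Smoothq's actual implementation: the function names, the padding symbol, the seeds, and the restriction to the alphabet ACGT are all assumptions made for illustration. The lengths 14, 35 and 16 simply mirror the documented defaults of -q, -e and -m. The embedding follows the usual description of the CGK random walk: the current input character is copied to every output position, and the read pointer advances only when a shared random bit is 1.

```python
import random

ALPHABET = "ACGT"  # assumption: reads contain only these four characters
PAD = "N"          # hypothetical padding symbol used once the input is exhausted


def make_bit_table(out_len, seed):
    """Shared random bits: one bit per (output position, input character)."""
    rng = random.Random(seed)
    return [{c: rng.randint(0, 1) for c in ALPHABET} for _ in range(out_len)]


def cgk_embed(s, bits):
    """Embed s into a string of len(bits) characters with a CGK-style walk."""
    out = []
    i = 0  # read pointer into s
    for row in bits:
        if i < len(s):
            out.append(s[i])
            i += row[s[i]]  # advance only when the shared bit is 1
        else:
            out.append(PAD)
    return "".join(out)


def smooth_qgram(qgram, bits, positions):
    """Embed a q-gram, then keep only the characters at the sampled positions."""
    embedded = cgk_embed(qgram, bits)
    return "".join(embedded[p] for p in positions)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    Q, E, M = 14, 35, 16                      # mirror the -q, -e, -m defaults
    bits = make_bit_table(E, seed=1)          # same table for every q-gram
    positions = sorted(random.Random(2).sample(range(E), M))
    a = "ACGTACGTTGCAAC"
    b = "ACGTACTTGCAACG"                      # a with one deletion, shifted by one
    print(hamming(smooth_qgram(a, bits, positions),
                  smooth_qgram(b, bits, positions)))
```

Because every q-gram is embedded with the same bit table and sampled at the same positions, two q-grams that differ by a few edits tend to fall back into step during the walk, so their smooth q-grams agree on many positions. The printed distance depends on the seeds chosen here.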
Installation:

    git clone git@github.com:FIGOGO/smoothq.git
    cd smoothq
    cmake .
    make

General usage:

    ./smoothq input.fasta [options] > overlap.txt

A small dataset, ecoli1000.fasta, is provided for testing purposes:

    ./smoothq ecoli1000.fasta -t 4 -q 16 > overlap.txt

The sizes of the q-gram, CGK-embedding, and smooth q-gram can be tuned with the following options:
- -q, default = 14: q-gram size
- -e, default = 35: embedding size
- -m, default = 16: smooth q-gram size
Parameters controlling the number of signatures (sensitivity) can be tuned by the following:
- -f, default = 5e-4: filtering threshold
- -c, default = 5: minimum number of signature matches
- -r, default = 0.2: sampling rate
The number of threads used by the algorithm can be tuned by the following:
- -t, default = 16: number of threads
Output format:
- -o, default = 'm4': output format, either 'm4' or 'paf'
Optional input file:
- -i, default = None: second input file, for overlap detection between two datasets
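When Smoothq is run from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming the binary is at ./smoothq and that each flag takes one value as in the usage line. The helper names `build_command` and `run_smoothq` are hypothetical and not part of Smoothq.

```python
import subprocess

# Single-letter flags taken from the option lists: q, e, m, f, c, r, t, o, i.
ALLOWED_FLAGS = set("qemfcrtoi")


def build_command(fasta, binary="./smoothq", **options):
    """Return the argument list for one Smoothq run.

    Keyword names are the flag letters, e.g. build_command("x.fasta", t=4).
    Values are converted with str(); pass a string to control the formatting.
    """
    argv = [binary, fasta]
    for flag, value in options.items():
        if flag not in ALLOWED_FLAGS:
            raise ValueError(f"unknown option: -{flag}")
        argv += [f"-{flag}", str(value)]
    return argv


def run_smoothq(fasta, out_path, **options):
    """Run Smoothq and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(build_command(fasta, **options), stdout=out, check=True)
```

For example, `run_smoothq("ecoli1000.fasta", "overlap.txt", t=4, q=16)` corresponds to the test command shown earlier.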
The default output is similar to BLASR’s M4 format, with the following fields:
- Index of sequence 1 (indices are 1-based)
- Index of sequence 2
- Sequence tags from the original FASTA file, separated by '_'
- Signature density of the overlap
- Direction of sequence 1 (0 = forward, 1 = reversed)
- Overlap start position on sequence 1
- Overlap end position on sequence 1
- Length of sequence 1
- Direction of sequence 2 (0 = forward, 1 = reversed)
- Overlap start position on sequence 2
- Overlap end position on sequence 2
- Length of sequence 2
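The field list above maps naturally onto a small record type. The sketch below assumes that each output line holds exactly these 12 fields in the listed order, separated by whitespace, and that the tag field itself contains no whitespace. None of this is confirmed by the text above, so check a real output line before relying on it. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    index1: int        # 1-based index of sequence 1
    index2: int        # 1-based index of sequence 2
    tags: str          # FASTA tags joined by '_', kept as one raw string
    density: float     # signature density of the overlap
    reversed1: bool    # direction of sequence 1 (0 = forward, 1 = reversed)
    start1: int
    end1: int
    length1: int
    reversed2: bool    # direction of sequence 2 (0 = forward, 1 = reversed)
    start2: int
    end2: int
    length2: int


def parse_overlap(line):
    """Parse one M4-like line (assumed whitespace-separated, 12 fields)."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return Overlap(
        index1=int(f[0]), index2=int(f[1]), tags=f[2], density=float(f[3]),
        reversed1=f[4] == "1", start1=int(f[5]), end1=int(f[6]), length1=int(f[7]),
        reversed2=f[8] == "1", start2=int(f[9]), end2=int(f[10]), length2=int(f[11]),
    )


def read_overlaps(path):
    """Yield Overlap records from a file, skipping blank lines."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_overlap(line)
```

A made-up line such as `1 2 readA_readB 0.35 0 100 900 1000 1 50 850 1200` (not real Smoothq output) would parse into an `Overlap` with `start1 == 100` and `reversed2 == True`.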
Output in PAF format is also available with the option '-o paf'.
Smoothq uses the xxHash package for smooth q-gram hashing.