TABLE OF CONTENTS:
- Introduction
- Using Theseus
- Tools
- Heuristics
- Datasets
- Reporting Bugs and Feature Request
- License
- Authors
Theseus is a fast, optimal and affine-gap Sequence-to-Graph aligner. It leverages the expected high similarity in the aligned data to accelerate computation and reduce the search space compared to other alternatives. Theseus is a general purpose aligner, providing two main functionalities:
- Multiple Sequence Alignment (MSA): Theseus performs MSA of a set of N sequences using the Partial Order Alignment (POA) approach. That is, it progressively builds a partial order graph representing an MSA of a set of given sequences, adding one more sequence to the graph at each iteration.
- Alignment to a reference graph: Alternatively, Theseus can align one or several sequences to a reference graph. The user has to provide an initial position for the alignment to start.
Theseus extends the proposal from the Wavefront Alignment algorithm (WFA), originally devised for pairwise sequence alignment, to the context of sequence-to-graph alignment.
Git clone and compile the library, tools, and examples (by default, use cmake for the library and benchmark build).
git clone https://github.com/albertjimenezbl/theseus-lib
cd theseus-lib
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
This example illustrates how to use Theseus as a Multiple Sequence Aligner. First, include the msa and general headers:
#include "theseus/alignment.h"
#include "theseus/heuristics.h"
#include "theseus/penalties.h"
#include "theseus/theseus_msa_aligner.h"
Then, create and configure a MSA aligner object. This object is defined by three parameters: a set of penalties, a heuristics object, and an initial sequence behaving as the starting point of the alignment. An example on how to set such parameters and create an MSA object is found in the next code snippet:
theseus::Penalties penalties(match, mismatch, gap_open, gap_extend);
theseus::Heuristics heuristics(use_lag_pruning, use_density_drop);
theseus::TheseusMSA aligner(penalties, heuristics, initial_sequence);
Once this is done, we can start adding sequences to our POA graph using the align functionality, that returns an Alignment object with CIGAR, path, and score information:
theseus::Alignment alignment_object = aligner.align(sequence);
Each time a new sequence is added to the POA graph, the graph is updated with the newly found variation (all the insertions, deletions and mismatches of the resulting alignment object).
Finally, we can output the result in four different formats: a graph in .gfa format, a multiple sequence alignment, a consensus sequence, and a graph in .dot format:
aligner.print_as_gfa(output_file); // Output the compacted POA graph in .gfa format
aligner.print_as_msa(output_file); // Output as a Multiple Sequence Alignment
sequence = aligner.get_consensus_sequence(); // Find the consensus sequence of the alignment
aligner.print_as_dot(output_file); // Output the compacted POA graph in .dot format
This example illustrates how to use Theseus as a general Sequence-to-Graph Aligner. First, include the alignment and general headers:
#include "theseus/alignment.h"
#include "theseus/graph.h"
#include "theseus/penalties.h"
#include "theseus/heuristics.h"
#include "theseus/theseus_aligner.h"
Then, create and configure an aligner object. This object is defined by three parameters: a set of penalties, a heuristics object, and a reference graph in Theseus' internal graph format. An example on how to set such parameters and create an aligner is found in the next code snippet:
theseus::Penalties penalties(match, mismatch, gap_open, gap_extend);
theseus::Heuristics heuristics(use_lag_pruning, use_density_drop);
theseus::TheseusAligner aligner(penalties, heuristics, std::move(graph));
Once this is done, we can start aligning sequences to the reference graph using the align functionality. Importantly, a call to the align function consists of four arguments: the sequence to be aligned, the starting vertex for the alignment, the starting offset in that starting vertex, and a boolean variable indicating the direction of alignment. The result of the alignment is an Alignment object with CIGAR, path and score information:
theseus::Alignment alignment_object = aligner.align(sequence, start_vertex, start_offset, align_reverse);
The Theseus' library, allows you to create your own reference graphs to perform sequence-to-graph alignment. A graph is composed of two key elements: nodes and edges. Nodes store genomics' data in the form of a sequence of characters, and edges represent connections between these existing nodes. If you want to create a graph, you first have to include the "theseus/graph.h" header file:
#include "theseus/graph.h"
Then, you have to assign it a name upon creation:
theseus::Graph graph;
You can add nodes using the "add_node(string)" functionality, that always returns a node identifier (NodeId) that will be necessary to create edges. For instance, if you want to create a node storing the sequence "ACTTAG", you should:
theseus::Graph::NodeId id_node1 = graph.add_node("ACTTAG");
Now, if you want to create an edge connecting two existing nodes, you have to provide the two node ids and use the "add_edge(id_node1, id_node2)" functionality:
graph.add_edge(id_node1, id_node2);
The Theseus library implements two minimal tools to use the Theseus algorithm on the MSA and Sequence-to-Graph alignment problems. It is important to note that these tools are not production ready.
This example illustrates how to use the theseus_msa tool. This tool computes the MSA of the set of sequences in an given input .fasta file. The executable is located in the path /build/tools/theseus_msa:
cd build/tools/
Select the scoring scheme, set the input and output files and execute the tool. Each execution of theseus_msa lets you select the following parameters:
Usage: theseus_msa [OPTIONS]
Options:\n"
-m, --match <int> The match penalty [default=0]
-x, --mismatch <int> The mismatch penalty [default=2]
-o, --gapo <int> The gap open penalty [default=3]
-e, --gape <int> The gap extension penalty [default=1]
-t, --output_type <int> The output format of the multiple alignment [default=0=MSA]
0: MSA: Standard Multiple Sequence Alignment format,
1: GFA: Output the resulting POA graph in GFA format,
2: Consensus: Output the consensus sequence,
3: Dot: Output in .dot format for visualization purposes.
Only tractable for small graphs
-f, --output <file> Output file [Required]
-s, --sequences <file> Dataset file [Required]"
Heuristics:\n"
-d --density_heuristic Activate the drop heuristic based on advancement density.
-l --lag_pruning Activate the pruning of diagonals lagging behind in the alignment.
An example of the execution of theseus_msa is shown in the following piece of code
./theseus_msa -m 0 -x 2 -o 3 -e 1 -t 0 -f output_file.out -s sequences.fasta
This example illustrates how to use the theseus_aligner tool. This tool aligns a set of sequences, given their starting vertices and offsets, to a reference graph. Two input files are required: 1) The reference graph in .gfa format, 2) The sequences to be aligned and the starting alignment positions in .fasta format.
[IMPORTANT] The .fasta file containing sequence and positional data has a special structure. As all .fasta files, the data associated to each sequence has two parts: 1) A line starting with ">" containing metadata, and 2) the sequence itself, that appears on the next lines.
- Constists of four elements: start_vertex name, start_offset, orientation of the alignment, and direction of alignment. The orientation of the alignment can either be "+" or "-" and it indicates Theseus whether to align from the forward or reverse strand of the given start node. The direction of alignment has to be 0 or 1 and it indicates in which direction (left to right or right to left) do you want to perform your alignment. [IMPORTANT], the start_offset value is respect the starting node in the given alignment orientation. That is, if alignment orientation is reverse ("-") and the offset is 0, alignment starts at the beggining of the reverse complemented segment.
- Contains the sequence itself.
For instance, a sequence ACGT starting at vertex v1, offset 3, in the forward orientation +, and aligned from left to right would be codified as:
> v1 3 + 0
ACGT
> ... (following lines)
The executable is located in the path /build/tools/theseus_aligner:
cd build/tools/
Select the scoring scheme, set the input and output files and execute the tool. Each execution of theseus_aligner lets you select the following parameters:
"Usage: theseus_aligner [OPTIONS]\n"
Options:\n"
Penalties:\n"
-m, --match <int> The match penalty [default=0]\n"
-x, --mismatch <int> The mismatch penalty [default=2]\n"
-o, --gapo <int> The gap open penalty [default=3]\n"
-e, --gape <int> The gap extension penalty [default=1]\n\n"
I/O:\n"
-g, --graph_file <file> Graph file in .gfa format [Required]\n"
-s, --sequences_file <file> Sequences and starting positons in .fasta format [Required]\n"
-f, --output_file <file> Output file [Required]\n\n"
Heuristics:\n"
-d --density_heuristic Activate the drop heuristic based on advancement density. \n"
-l --lag_pruning Activate the pruning of diagonals lagging behind int the alignment. \n";
An example of the execution of theseus_aligner is shown in the following piece of code
./theseus_aligner -m 0 -x 2 -o 3 -e 1 -g reference_graph.gfa -s sequences.fasta -f output.out
Theseus library implements some heuristic approaches that accelerate alignment at the expense of a limited loss in accuracy. In particular, Theseus implements 1) a pruning heuristic that discards diagonals that have fallen behind in the alignment, as long as the alignment has shown a significant advancement in the last scores, and 2) a drop heuristic that drops alignment when the advancement density (number of offsets advanced in the last scores) is very low.
You can activate these heuristics by initializing the Heuristics object with two separate boolean flags:
theseus::Heuristics heuristics(use_lag_pruning, use_density_drop);
Moreover, the three minimal tools provided in this alignment library allow you to activate these two heuristics from the command line, by adding the -d and -l flags:
./theseus_aligner -m 0 -x 2 -o 3 -e 1 -g reference_graph.gfa -s sequences.fasta -f output.out -d -l
The datasets used in our paper are available in Zenodo. Link to datasets
Feedback and bug reporting is highly appreciated. Please report any issue or suggestion on github or email to the main developer (albert.jimenez1@bsc.es). Don't hesitate to contact us if:
- You experience any bug or crash.
- You want to request a feature or have any suggestion.
- Your application using the library is running slower than it should or you expected.
- Need help integrating the library into your tool.
Theseus-lib is distributed under MIT licence.
Albert Jimenez Blanco (albert.jimenez1@bsc.es) is the main developer and the person you should address your complaints.
Lorién López-Villellas has had major contributions both in the technical implementation of Theseus and the final structure of the library.
