Skip to content

lorrainea/rrBDA-index

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

114 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rrBDA-index: Text Indexing for Long Patterns

Description

This repository maintains a time- and space-efficient construction algorithm of the rrBDA-index, a new variant of the BDA-index for long patterns introduced by Loukides, Pissis, and Sweering. This new construction relies on an algorithm for computing a randomized counterpart of reduced bd-anchors. It comes in two flavours: a semi-external memory implementation and an internal memory implementation.

Requirements

  • A GNU/Linux system
  • A modern C++17 ready compiler

How to use

INPUT: A file containing the text and a file containing a set of patterns seperated by a new line.

OUTPUT: A file containing the set of patterns and the starting position of their occurrences within the text.

Installation

For the construction that uses internal memory only and 32-bit integers, compile as follows:

cd rrBDA-index_int
./pre-install.sh
make -f Makefile.32-bit.gcc

For the construction that uses both external and internal memory and 32-bit integers, compile as follows:

cd rrBDA-index_ext
./pre-install.sh
make -f Makefile.32-bit.gcc

In the same directories, you can find makefiles for 64-bit integers.

Usage

./rrbda-index_int <text_file> <ell> <pattern_file> <block_size> <output_filename> <index_filename>
./rrbda-index_ext <text_file> <ell> <pattern_file> <block_size> <ram_use> <output_filename> <index_filename>

<text_file> - name of input text file.
<ell> - lower bound on the length of input patterns to consider. 
<pattern_file> - name of input file containing the patterns.
<block_size> - size of block to use for constructing the bd-anchors (bytes).
<ram_use> - RAM usage for external SA and LCP array construction (MiB).
<output_filename> - name of output file, where pattern occurrences will be output.
<index_filename> - name of the index file to be used (if it exists) otherwise to be created.

Examples

For the construction that uses internal memory only, run as follows:

 $ ./rrbda-index_int ./data/text 3 ./data/patterns 10 out index

For the construction that uses both external and internal memory, run as follows:

 $ ./rrbda-index_ext ./data/text 3 ./data/patterns 10 1024 out index

Note that the exact same index is constructed in the end.

Evaluation

We have conducted an extensive evaluation of different text indexes. We give a teaser below using the whole human genome (version GRCh38) and a set of 500K patterns (randomly selected from the text).

The plot below depicts the indexes size for growing , the lower bound on the supported pattern lengths.

size

The plot below depicts the average pattern matching time of the indexes for growing pattern length |P|.

query_time

Citation

Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. 2024. Text indexing for long patterns using locally consistent anchors. The VLDB Journal [doi].

License

GNU GPLv3 License; Copyright (C) 2024 Lorraine A. K. Ayad, Grigorios Loukides and Solon P. Pissis.

About

Text Indexing for Long Patterns

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors