Skip to content

[Model] LongestCommonSubsequence #108

@zazabap

Description

@zazabap

Motivation

Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for $k$ strings. Has known reductions to ILP and SAT. Garey & Johnson SR10.

Definition

Name: LongestCommonSubsequence
Reference: Garey & Johnson, 1979 (SR10); Maier, 1978

Given a finite alphabet $\Sigma$ and a set of $k$ strings $R = {s_1, \ldots, s_k}$ over $\Sigma$, find a longest string $w$ that is a subsequence of every $s_i \in R$.

A string $w$ is a subsequence of $s$ if $w$ can be obtained by deleting zero or more characters from $s$ without changing the order of the remaining characters.

Variables

  • Count: $m$ (length of the shortest string in $R$; upper bound on solution length)
  • Per-variable domain: $\Sigma$ (alphabet symbols)
  • Meaning: $w_j$ is the $j$-th character of the common subsequence $w$

Schema (data type)

Type name: LongestCommonSubsequence

Field Type Description
strings Vec<Vec<u8>> The input strings $R = {s_1, \ldots, s_k}$

Problem Size

Metric Expression Description
num_strings $k$ Number of input strings
total_length $\sum_i \lvert s_i \rvert$ Sum of all string lengths

Complexity

  • Two strings: $O(mn)$ via dynamic programming (Wagner & Fischer, 1974)
  • $k$ strings: NP-hard (Maier, 1978)
  • Approximation: Not approximable within $n^{1/4-\varepsilon}$ for any $\varepsilon &gt; 0$; approximable within $O(m / \log m)$ where $m = \min_i |s_i|$

How to solve

  • It can be solved by (existing) bruteforce.
  • It can be solved by reducing to integer programming, through a new LCS → ILP rule issue (to be filed).

Bruteforce: enumerate all $2^m$ subsequences of the shortest string $s_1$, check each against all other strings, return the longest common one.

Example Instance

$k = 3$ strings over alphabet $\Sigma = {A, B, C, D}$:

String Value
$s_1$ ABCDAB
$s_2$ BDCABA
$s_3$ BCADBA

Optimal solution: $w$ = BCAB, length 4.

Verification ($w$ is a subsequence of each $s_i$):

String Matched positions Trace
ABCDAB 1, 2, 4, 5 ABCDAB
BDCABA 0, 2, 3, 4 BDCABA
BCADBA 0, 1, 2, 4 BCADBA

Note: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For $k = 2$ this problem is solvable in $O(mn)$ time, but the $k$-string generalization is NP-hard.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions