-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Motivation
Foundational string problem underlying diff/version control, bioinformatics (DNA/protein alignment), and data compression. Poly-time for 2 strings via DP, but NP-hard for
Definition
Name: LongestCommonSubsequence
Reference: Garey & Johnson, 1979 (SR10); Maier, 1978
Given a finite alphabet
A string
Variables
-
Count:
$m$ (length of the shortest string in$R$ ; upper bound on solution length) -
Per-variable domain:
$\Sigma$ (alphabet symbols) -
Meaning:
$w_j$ is the$j$ -th character of the common subsequence$w$
Schema (data type)
Type name: LongestCommonSubsequence
| Field | Type | Description |
|---|---|---|
| strings | Vec<Vec<u8>> |
The input strings |
Problem Size
| Metric | Expression | Description |
|---|---|---|
| num_strings | Number of input strings | |
| total_length | Sum of all string lengths |
Complexity
-
Two strings:
$O(mn)$ via dynamic programming (Wagner & Fischer, 1974) -
$k$ strings: NP-hard (Maier, 1978) -
Approximation: Not approximable within
$n^{1/4-\varepsilon}$ for any$\varepsilon > 0$ ; approximable within$O(m / \log m)$ where$m = \min_i |s_i|$
How to solve
- It can be solved by (existing) bruteforce.
- It can be solved by reducing to integer programming, through a new LCS → ILP rule issue (to be filed).
Bruteforce: enumerate all
Example Instance
| String | Value |
|---|---|
ABCDAB |
|
BDCABA |
|
BCADBA |
Optimal solution: BCAB, length 4.
Verification (
| String | Matched positions | Trace |
|---|---|---|
ABCDAB |
1, 2, 4, 5 | ABCDAB |
BDCABA |
0, 2, 3, 4 | BDCABA |
BCADBA |
0, 1, 2, 4 | BCADBA |
Note: pairwise LCS lengths are 4, 4, and 5 respectively — the 3-string LCS is harder than any pairwise LCS. For