Classic string-matching and string-analysis algorithms in pure Standard ML,
operating over string treated as a sequence of bytes — Knuth-Morris-Pratt,
the Z-function, Boyer-Moore, Rabin-Karp, Aho-Corasick multi-pattern search,
and Manacher's longest-palindrome algorithm.
No dependencies, no FFI, no threads, no clock, no randomness: the same inputs
always produce the same outputs under MLton and Poly/ML. All indices
are 0-based; overlapping occurrences are reported in full (e.g. "aa" in
"aaaa" gives [0,1,2]).
kmp/boyerMoore/rabinKarp— three different exact single-pattern matchers that always return the identical list of start positions.z— the Z-function (longest common prefix with each suffix).ahoCorasick— simultaneous search for many patterns via the Aho-Corasick automaton, returning(patternIndex, startPos)matches.manacher— a longest palindromic substring plus the per-center radii.
structure Stringalgo : sig
val kmp : string -> string -> int list (* all match start positions *)
val z : string -> int list (* z[i]=lcp(s, s[i..]); z[0]=|s| *)
val boyerMoore : string -> string -> int list (* same results as kmp *)
val rabinKarp : string -> string -> int list (* same results as kmp *)
(* (patternIndex, startPos), sorted by (startPos, patternIndex); empties ignored *)
val ahoCorasick : string list -> string -> (int * int) list
(* radii has length 2|text|+1; longest is the earliest longest palindrome *)
val manacher : string -> { longest : string, start : int, length : int,
radii : int list }
endThe empty pattern matches at every position 0..|text| inclusive, consistently
across kmp, boyerMoore and rabinKarp. boyerMoore uses the bad-character
and strong good-suffix heuristics; rabinKarp uses a polynomial rolling hash
with every hash hit verified by direct comparison, so the byte-level results
are identical on both compilers even though the internal hash width differs.
val [0,9,12] = Stringalgo.kmp "AABA" "AABAACAADAABAABA"
val [0,1,2] = Stringalgo.boyerMoore "aa" "aaaa" (* overlapping *)
val true = (Stringalgo.kmp "ana" "banana" = Stringalgo.rabinKarp "ana" "banana")
val [7,2,1,0,2,1,0] = Stringalgo.z "aaabaab"
(* she @1, he @2, hers @2 (indices into ["he","she","his","hers"]) *)
val [(1,1),(0,2),(3,2)] = Stringalgo.ahoCorasick ["he","she","his","hers"] "ushers"
val r = Stringalgo.manacher "babad"
val ("bab", 0, 3) = (#longest r, #start r, #length r)Running examples/demo.sml with make example prints:
Exact matching of "ana" in "bananaananana":
kmp = [1,3,6,8,10]
boyerMoore = [1,3,6,8,10]
rabinKarp = [1,3,6,8,10]
Z-function of "aaabaab":
z = [7,2,1,0,2,1,0]
Aho-Corasick ["he","she","his","hers"] over "ushers":
(patternIndex, startPos) = [(1,1),(0,2),(3,2)]
Manacher longest palindromic substring:
babad -> "bab" (start 0, length 3)
cbbd -> "bb" (start 1, length 2)
forgeeksskeegfor -> "geeksskeeg" (start 3, length 10)
Requires MLton and/or Poly/ML.
make test # build + run the suite under MLton
make test-poly # run the suite under Poly/ML
make all-tests # both
make example # build + run the demo
make cleansmlpkg add github.com/sjqtentacles/sml-stringalgo
smlpkg syncReference lib/github.com/sjqtentacles/sml-stringalgo/stringalgo.mlb from your
own .mlb (MLton / MLKit), or feed sources.mlb to tools/polybuild
(Poly/ML).
sml.pkg smlpkg manifest
Makefile MLton + Poly/ML targets
.github/workflows/ci.yml CI: MLton + Poly/ML
lib/github.com/sjqtentacles/sml-stringalgo/
stringalgo.sig STRINGALGO signature
stringalgo.sml KMP/Z/Boyer-Moore/Rabin-Karp/Aho-Corasick/Manacher
sources.mlb ordered source list
stringalgo.mlb public basis
examples/
demo.sml matching + Aho-Corasick + Manacher walkthrough
test/
harness.sml shared assertion harness
test.sml CLRS/GfG/CP reference vectors + cross-checks (112 checks)
entry.sml / main.sml
tools/polybuild Poly/ML build wrapper
112 deterministic checks. Reference vectors: "AABA" in "AABAACAADAABAABA"
→ [0,9,12], overlapping "aa" in "aaaa" → [0,1,2]; the Z-function of
"aaabaab" → [7,2,1,0,2,1,0]; Aho-Corasick of ["he","she","his","hers"]
over "ushers" finding she@1, he@2, hers@2; Manacher on "babad" → "bab",
"cbbd" → "bb", "forgeeksskeegfor" → "geeksskeeg". The three exact
matchers (kmp, boyerMoore, rabinKarp) are cross-checked against a naive
scanner and against each other on many inputs (including embedded NULs and
high bytes), Aho-Corasick is cross-checked per-pattern against the naive
scanner, and Manacher's reported length is checked against its radii. Run
make all-tests to verify identical output under both compilers.
MIT. See LICENSE.