Skip to content

sjqtentacles/sml-stringalgo

Repository files navigation

sml-stringalgo

CI

Classic string-matching and string-analysis algorithms in pure Standard ML, operating over string treated as a sequence of bytes — Knuth-Morris-Pratt, the Z-function, Boyer-Moore, Rabin-Karp, Aho-Corasick multi-pattern search, and Manacher's longest-palindrome algorithm.

No dependencies, no FFI, no threads, no clock, no randomness: the same inputs always produce the same outputs under MLton and Poly/ML. All indices are 0-based; overlapping occurrences are reported in full (e.g. "aa" in "aaaa" gives [0,1,2]).

  • kmp / boyerMoore / rabinKarp — three different exact single-pattern matchers that always return the identical list of start positions.
  • z — the Z-function (longest common prefix with each suffix).
  • ahoCorasick — simultaneous search for many patterns via the Aho-Corasick automaton, returning (patternIndex, startPos) matches.
  • manacher — a longest palindromic substring plus the per-center radii.

API

structure Stringalgo : sig
  val kmp        : string -> string -> int list   (* all match start positions *)
  val z          : string -> int list             (* z[i]=lcp(s, s[i..]); z[0]=|s| *)
  val boyerMoore : string -> string -> int list   (* same results as kmp *)
  val rabinKarp  : string -> string -> int list   (* same results as kmp *)
  (* (patternIndex, startPos), sorted by (startPos, patternIndex); empties ignored *)
  val ahoCorasick : string list -> string -> (int * int) list
  (* radii has length 2|text|+1; longest is the earliest longest palindrome *)
  val manacher : string -> { longest : string, start : int, length : int,
                             radii : int list }
end

The empty pattern matches at every position 0..|text| inclusive, consistently across kmp, boyerMoore and rabinKarp. boyerMoore uses the bad-character and strong good-suffix heuristics; rabinKarp uses a polynomial rolling hash with every hash hit verified by direct comparison, so the byte-level results are identical on both compilers even though the internal hash width differs.

Example

val [0,9,12] = Stringalgo.kmp "AABA" "AABAACAADAABAABA"
val [0,1,2]  = Stringalgo.boyerMoore "aa" "aaaa"      (* overlapping *)
val true     = (Stringalgo.kmp "ana" "banana" = Stringalgo.rabinKarp "ana" "banana")
val [7,2,1,0,2,1,0] = Stringalgo.z "aaabaab"

(* she @1, he @2, hers @2 (indices into ["he","she","his","hers"]) *)
val [(1,1),(0,2),(3,2)] = Stringalgo.ahoCorasick ["he","she","his","hers"] "ushers"

val r = Stringalgo.manacher "babad"
val ("bab", 0, 3) = (#longest r, #start r, #length r)

Running examples/demo.sml with make example prints:

Exact matching of "ana" in "bananaananana":
  kmp        = [1,3,6,8,10]
  boyerMoore = [1,3,6,8,10]
  rabinKarp  = [1,3,6,8,10]

Z-function of "aaabaab":
  z = [7,2,1,0,2,1,0]

Aho-Corasick ["he","she","his","hers"] over "ushers":
  (patternIndex, startPos) = [(1,1),(0,2),(3,2)]

Manacher longest palindromic substring:
  babad -> "bab" (start 0, length 3)
  cbbd -> "bb" (start 1, length 2)
  forgeeksskeegfor -> "geeksskeeg" (start 3, length 10)

Build & test

Requires MLton and/or Poly/ML.

make test        # build + run the suite under MLton
make test-poly   # run the suite under Poly/ML
make all-tests   # both
make example     # build + run the demo
make clean

Installing with smlpkg

smlpkg add github.com/sjqtentacles/sml-stringalgo
smlpkg sync

Reference lib/github.com/sjqtentacles/sml-stringalgo/stringalgo.mlb from your own .mlb (MLton / MLKit), or feed sources.mlb to tools/polybuild (Poly/ML).

Layout

sml.pkg                                       smlpkg manifest
Makefile                                      MLton + Poly/ML targets
.github/workflows/ci.yml                      CI: MLton + Poly/ML
lib/github.com/sjqtentacles/sml-stringalgo/
  stringalgo.sig   STRINGALGO signature
  stringalgo.sml   KMP/Z/Boyer-Moore/Rabin-Karp/Aho-Corasick/Manacher
  sources.mlb      ordered source list
  stringalgo.mlb   public basis
examples/
  demo.sml         matching + Aho-Corasick + Manacher walkthrough
test/
  harness.sml      shared assertion harness
  test.sml         CLRS/GfG/CP reference vectors + cross-checks (112 checks)
  entry.sml / main.sml
tools/polybuild    Poly/ML build wrapper

Tests

112 deterministic checks. Reference vectors: "AABA" in "AABAACAADAABAABA"[0,9,12], overlapping "aa" in "aaaa"[0,1,2]; the Z-function of "aaabaab"[7,2,1,0,2,1,0]; Aho-Corasick of ["he","she","his","hers"] over "ushers" finding she@1, he@2, hers@2; Manacher on "babad""bab", "cbbd""bb", "forgeeksskeegfor""geeksskeeg". The three exact matchers (kmp, boyerMoore, rabinKarp) are cross-checked against a naive scanner and against each other on many inputs (including embedded NULs and high bytes), Aho-Corasick is cross-checked per-pattern against the naive scanner, and Manacher's reported length is checked against its radii. Run make all-tests to verify identical output under both compilers.

License

MIT. See LICENSE.

About

Classic string matching & analysis in pure Standard ML: KMP, Z-function, Boyer-Moore, Rabin-Karp, Aho-Corasick, Manacher (MLton + Poly/ML)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors