Skip to content

A Lua/LuaJIT string similarity library using edit distance and similarity algorithms.

License

Notifications You must be signed in to change notification settings

jkeresman01/strsim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

strsim

A string similarity library for Lua/LuaJIT

A string similarity library providing edit distance and similarity algorithms.

Lua Lua Lua

About

Note

Still very WIP

A string similarity library for Lua/LuaJIT providing edit distance and similarity algorithms.

Check this out:

Installation

Note

Library will be deployed to luarocks at some point

Currently you will have to clone the repository manaully

git clone https://github.com/yourusername/strsim.git

Tip

Or add as a dependency in your Lua path.

Usage

local strsim = require("init")

local dist = strsim.distance("kitten", "sitting")  -- 3
local sim = strsim.similarity("hello", "hallo")    -- 0.8

-- Use specific algorithm
local dist = strsim.distance("hello", "hallo", strsim.Algorithm.LEVENSHTEIN)

Note

If no algorithm is selected we will default to LEVENSHTEIN

Direct algorithm usage

local levenshtein = require("algorithms.levenshtein")

local dist = levenshtein.distance("kitten", "sitting")  -- 3
local sim = levenshtein.similarity("hello", "hallo")    -- 0.8

Algorithms

Note

Only LEVENSHTEIN distance is supported currently

LEVENSHTEIN - Levenshtein distance (insertions, deletions, substitutions)

Levenshtein distance is a string metric for measuring the difference between two sequences.

Check this out: https://en.wikipedia.org/wiki/Levenshtein_distance

Optimizations applied

  • Single-row DP (O(min(m,n)) space instead of O(m*n))
  • Early termination for equal strings
  • Stripping common prefix/suffix
  • Swapping strings to minimize memory usage
  • Avoiding table.insert, using direct indexing
  • Using byte access instead of sub() for single chars

Lua JIT specific optimizations applied:

  • Using local variables for hot paths (LuaJIT optimization)

DAMERAU_LEVENSHTEIN - Levenshtein + transpositions

Note

Planned

OSA - Optimal String Alignment

Note

Planned

JARO - Jaro similarity

Note

Planned

JARO_WINKLER - Jaro-Winkler similarity

Note

Planned

Hamming - Hamming distance (same-length strings only)

Note

Planned

LCS - Longest Common Subsequence

Note

Planned