Merged
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.7.1] - 2026-06-02

### Added

- **Syllable-span probe (opt-in neural enhancement).** A frozen-encoder probe that improves detection recall on broken-compound, over-segmentation, and consonant-substitution errors. Three strategies share one small model (~430 MB) with a single inference per sentence:
  - `ProbeBoostedCompoundStrategy` (priority 24): combines neural span scoring with dictionary lookup to flag whitespace-broken compounds.
  - `ProbeSegmenterRescueStrategy` (priority 26): rescues typos that the segmenter splits into adjacent dictionary-valid tokens by scoring the boundary and recovering a high-frequency merged-form candidate.
  - `ProbeValidationStrategy` (priority 85): a compact frozen-architecture replacement for the legacy token-classification corrector.
- **Probe configuration.** Opt-in via `use_probe_corrector`, `use_probe_compound`, and `use_probe_segmenter_rescue` config flags (or the `MSC_USE_PROBE_*` environment variables), with tunable confidence thresholds and dictionary-frequency floors. All default off.
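
To make the interplay of a confidence threshold and a dictionary-frequency floor concrete, here is a minimal sketch of a two-gate check for the broken-compound case. The function name `flag_broken_compound`, the plain `dict` standing in for the dictionary, and both default values are assumptions made for illustration; they are not the library's API or its shipped defaults.

```python
from typing import Dict, Optional


def flag_broken_compound(
    left: str,
    right: str,
    boundary_score: float,
    word_freq: Dict[str, int],
    threshold: float = 0.5,  # hypothetical default, not the shipped value
    min_freq: int = 100,  # hypothetical default, not the shipped value
) -> Optional[str]:
    """Return the merged form if the gap between left and right looks like a broken compound."""
    # Gate 1: the probe must be confident that the whitespace is an error.
    if boundary_score < threshold:
        return None
    # Gate 2: the merged form must be a sufficiently frequent dictionary word.
    merged = left + right
    if word_freq.get(merged, 0) < min_freq:
        return None
    return merged
```

Requiring both gates is one way to keep a neural score from proposing merges that the dictionary cannot back up; raising either value trades recall for precision.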

### Changed

- The probe strategies are registered as suppression-immune so their detections survive the downstream filter cascade.

### Benchmark

- With all probe flags enabled, the spelling composite improves from `0.6436` to `0.6561` (**+0.0125**): +47 true positives, recall +3.2 pp, top-1 accuracy +1.7 pp. The cost is more false positives on clean sentences: the count of clean sentences with a false positive rises 84 → 92. Default behavior (probe off) is unchanged.

### Compatibility

- No change to default behavior; existing deployments are unaffected unless the probe flags are enabled. Adds three public strategy classes in `myspellchecker.core.validation_strategies`. The probe model artifact is required at runtime only when the flags are enabled.

## [1.7.0] - 2026-04-30

### Added
3 changes: 2 additions & 1 deletion README.md
@@ -33,6 +33,7 @@
* **Suffix-Aware Re-Segmentation**: DefaultSegmenter post-processes oversized tokens and colloquial-locative merges (e.g. `ရန်ကုန်မာ` → `[ရန်, ကုန်, မာ]`) for cleaner downstream validation.
* **Compound & Morpheme Handling**: DP-based compound resolution, ternary compound splits in morpheme correction, productive reduplication validation.
* **AI Semantic Checking (Optional)**: ONNX masked language model for context-aware validation.
* **Syllable-Span Probe (opt-in, v1.7.1)**: A frozen-encoder neural probe that improves recall on broken-compound, over-segmentation, and consonant-substitution errors. Three strategies share one small model; default-off, enabled via `use_probe_*` config flags or `MSC_USE_PROBE_*` environment variables.
* **Named Entity Recognition**: Heuristic and Transformer-based NER to reduce false positives on names and places.

### Dictionary Building Pipeline
@@ -64,7 +65,7 @@

Full documentation is available at **[docs.myspellchecker.com](https://docs.myspellchecker.com/)**.

> **What's new in v1.6.0?** See the **[Release Notes](https://docs.myspellchecker.com/reference/release-notes)** for new validation strategies (mined-confusable, pre-segmenter raw probe), the compound-split confusable boost, the skip-rule confidence gate, consonant-gated Tall-AA normalization, the flat-AA dictionary migration, and spelling-first benchmark labeling.
> **What's new in v1.7.1?** See the **[Release Notes](https://docs.myspellchecker.com/reference/release-notes)** for the opt-in **syllable-span probe** — a frozen-encoder neural enhancement (three default-off strategies sharing one small model) that improves recall on broken-compound, over-segmentation, and consonant-substitution errors (+0.0125 composite when enabled). Earlier v1.7.x work added mined-confusable detection, the cross-whitespace and compound-merge probes, the skip-rule confidence gate, and benchmark-hygiene reclassification.

### Getting Started
* **[Introduction](https://docs.myspellchecker.com/introduction)**: Overview of the library and its architecture.
79 changes: 79 additions & 0 deletions benchmarks/run_benchmark.py
@@ -899,6 +899,85 @@ def run_benchmark(
        existing_immune.add("ByT5SafetyNetStrategy")
        config.validation.suppression_immune_strategies = frozenset(existing_immune)

    # GECToR neural corrector overrides
    gector_env = _os.environ.get("MSC_USE_GECTOR", "").strip().lower()
    gector_path_env = _os.environ.get("MSC_GECTOR_MODEL_PATH", "").strip()
    gector_min_conf_env = _os.environ.get("MSC_GECTOR_MIN_CONFIDENCE", "").strip()
    gector_conf_env = _os.environ.get("MSC_GECTOR_CONFIDENCE", "").strip()
    gector_max_existing_env = _os.environ.get("MSC_GECTOR_MAX_EXISTING_ERRORS", "").strip()
    if gector_env in ("1", "true", "yes", "on"):
        config.validation.use_gector = True
        print(" use_gector: True")
    if gector_path_env:
        config.validation.gector_model_path = gector_path_env
        print(f" gector_model_path: {gector_path_env}")
    if gector_min_conf_env:
        config.validation.gector_min_confidence = float(gector_min_conf_env)
        print(f" gector_min_confidence: {gector_min_conf_env}")
    if gector_conf_env:
        config.validation.gector_confidence = float(gector_conf_env)
        print(f" gector_confidence: {gector_conf_env}")
    if gector_max_existing_env:
        config.validation.gector_max_existing_errors = int(gector_max_existing_env)
        print(f" gector_max_existing_errors: {gector_max_existing_env}")
    if config.validation.use_gector:
        existing_immune = set(config.validation.suppression_immune_strategies or ())
        existing_immune.add("GECToRValidationStrategy")
        config.validation.suppression_immune_strategies = frozenset(existing_immune)

    # Probe-based syllable-span detection overrides (v1.7.x neural enhancement).
    # Three strategies share one probe model; toggle each independently.
    probe_corr_env = _os.environ.get("MSC_USE_PROBE_CORRECTOR", "").strip().lower()
    probe_comp_env = _os.environ.get("MSC_USE_PROBE_COMPOUND", "").strip().lower()
    probe_rescue_env = _os.environ.get("MSC_USE_PROBE_RESCUE", "").strip().lower()
    probe_path_env = _os.environ.get("MSC_PROBE_MODEL_PATH", "").strip()
    probe_corr_thr_env = _os.environ.get("MSC_PROBE_CORRECTOR_THRESHOLD", "").strip()
    probe_comp_thr_env = _os.environ.get("MSC_PROBE_COMPOUND_THRESHOLD", "").strip()
    probe_comp_freq_env = _os.environ.get("MSC_PROBE_COMPOUND_MIN_FREQ", "").strip()
    probe_rescue_thr_env = _os.environ.get("MSC_PROBE_RESCUE_THRESHOLD", "").strip()
    probe_rescue_freq_env = _os.environ.get("MSC_PROBE_RESCUE_MIN_FREQ", "").strip()
    probe_max_existing_env = _os.environ.get("MSC_PROBE_MAX_EXISTING_ERRORS", "").strip()
    if probe_corr_env in ("1", "true", "yes", "on"):
        config.validation.use_probe_corrector = True
        print(" use_probe_corrector: True")
    if probe_comp_env in ("1", "true", "yes", "on"):
        config.validation.use_probe_compound = True
        print(" use_probe_compound: True")
    if probe_rescue_env in ("1", "true", "yes", "on"):
        config.validation.use_probe_segmenter_rescue = True
        print(" use_probe_segmenter_rescue: True")
    if probe_path_env:
        config.validation.probe_model_path = probe_path_env
        print(f" probe_model_path: {probe_path_env}")
    if probe_corr_thr_env:
        config.validation.probe_corrector_threshold = float(probe_corr_thr_env)
        print(f" probe_corrector_threshold: {probe_corr_thr_env}")
    if probe_comp_thr_env:
        config.validation.probe_compound_threshold = float(probe_comp_thr_env)
        print(f" probe_compound_threshold: {probe_comp_thr_env}")
    if probe_comp_freq_env:
        config.validation.probe_compound_min_freq = int(probe_comp_freq_env)
        print(f" probe_compound_min_freq: {probe_comp_freq_env}")
    if probe_rescue_thr_env:
        config.validation.probe_rescue_threshold = float(probe_rescue_thr_env)
        print(f" probe_rescue_threshold: {probe_rescue_thr_env}")
    if probe_rescue_freq_env:
        config.validation.probe_rescue_min_freq = int(probe_rescue_freq_env)
        print(f" probe_rescue_min_freq: {probe_rescue_freq_env}")
    if probe_max_existing_env:
        config.validation.probe_max_existing_errors = int(probe_max_existing_env)
        print(f" probe_max_existing_errors: {probe_max_existing_env}")
    # Auto-register ProbeValidationStrategy as suppression-immune (it uses the
    # GECToRValidationStrategy class label so the downstream filters treat it
    # the same as the legacy GECToR strategy for fusion / meta-classifier
    # bypass purposes). ProbeBoostedCompoundStrategy is NOT added to this set:
    # its dictionary-gated suggestions need to flow through the normal
    # suppression cascade (verified empirically: adding it costs +25 FP).
    if config.validation.use_probe_corrector:
        existing_immune = set(config.validation.suppression_immune_strategies or ())
        existing_immune.add("GECToRValidationStrategy")
        config.validation.suppression_immune_strategies = frozenset(existing_immune)

    # Initialize checker with specified database
    provider = SQLiteProvider(database_path=str(db_path))
    checker = SpellChecker(config=config, provider=provider)
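
The probe override block in this hunk repeats one pattern per variable: read, strip, parse, assign. A table-driven version is sketched below as one possible refactor, not as the code in this PR. The helper name `apply_probe_overrides` and the `SimpleNamespace` stand-in for `config.validation` are assumptions for illustration; the environment variable and attribute names are taken from the hunk itself.

```python
import os
from types import SimpleNamespace

_TRUTHY = {"1", "true", "yes", "on"}

# (environment variable, config attribute) for the boolean opt-in flags.
_PROBE_FLAGS = [
    ("MSC_USE_PROBE_CORRECTOR", "use_probe_corrector"),
    ("MSC_USE_PROBE_COMPOUND", "use_probe_compound"),
    ("MSC_USE_PROBE_RESCUE", "use_probe_segmenter_rescue"),
]

# (environment variable, config attribute, parser) for typed overrides.
_PROBE_VALUES = [
    ("MSC_PROBE_MODEL_PATH", "probe_model_path", str),
    ("MSC_PROBE_CORRECTOR_THRESHOLD", "probe_corrector_threshold", float),
    ("MSC_PROBE_COMPOUND_THRESHOLD", "probe_compound_threshold", float),
    ("MSC_PROBE_COMPOUND_MIN_FREQ", "probe_compound_min_freq", int),
    ("MSC_PROBE_RESCUE_THRESHOLD", "probe_rescue_threshold", float),
    ("MSC_PROBE_RESCUE_MIN_FREQ", "probe_rescue_min_freq", int),
    ("MSC_PROBE_MAX_EXISTING_ERRORS", "probe_max_existing_errors", int),
]


def apply_probe_overrides(validation, environ=None):
    """Apply probe env overrides to a config-like object; return what was set."""
    env = os.environ if environ is None else environ
    applied = {}
    for var, attr in _PROBE_FLAGS:
        if env.get(var, "").strip().lower() in _TRUTHY:
            setattr(validation, attr, True)
            applied[attr] = True
    for var, attr, parse in _PROBE_VALUES:
        raw = env.get(var, "").strip()
        if not raw:
            continue
        try:
            value = parse(raw)
        except ValueError as exc:
            # Name the offending variable instead of surfacing a bare parse error.
            raise ValueError(f"{var}={raw!r} is not a valid {parse.__name__}") from exc
        setattr(validation, attr, value)
        applied[attr] = value
    # Mirror the hunk: only the corrector is made suppression-immune.
    if getattr(validation, "use_probe_corrector", False):
        immune = set(getattr(validation, "suppression_immune_strategies", None) or ())
        immune.add("GECToRValidationStrategy")
        validation.suppression_immune_strategies = frozenset(immune)
    return applied


if __name__ == "__main__":
    cfg = SimpleNamespace(use_probe_corrector=False, suppression_immune_strategies=None)
    print(apply_probe_overrides(cfg, {"MSC_USE_PROBE_CORRECTOR": "1"}))
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment, and the wrapped `ValueError` points at the misconfigured variable when a threshold is not numeric.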
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "myspellchecker"
-version = "1.7.0"
+version = "1.7.1"
description = "Myanmar (Burmese) text intelligence library — spell checking, grammar validation, dictionary building, and AI model training"
readme = "README.md"
# NOTE: Codebase uses Python 3.10+ syntax (PEP 604 unions like `str | None`).
16 changes: 16 additions & 0 deletions src/myspellchecker/algorithms/probe/__init__.py
@@ -0,0 +1,16 @@
"""Probe-based syllable-span detection for Myanmar spell checking.

Provides a frozen-encoder + thin-Linear-head detector that achieves +0.0067
composite when paired with rule-based correction strategies via the
ProbeBoostedCompoundStrategy and ProbeValidationStrategy.

See [[Probe Hybrid Ships at +0.0067 2026-05-03]] for design and benchmark
results.
"""

from myspellchecker.algorithms.probe.syllable_span_probe import (
    FrozenSyllableSpanProbe,
    ProbeInferenceEngine,
)

__all__ = ["FrozenSyllableSpanProbe", "ProbeInferenceEngine"]
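
The "frozen-encoder + thin-Linear-head" design named in the docstring can be written compactly. The notation is introduced here for illustration only: $h_t$ is the frozen encoder's hidden state for subword $t$, $m_{s,t} \in \{0, 1\}$ marks whether subword $t$ overlaps syllable $s$, and $w$, $b$ are the weight and bias of the single Linear layer.

```latex
\bar{h}_s = \frac{\sum_{t=1}^{T} m_{s,t}\, h_t}{\max\!\left(1,\ \sum_{t=1}^{T} m_{s,t}\right)},
\qquad
p_s = \sigma\!\left(w^{\top} \bar{h}_s + b\right)
```

The $\max(1, \cdot)$ in the denominator guards against syllables that received no subwords (for example, after truncation), and only $w$ and $b$ are trained.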
198 changes: 198 additions & 0 deletions src/myspellchecker/algorithms/probe/syllable_span_probe.py
@@ -0,0 +1,198 @@
"""Frozen-encoder + thin-Linear-head syllable span probe.

Production module for the v1.7.x neural enhancement. Wraps a frozen BERT-class
encoder with a single Linear layer that emits per-syllable binary span scores.
Trained for ~5 minutes on 50K examples, head-only (no encoder fine-tuning).

Run-time inference helpers project per-syllable scores onto words via:
  - direct char-span overlap, OR
  - whitespace-adjacency (high-prob whitespace syllable attaches to the
    preceding Myanmar word; this is the broken_compound signal).

Validated artifact: ``models/probe-syllable-span-v1/`` (head.pt + config.json).
See ``30_Audits/Probe Hybrid Ships at +0.0067 2026-05-03.md``.
"""

from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

import numpy as np

from myspellchecker.utils.logging_utils import get_logger

if TYPE_CHECKING:  # pragma: no cover - type-only imports
    pass

logger = get_logger(__name__)


def _detect_device() -> str:
    """Return the best available torch device string."""
    import torch

    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


class FrozenSyllableSpanProbe:
    """Frozen BERT encoder + thin per-syllable binary head."""

    def __init__(self, encoder_path: str | Path):
        import torch
        import torch.nn as nn
        from transformers import AutoModel

        self.encoder = AutoModel.from_pretrained(str(encoder_path))
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)
        self._torch = torch
        self._nn = nn

    def to(self, device: str) -> "FrozenSyllableSpanProbe":
        self.encoder.to(device)
        self.head.to(device)
        return self

    def eval(self) -> "FrozenSyllableSpanProbe":
        self.encoder.eval()
        self.head.eval()
        return self

    @property
    def hidden_size(self) -> int:
        return int(self.encoder.config.hidden_size)

    def predict_logits(self, input_ids, attention_mask, syl_to_subword_mask):
        """Return per-syllable logits (B, S).

        Args:
            input_ids: (B, T) tensor of subword ids.
            attention_mask: (B, T) tensor of 0/1.
            syl_to_subword_mask: (B, S, T) float tensor; per syllable, mask
                of which subwords belong to it (overlap-based).
        """
        torch = self._torch
        with torch.no_grad():
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state  # (B, T, H)
        mask = syl_to_subword_mask.float()
        denom = mask.sum(dim=-1, keepdim=True).clamp(min=1)
        syl_hidden = torch.einsum("bst,bth->bsh", mask, hidden) / denom
        return self.head(syl_hidden).squeeze(-1)


@dataclass
class _SyllableSpan:
    """Helper: a syllable's text and char span in the input."""

    text: str
    start: int
    end: int


class ProbeInferenceEngine:
    """High-level inference helper used by validation strategies.

    Loads probe + tokenizer + syllable segmenter once and exposes a single
    ``score_sentence(text)`` that returns per-syllable probabilities and the
    underlying syllable spans.
    """

    def __init__(
        self,
        model_path: str | Path,
        device: str | None = None,
        max_length: int = 256,
    ):
        import torch
        from transformers import AutoTokenizer

        from myspellchecker.segmenters.regex import RegexSegmenter

        model_path = Path(model_path)
        if not model_path.exists():
            raise FileNotFoundError(
                f"Probe model directory not found: {model_path}. Expected head.pt + config.json."
            )

        config_path = model_path / "config.json"
        if not config_path.exists():
            raise FileNotFoundError(f"Missing probe config.json at {config_path}")
        cfg = json.loads(config_path.read_text())
        encoder_path = cfg["encoder"]

        head_path = model_path / "head.pt"
        if not head_path.exists():
            raise FileNotFoundError(f"Missing probe head.pt at {head_path}")

        self._torch = torch
        self.tokenizer = AutoTokenizer.from_pretrained(str(encoder_path))
        self.model = FrozenSyllableSpanProbe(encoder_path)
        self.model.head.load_state_dict(torch.load(str(head_path), map_location=device or "cpu"))
        self.device = device or _detect_device()
        self.model.to(self.device)
        self.model.eval()
        self.max_length = max_length
        self.segmenter = RegexSegmenter()
        logger.info(
            "ProbeInferenceEngine loaded: encoder=%s head=%s device=%s",
            encoder_path,
            head_path,
            self.device,
        )

    def score_sentence(self, text: str) -> tuple[list[float], list[_SyllableSpan]]:
        """Return (per-syllable probability list, syllable span list)."""
        if not text:
            return [], []

        syllables = self.segmenter.segment_syllables(text)
        if not syllables:
            return [], []

        # Recover each syllable's character span by scanning left to right.
        cursor = 0
        spans: list[_SyllableSpan] = []
        for s in syllables:
            idx = text.find(s, cursor)
            if idx == -1:
                # Syllable text not found verbatim; fall back to the cursor.
                idx = cursor
            spans.append(_SyllableSpan(text=s, start=idx, end=idx + len(s)))
            cursor = idx + len(s)

        enc = self.tokenizer(
            text,
            return_offsets_mapping=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors=None,
            padding=False,
        )
        T = len(enc["input_ids"])
        S = len(spans)
        # mask[s, t] = 1 when subword t overlaps syllable s in character space.
        mask = np.zeros((S, T), dtype=np.float32)
        for t, (cs, ce) in enumerate(enc["offset_mapping"]):
            if cs == ce:
                # Empty offsets (e.g. special tokens) cover no characters.
                continue
            for s_idx, span in enumerate(spans):
                if cs < span.end and ce > span.start:
                    mask[s_idx, t] = 1.0
        # Syllables with no subwords (e.g. past the truncation point) score 0.
        valid = mask.sum(axis=1) > 0
        if not valid.any():
            return [0.0] * S, spans

        torch = self._torch
        input_ids_t = torch.tensor(enc["input_ids"]).unsqueeze(0).to(self.device)
        am_t = torch.tensor(enc["attention_mask"]).unsqueeze(0).to(self.device)
        mask_t = torch.from_numpy(mask).unsqueeze(0).to(self.device)
        logits = self.model.predict_logits(input_ids_t, am_t, mask_t)
        probs = torch.sigmoid(logits[0]).cpu().numpy()
        probs[~valid] = 0.0
        return probs.tolist(), spans
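
The module docstring mentions projecting per-syllable scores onto words by char-span overlap or whitespace adjacency, but that helper is not part of this file. The following is a minimal sketch of one way such a projection could work. The function name `project_to_words`, the use of plain `(start, end)` tuples instead of `_SyllableSpan`, the max-pooling choice, and the `whitespace_threshold` default are all assumptions for illustration; the sketch also assumes the segmenter emits whitespace runs as their own syllables.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def project_to_words(
    syllable_probs: List[float],
    syllable_spans: List[Span],
    word_spans: List[Span],
    text: str,
    whitespace_threshold: float = 0.5,  # hypothetical value
) -> List[float]:
    """Return one score per word, max-pooled over the syllables attached to it."""
    scores = []
    for w_start, w_end in word_spans:
        best = 0.0
        for prob, (s_start, s_end) in zip(syllable_probs, syllable_spans):
            # Rule 1: direct character-span overlap between syllable and word.
            overlaps = s_start < w_end and s_end > w_start
            # Rule 2: a confident whitespace syllable that starts exactly where
            # the word ends is attributed to that preceding word.
            trailing_ws = (
                s_start == w_end
                and text[s_start:s_end].isspace()
                and prob >= whitespace_threshold
            )
            if overlaps or trailing_ws:
                best = max(best, prob)
        scores.append(best)
    return scores
```

With this rule, a high score on the space inside a broken compound raises the score of the word before the space, which is the signal the docstring calls `broken_compound`.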