thettwe · thettwe · Jun 2, 2026 · Apr 30, 2026 · Apr 30, 2026 · Jun 2, 2026
@@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.7.1] - 2026-06-02
+
+### Added
+
+- **Syllable-span probe (opt-in neural enhancement).** A frozen-encoder probe that improves detection recall on broken-compound, over-segmentation, and consonant-substitution errors. Three strategies share one small model (~430 MB) with a single inference per sentence:
+  - `ProbeBoostedCompoundStrategy` (priority 24) — combines neural span scoring with dictionary lookup to flag whitespace-broken compounds.
+  - `ProbeSegmenterRescueStrategy` (priority 26) — rescues typos the segmenter splits into adjacent dictionary-valid tokens by scoring the boundary and recovering a high-frequency merged-form candidate.
+  - `ProbeValidationStrategy` (priority 85) — a compact frozen-architecture replacement for the legacy token-classification corrector.
+- **Probe configuration.** Opt-in via `use_probe_corrector`, `use_probe_compound`, and `use_probe_segmenter_rescue` config flags (or the `MSC_USE_PROBE_*` environment variables), with tunable confidence thresholds and dictionary-frequency floors. All default off.
+
+### Changed
+
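+- `ProbeValidationStrategy` is registered as suppression-immune so its detections survive the downstream filter cascade. `ProbeBoostedCompoundStrategy` is intentionally not immune; its dictionary-gated suggestions still pass through the normal cascade.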
+
+### Benchmark
+
+- With all three probe flags enabled, the spelling composite improves from `0.6436` to `0.6561` (**+0.0125**): +47 true positives, recall +3.2pp, and top-1 accuracy +1.7pp, at the cost of clean-sentence false positives rising from 84 to 92. Default behavior (probe off) is unchanged.
+
+### Compatibility
+
+- No change to default behavior; existing deployments are unaffected unless the probe flags are enabled. Adds three public strategy classes in `myspellchecker.core.validation_strategies`. The probe model artifact is required at runtime only when the flags are enabled.
+
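As a rough illustration of how a confidence threshold and a dictionary-frequency floor can gate a broken-compound detection together, here is a minimal sketch. It is not the code of `ProbeBoostedCompoundStrategy`; the `Boundary` type, the `flag_broken_compounds` function, and the default values `0.5` and `100` are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """A whitespace gap between two tokens plus a probe score for that gap."""

    left: str
    right: str
    score: float  # assumed: probability that the whitespace is spurious


def flag_broken_compounds(boundaries, dictionary_freq, threshold=0.5, min_freq=100):
    """Return (original, merged) pairs for gaps that pass both gates.

    threshold and min_freq are illustrative defaults, not the library's.
    """
    suggestions = []
    for b in boundaries:
        if b.score < threshold:
            continue  # neural gate: the probe is not confident enough
        merged = b.left + b.right
        if dictionary_freq.get(merged, 0) < min_freq:
            continue  # dictionary gate: merged form is unknown or too rare
        suggestions.append((f"{b.left} {b.right}", merged))
    return suggestions
```

Requiring both gates means a confident probe score alone never produces a suggestion; the merged form also has to be a sufficiently frequent dictionary entry.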
 ## [1.7.0] - 2026-04-30
 
 ### Added

@@ -33,6 +33,7 @@
 *   **Suffix-Aware Re-Segmentation**: DefaultSegmenter post-processes oversized tokens and colloquial-locative merges (e.g. `ရန်ကုန်မာ` → `[ရန်, ကုန်, မာ]`) for cleaner downstream validation.
 *   **Compound & Morpheme Handling**: DP-based compound resolution, ternary compound splits in morpheme correction, productive reduplication validation.
 *   **AI Semantic Checking (Optional)**: ONNX masked language model for context-aware validation.
+*   **Syllable-Span Probe (opt-in, v1.7.1)**: A frozen-encoder neural probe that improves recall on broken-compound, over-segmentation, and consonant-substitution errors. Three strategies share one small model; default-off, enabled via `use_probe_*` config flags or `MSC_USE_PROBE_*` environment variables.
 *   **Named Entity Recognition**: Heuristic and Transformer-based NER to reduce false positives on names and places.
 
 ### Dictionary Building Pipeline
@@ -64,7 +65,7 @@
 
 Full documentation is available at **[docs.myspellchecker.com](https://docs.myspellchecker.com/)**.
 
-> **What's new in v1.6.0?** See the **[Release Notes](https://docs.myspellchecker.com/reference/release-notes)** for new validation strategies (mined-confusable, pre-segmenter raw probe), the compound-split confusable boost, the skip-rule confidence gate, consonant-gated Tall-AA normalization, the flat-AA dictionary migration, and spelling-first benchmark labeling.
+> **What's new in v1.7.1?** See the **[Release Notes](https://docs.myspellchecker.com/reference/release-notes)** for the opt-in **syllable-span probe** — a frozen-encoder neural enhancement (three default-off strategies sharing one small model) that improves recall on broken-compound, over-segmentation, and consonant-substitution errors (+0.0125 composite when enabled). Earlier v1.7.x work added mined-confusable detection, the cross-whitespace and compound-merge probes, the skip-rule confidence gate, and benchmark-hygiene reclassification.
 
 ### Getting Started
 *   **[Introduction](https://docs.myspellchecker.com/introduction)**: Overview of the library and its architecture.

@@ -899,6 +899,85 @@ def run_benchmark(
         existing_immune.add("ByT5SafetyNetStrategy")
         config.validation.suppression_immune_strategies = frozenset(existing_immune)
 
+    # GECToR neural corrector overrides
+    gector_env = _os.environ.get("MSC_USE_GECTOR", "").strip().lower()
+    gector_path_env = _os.environ.get("MSC_GECTOR_MODEL_PATH", "").strip()
+    gector_min_conf_env = _os.environ.get("MSC_GECTOR_MIN_CONFIDENCE", "").strip()
+    gector_conf_env = _os.environ.get("MSC_GECTOR_CONFIDENCE", "").strip()
+    gector_max_existing_env = _os.environ.get("MSC_GECTOR_MAX_EXISTING_ERRORS", "").strip()
+    if gector_env in ("1", "true", "yes", "on"):
+        config.validation.use_gector = True
+        print("  use_gector: True")
+    if gector_path_env:
+        config.validation.gector_model_path = gector_path_env
+        print(f"  gector_model_path: {gector_path_env}")
+    if gector_min_conf_env:
+        config.validation.gector_min_confidence = float(gector_min_conf_env)
+        print(f"  gector_min_confidence: {gector_min_conf_env}")
+    if gector_conf_env:
+        config.validation.gector_confidence = float(gector_conf_env)
+        print(f"  gector_confidence: {gector_conf_env}")
+    if gector_max_existing_env:
+        config.validation.gector_max_existing_errors = int(gector_max_existing_env)
+        print(f"  gector_max_existing_errors: {gector_max_existing_env}")
+    if config.validation.use_gector:
+        existing_immune = set(config.validation.suppression_immune_strategies or ())
+        existing_immune.add("GECToRValidationStrategy")
+        config.validation.suppression_immune_strategies = frozenset(existing_immune)
+
+    # Probe-based syllable-span detection overrides (v1.7.x neural enhancement).
+    # Three strategies share one probe model; toggle each independently.
+    probe_corr_env = _os.environ.get("MSC_USE_PROBE_CORRECTOR", "").strip().lower()
+    probe_comp_env = _os.environ.get("MSC_USE_PROBE_COMPOUND", "").strip().lower()
+    probe_rescue_env = _os.environ.get("MSC_USE_PROBE_RESCUE", "").strip().lower()
+    probe_path_env = _os.environ.get("MSC_PROBE_MODEL_PATH", "").strip()
+    probe_corr_thr_env = _os.environ.get("MSC_PROBE_CORRECTOR_THRESHOLD", "").strip()
+    probe_comp_thr_env = _os.environ.get("MSC_PROBE_COMPOUND_THRESHOLD", "").strip()
+    probe_comp_freq_env = _os.environ.get("MSC_PROBE_COMPOUND_MIN_FREQ", "").strip()
+    probe_rescue_thr_env = _os.environ.get("MSC_PROBE_RESCUE_THRESHOLD", "").strip()
+    probe_rescue_freq_env = _os.environ.get("MSC_PROBE_RESCUE_MIN_FREQ", "").strip()
+    probe_max_existing_env = _os.environ.get("MSC_PROBE_MAX_EXISTING_ERRORS", "").strip()
+    if probe_corr_env in ("1", "true", "yes", "on"):
+        config.validation.use_probe_corrector = True
+        print("  use_probe_corrector: True")
+    if probe_comp_env in ("1", "true", "yes", "on"):
+        config.validation.use_probe_compound = True
+        print("  use_probe_compound: True")
+    if probe_rescue_env in ("1", "true", "yes", "on"):
+        config.validation.use_probe_segmenter_rescue = True
+        print("  use_probe_segmenter_rescue: True")
+    if probe_path_env:
+        config.validation.probe_model_path = probe_path_env
+        print(f"  probe_model_path: {probe_path_env}")
+    if probe_corr_thr_env:
+        config.validation.probe_corrector_threshold = float(probe_corr_thr_env)
+        print(f"  probe_corrector_threshold: {probe_corr_thr_env}")
+    if probe_comp_thr_env:
+        config.validation.probe_compound_threshold = float(probe_comp_thr_env)
+        print(f"  probe_compound_threshold: {probe_comp_thr_env}")
+    if probe_comp_freq_env:
+        config.validation.probe_compound_min_freq = int(probe_comp_freq_env)
+        print(f"  probe_compound_min_freq: {probe_comp_freq_env}")
+    if probe_rescue_thr_env:
+        config.validation.probe_rescue_threshold = float(probe_rescue_thr_env)
+        print(f"  probe_rescue_threshold: {probe_rescue_thr_env}")
+    if probe_rescue_freq_env:
+        config.validation.probe_rescue_min_freq = int(probe_rescue_freq_env)
+        print(f"  probe_rescue_min_freq: {probe_rescue_freq_env}")
+    if probe_max_existing_env:
+        config.validation.probe_max_existing_errors = int(probe_max_existing_env)
+        print(f"  probe_max_existing_errors: {probe_max_existing_env}")
+    # Register ProbeValidationStrategy as suppression-immune. It reports under
+    # the GECToRValidationStrategy class label, so the downstream filters give
+    # it the same fusion / meta-classifier bypass as the legacy GECToR
+    # strategy. ProbeBoostedCompoundStrategy is deliberately NOT added: its
+    # dictionary-gated suggestions must pass through the normal suppression
+    # cascade (verified empirically: making it immune adds 25 false positives).
+    if config.validation.use_probe_corrector:
+        existing_immune = set(config.validation.suppression_immune_strategies or ())
+        existing_immune.add("GECToRValidationStrategy")
+        config.validation.suppression_immune_strategies = frozenset(existing_immune)
+
     # Initialize checker with specified database
     provider = SQLiteProvider(database_path=str(db_path))
     checker = SpellChecker(config=config, provider=provider)
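The probe override block above repeats the same read, strip, parse, assign, print pattern ten times. One possible way to tighten it is a table-driven helper, sketched below. The environment variable and attribute names are copied from the hunk; `TRUTHY`, `PROBE_OVERRIDES`, and `apply_env_overrides` are hypothetical names, and the explicit `ValueError` on a malformed number is an added assumption (the original lets `float()`/`int()` raise on their own).

```python
from types import SimpleNamespace

TRUTHY = frozenset({"1", "true", "yes", "on"})

# (environment variable, config attribute, value type), taken from the hunk above.
PROBE_OVERRIDES = [
    ("MSC_USE_PROBE_CORRECTOR", "use_probe_corrector", bool),
    ("MSC_USE_PROBE_COMPOUND", "use_probe_compound", bool),
    ("MSC_USE_PROBE_RESCUE", "use_probe_segmenter_rescue", bool),
    ("MSC_PROBE_MODEL_PATH", "probe_model_path", str),
    ("MSC_PROBE_CORRECTOR_THRESHOLD", "probe_corrector_threshold", float),
    ("MSC_PROBE_COMPOUND_THRESHOLD", "probe_compound_threshold", float),
    ("MSC_PROBE_COMPOUND_MIN_FREQ", "probe_compound_min_freq", int),
    ("MSC_PROBE_RESCUE_THRESHOLD", "probe_rescue_threshold", float),
    ("MSC_PROBE_RESCUE_MIN_FREQ", "probe_rescue_min_freq", int),
    ("MSC_PROBE_MAX_EXISTING_ERRORS", "probe_max_existing_errors", int),
]


def apply_env_overrides(validation, env, specs=PROBE_OVERRIDES):
    """Copy recognised environment overrides onto a config-like object.

    Returns a dict of the attributes that were actually changed.
    """
    applied = {}
    for var, attr, kind in specs:
        raw = env.get(var, "").strip()
        if not raw:
            continue  # unset or blank: leave the configured value alone
        if kind is bool:
            if raw.lower() not in TRUTHY:
                continue  # like the original, flags are only ever switched on
            value = True
        else:
            try:
                value = kind(raw)
            except ValueError as exc:
                raise ValueError(f"{var}={raw!r} is not a valid {kind.__name__}") from exc
        setattr(validation, attr, value)
        applied[attr] = value
    return applied


if __name__ == "__main__":
    # SimpleNamespace stands in for config.validation in this demo.
    validation = SimpleNamespace(use_probe_corrector=False)
    changed = apply_env_overrides(validation, {"MSC_USE_PROBE_CORRECTOR": "1"})
    for name, value in changed.items():
        print(f"  {name}: {value}")
```

In the benchmark script the call would presumably look like `apply_env_overrides(config.validation, _os.environ)`, and the same table shape could cover the GECToR overrides.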

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "myspellchecker"
-version = "1.7.0"
+version = "1.7.1"
 description = "Myanmar (Burmese) text intelligence library — spell checking, grammar validation, dictionary building, and AI model training"
 readme = "README.md"
 # NOTE: Codebase uses Python 3.10+ syntax (PEP 604 unions like `str | None`).

@@ -0,0 +1,16 @@
+"""Probe-based syllable-span detection for Myanmar spell checking.
+
+Provides a frozen-encoder + thin-Linear-head detector that achieves +0.0067
+composite when paired with rule-based correction strategies via the
+ProbeBoostedCompoundStrategy and ProbeValidationStrategy.
+
+See [[Probe Hybrid Ships at +0.0067 2026-05-03]] for design and benchmark
+results.
+"""
+
+from myspellchecker.algorithms.probe.syllable_span_probe import (
+    FrozenSyllableSpanProbe,
+    ProbeInferenceEngine,
+)
+
+__all__ = ["FrozenSyllableSpanProbe", "ProbeInferenceEngine"]
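The "frozen-encoder + thin-Linear-head" design named in this docstring comes down to two steps per sentence: mean-pool the encoder's subword vectors into one vector per syllable, then apply a single linear layer. The sketch below mirrors that arithmetic in plain Python with no torch dependency, so the shapes are easy to follow. The function names are hypothetical, and the toy vectors stand in for real encoder output.

```python
import math


def pool_syllables(hidden, mask):
    """Mean-pool subword vectors into syllable vectors.

    hidden: T x H list of subword vectors.
    mask:   S x T list of 0/1 flags; mask[s][t] == 1 if subword t belongs to syllable s.
    """
    width = len(hidden[0])
    pooled = []
    for row in mask:
        denom = max(sum(row), 1)  # same idea as clamp(min=1): empty rows pool to zeros
        pooled.append(
            [sum(m * h[j] for m, h in zip(row, hidden)) / denom for j in range(width)]
        )
    return pooled


def linear_head(pooled, weight, bias):
    """Apply one linear unit (H -> 1) to every syllable vector, giving logits."""
    return [sum(w * x for w, x in zip(weight, vec)) + bias for vec in pooled]


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]  # three subwords, H = 2
    mask = [[1, 1, 0], [0, 0, 1]]  # two syllables
    logits = linear_head(pool_syllables(hidden, mask), weight=[1.0, -1.0], bias=0.5)
    print([round(sigmoid(z), 3) for z in logits])
```

Because only the linear layer has trainable parameters, training touches `hidden_size + 1` weights, which is consistent with the short head-only training run the module describes.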
@@ -0,0 +1,198 @@
+"""Frozen-encoder + thin-Linear-head syllable span probe.
+
+Production module for the v1.7.x neural enhancement. Wraps a frozen BERT-class
+encoder with a single Linear layer that emits per-syllable binary span scores.
+Trained for ~5 minutes on 50K examples, head-only (no encoder fine-tuning).
+
+Run-time inference helpers project per-syllable scores onto words via:
+  - direct char-span overlap, OR
+  - whitespace-adjacency (high-prob whitespace syllable attaches to the
+    preceding Myanmar word — the broken_compound signal).
+
+Validated artifact: ``models/probe-syllable-span-v1/`` (head.pt + config.json).
+See ``30_Audits/Probe Hybrid Ships at +0.0067 2026-05-03.md``.
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+import numpy as np
+
+from myspellchecker.utils.logging_utils import get_logger
+
+if TYPE_CHECKING:  # pragma: no cover - type-only imports
+    pass
+
+logger = get_logger(__name__)
+
+
+def _detect_device() -> str:
+    """Return the best available torch device string."""
+    import torch
+
+    if torch.cuda.is_available():
+        return "cuda"
+    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+
+
+class FrozenSyllableSpanProbe:
+    """Frozen BERT encoder + thin per-syllable binary head."""
+
+    def __init__(self, encoder_path: str | Path):
+        import torch
+        import torch.nn as nn
+        from transformers import AutoModel
+
+        self.encoder = AutoModel.from_pretrained(str(encoder_path))
+        for param in self.encoder.parameters():
+            param.requires_grad = False
+        self.head = nn.Linear(self.encoder.config.hidden_size, 1)
+        self._torch = torch
+        self._nn = nn
+
+    def to(self, device: str) -> "FrozenSyllableSpanProbe":
+        self.encoder.to(device)
+        self.head.to(device)
+        return self
+
+    def eval(self) -> "FrozenSyllableSpanProbe":
+        self.encoder.eval()
+        self.head.eval()
+        return self
+
+    @property
+    def hidden_size(self) -> int:
+        return int(self.encoder.config.hidden_size)
+
+    def predict_logits(self, input_ids, attention_mask, syl_to_subword_mask):
+        """Return per-syllable logits (B, S).
+
+        Args:
+            input_ids: (B, T) tensor of subword ids.
+            attention_mask: (B, T) tensor of 0/1.
+            syl_to_subword_mask: (B, S, T) float tensor; per syllable, mask
+                of which subwords belong to it (overlap-based).
+        """
+        torch = self._torch
+        with torch.no_grad():
+            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+            hidden = out.last_hidden_state  # (B, T, H)
+            mask = syl_to_subword_mask.float()
+            denom = mask.sum(dim=-1, keepdim=True).clamp(min=1)
+            syl_hidden = torch.einsum("bst,bth->bsh", mask, hidden) / denom
+            return self.head(syl_hidden).squeeze(-1)
+
+
+@dataclass
+class _SyllableSpan:
+    """Helper: a syllable's text and char span in the input."""
+
+    text: str
+    start: int
+    end: int
+
+
+class ProbeInferenceEngine:
+    """High-level inference helper used by validation strategies.
+
+    Loads probe + tokenizer + syllable segmenter once and exposes a single
+    ``score_sentence(text)`` that returns per-syllable probabilities and the
+    underlying syllable spans.
+    """
+
+    def __init__(
+        self,
+        model_path: str | Path,
+        device: str | None = None,
+        max_length: int = 256,
+    ):
+        import torch
+        from transformers import AutoTokenizer
+
+        from myspellchecker.segmenters.regex import RegexSegmenter
+
+        model_path = Path(model_path)
+        if not model_path.exists():
+            raise FileNotFoundError(
+                f"Probe model directory not found: {model_path}. Expected head.pt + config.json."
+            )
+
+        config_path = model_path / "config.json"
+        if not config_path.exists():
+            raise FileNotFoundError(f"Missing probe config.json at {config_path}")
+        cfg = json.loads(config_path.read_text())
+        encoder_path = cfg["encoder"]
+
+        head_path = model_path / "head.pt"
+        if not head_path.exists():
+            raise FileNotFoundError(f"Missing probe head.pt at {head_path}")
+
+        self._torch = torch
+        self.tokenizer = AutoTokenizer.from_pretrained(str(encoder_path))
+        self.model = FrozenSyllableSpanProbe(encoder_path)
+        self.model.head.load_state_dict(torch.load(str(head_path), map_location=device or "cpu"))
+        self.device = device or _detect_device()
+        self.model.to(self.device)
+        self.model.eval()
+        self.max_length = max_length
+        self.segmenter = RegexSegmenter()
+        logger.info(
+            "ProbeInferenceEngine loaded: encoder=%s head=%s device=%s",
+            encoder_path,
+            head_path,
+            self.device,
+        )
+
+    def score_sentence(self, text: str) -> tuple[list[float], list[_SyllableSpan]]:
+        """Return (per-syllable probability list, syllable span list)."""
+        if not text:
+            return [], []
+
+        syllables = self.segmenter.segment_syllables(text)
+        if not syllables:
+            return [], []
+
+        cursor = 0
+        spans: list[_SyllableSpan] = []
+        for s in syllables:
+            idx = text.find(s, cursor)
+            if idx == -1:
+                idx = cursor
+            spans.append(_SyllableSpan(text=s, start=idx, end=idx + len(s)))
+            cursor = idx + len(s)
+
+        enc = self.tokenizer(
+            text,
+            return_offsets_mapping=True,
+            truncation=True,
+            max_length=self.max_length,
+            return_tensors=None,
+            padding=False,
+        )
+        T = len(enc["input_ids"])
+        S = len(spans)
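+        # mask[s, t] = 1 when subword t's character range overlaps syllable s.
+        mask = np.zeros((S, T), dtype=np.float32)
+        for t, (cs, ce) in enumerate(enc["offset_mapping"]):
+            if cs == ce:
+                continue  # special tokens carry an empty (0, 0) offset
+            for s_idx, span in enumerate(spans):
+                if cs < span.end and ce > span.start:
+                    mask[s_idx, t] = 1.0
+        # Syllables with no overlapping subword (e.g. past the truncation
+        # point) are reported with probability 0.0.
+        valid = mask.sum(axis=1) > 0
+        if not valid.any():
+            return [0.0] * S, spans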
+
+        torch = self._torch
+        input_ids_t = torch.tensor(enc["input_ids"]).unsqueeze(0).to(self.device)
+        am_t = torch.tensor(enc["attention_mask"]).unsqueeze(0).to(self.device)
+        mask_t = torch.from_numpy(mask).unsqueeze(0).to(self.device)
+        logits = self.model.predict_logits(input_ids_t, am_t, mask_t)
+        probs = torch.sigmoid(logits[0]).cpu().numpy()
+        probs[~valid] = 0.0
+        return probs.tolist(), spans
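The module docstring describes projecting per-syllable scores onto words in two ways: direct char-span overlap, and whitespace adjacency. That projection is not part of this file. The sketch below shows one way a caller might do it with the `(probs, spans)` pair returned by `score_sentence`. The function name, the plain `(start, end)` tuples, the max aggregation, and the `0.5` threshold are all assumptions made for the example.

```python
def project_to_words(probs, syl_spans, word_spans, text, threshold=0.5):
    """Return one score per word: the highest syllable probability mapped to it.

    probs:      per-syllable probabilities.
    syl_spans:  (start, end) character spans, one per syllable.
    word_spans: (start, end) character spans of words, in text order.
    threshold:  minimum probability for a whitespace syllable to count (assumed).
    """
    word_scores = [0.0] * len(word_spans)
    for prob, (s_start, s_end) in zip(probs, syl_spans):
        if text[s_start:s_end].strip() == "":
            # Whitespace adjacency: a confident whitespace syllable is credited
            # to the closest word that ends at or before it.
            if prob < threshold:
                continue
            preceding = [i for i, (_, w_end) in enumerate(word_spans) if w_end <= s_start]
            if preceding:
                i = preceding[-1]
                word_scores[i] = max(word_scores[i], prob)
            continue
        # Direct overlap: the syllable shares at least one character with the word.
        for i, (w_start, w_end) in enumerate(word_spans):
            if s_start < w_end and s_end > w_start:
                word_scores[i] = max(word_scores[i], prob)
    return word_scores
```

With real output the spans would come from the `_SyllableSpan` objects, for example `[(s.start, s.end) for s in spans]`. Taking the maximum keeps a single high-scoring syllable from being averaged away inside a long word, but a mean or a count above threshold would be equally defensible.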