feat: add Malayalam (ml) language support — keep-English-dominant policy #71
Draft
mmcky wants to merge 1 commit into
Adds Malayalam as a target language following native-speaker review (issue #70). Unlike zh-cn/fa, which fully translate technical terms, Malayalam uses a keep-English-dominant policy: Kerala STEM/finance learners use English technical terminology natively, so translating or transliterating it reads archaic. Technical terms stay in English with Malayalam grammar wrapping around them; only everyday connective words are translated.

- src/language-config.ts: add `ml` entry with keep-English rules (keep technical terms English, attach Malayalam morphology to English roots, keep proper names in Latin, enforce per-document consistency).
- glossary/ml.json: 60-term seed draft with 47 technical terms pinned to English (ml == en) and 13 everyday words translated, with Malayalam values taken from the reviewer's worked examples.
- dist-action/: rebuilt bundle (npm run build) incl. glossary/ml.json.

Seed draft pending native-speaker validation via a calibration batch. Follow-ups (see #70): per-language Opus default + allowlist entry, tests, docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
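The per-document consistency rule lends itself to a mechanical check. The sketch below assumes a glossary that maps each English term to a single expected rendering, and flags translated terms whose English form still appears in the output alongside the Malayalam value. `findMixedTerms` and the `Glossary` shape are hypothetical names for illustration; they are not taken from `src/language-config.ts` or the real `glossary/ml.json` schema.

```typescript
// Hypothetical glossary shape (an assumption, not the real schema):
// English term -> expected rendering in the Malayalam output.
type Glossary = Record<string, string>;

// Return translated terms (ml != en) that occur in BOTH their English and
// Malayalam forms within one document, i.e. candidates for inconsistent handling.
function findMixedTerms(output: string, glossary: Glossary): string[] {
  const lowered = output.toLowerCase();
  const mixed: string[] = [];
  for (const en of Object.keys(glossary)) {
    const ml = glossary[en];
    if (ml === en) continue; // pinned to English: only one form exists
    const hasEnglish = lowered.indexOf(en.toLowerCase()) !== -1;
    const hasMalayalam = output.indexOf(ml) !== -1;
    if (hasEnglish && hasMalayalam) mixed.push(en);
  }
  return mixed;
}

// Illustrative entries only, built from tokens quoted in this PR.
const sampleGlossary: Glossary = {
  bond: "bond",
  relationship: "ബന്ധം",
};

// Token strings for illustration, not real translated sentences.
console.log(findMixedTerms("bond-ന്റെ ബന്ധം", sampleGlossary));
console.log(findMixedTerms("ബന്ധം ... relationship", sampleGlossary));
```

Plain substring matching is deliberately naive: it would also flag a parenthetical first use such as `ബന്ധം (relationship)`, so a real check would need to exempt that pattern.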
Summary
Adds Malayalam (`ml`) as a target language, following the native-speaker review on #70 by Adisankar Manoj Thanuja (Kerala finance educator). Refs #70: this is the first increment; it does not close the tracking issue.

Unlike `zh-cn`/`fa`, which fully translate technical terms, Malayalam uses a keep-English-dominant policy. Kerala has no Malayalam-medium STEM/finance education infrastructure and a globally-mobile, English-fluent workforce; learners use English technical terminology natively, so translating or transliterating it reads archaic ("government gazette," not classroom). Technical terms stay in English with Malayalam grammar wrapping around them; only everyday connective words are translated.

Policy
The characteristic pattern is Malayalam grammar attached to English roots inline (`economy-യിലെ`, `bond-ന്റെ`, `asset classes-ൽ`) and English verbs with a Malayalam auxiliary (`process ചെയ്ത്`, `return ചെയ്യുന്നു`). This English-root-plus-Malayalam-inflection behaviour is an LLM strength that a sentence-level NMT system can't replicate.

What's in this PR
- `src/language-config.ts`: `ml` entry with six keep-English rules (keep technical terms English, attach Malayalam morphology to English roots, keep proper names in Latin, enforce per-document consistency, classroom register with sparing parenthetical-first-use).
- `glossary/ml.json`: 60-term seed with 47 technical terms pinned to English (`ml == en`) + 13 everyday words translated, every Malayalam value taken directly from the reviewer's worked examples (nothing machine-guessed).
- `dist-action/`: rebuilt bundle (`npm run build`), including `dist-action/glossary/ml.json`.

Status: 🟡 draft / seed pending native-speaker validation
This is intentionally a seed for a calibration batch: run real lectures on this config (Opus, `ml`) and have Adisankar flag each over-translation so we tune the glossary/rules from real output rather than theory. The primary risk for `ml` is consistency (a term must be handled identically on every occurrence), not over-translation.

Decisions still open (see #70)
- […] `ബന്ധം (relationship)`): keep sparing.
- […] `ml == en`); the per-term `treatment` field is deferred unless calibration shows we need it.

Follow-ups (not in this PR)
- Per-language Opus default for `ml` + add `claude-opus-4-8` to the model allowlist in `src/inputs.ts`.
- Tests for the `ml` config/glossary.
- Docs: `docs/user/language-config.md`, `docs/user/glossary.md`, `glossary/README.md`.

Verification
- `npm run build`: clean; `dist-action/` bundle regenerated and committed (CI freshness guard).
- `npm run lint`: 0 errors (pre-existing warnings only).
- `npx jest` on language-config, inputs, sync-orchestrator, translator, integration, cli-smoke: 166 tests pass.
- `glossary/ml.json`: valid JSON, 60 terms (47 keep-English, 13 translated).

🤖 Generated with Claude Code
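
The keep-English / translated split reported for `glossary/ml.json` under Verification could be reproduced with a small check along these lines. This is a minimal sketch, assuming the glossary can be read as a flat list of `{ en, ml }` pairs; the real file's schema is not shown in this PR, and `GlossaryEntry` and `partitionGlossary` are hypothetical names.

```typescript
// Hypothetical entry shape: the actual schema of glossary/ml.json is an assumption.
interface GlossaryEntry {
  en: string; // English source term
  ml: string; // rendering used in Malayalam output; equals `en` when pinned to English
}

interface GlossaryPartition {
  keepEnglish: GlossaryEntry[];
  translated: GlossaryEntry[];
}

// Split entries into terms pinned to English (ml == en) and terms with a Malayalam value.
function partitionGlossary(entries: GlossaryEntry[]): GlossaryPartition {
  const keepEnglish: GlossaryEntry[] = [];
  const translated: GlossaryEntry[] = [];
  for (const entry of entries) {
    (entry.ml === entry.en ? keepEnglish : translated).push(entry);
  }
  return { keepEnglish, translated };
}

// Illustrative sample only; not the contents of the real seed glossary.
const sampleEntries: GlossaryEntry[] = [
  { en: "bond", ml: "bond" },
  { en: "asset classes", ml: "asset classes" },
  { en: "relationship", ml: "ബന്ധം" },
];

const partition = partitionGlossary(sampleEntries);
console.log(
  `${partition.keepEnglish.length} keep-English, ${partition.translated.length} translated`
);
```

Run against the real glossary, the two counts would be expected to match the 47 / 13 split stated above; a mismatch would suggest an entry was pinned or translated unintentionally.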