feat: add Malayalam (ml) language support — keep-English-dominant policy #71

Draft
mmcky wants to merge 1 commit into main from feat/malayalam-ml

@mmcky mmcky commented Jun 26, 2026

Contributor

Summary

Adds Malayalam (ml) as a target language, following the native-speaker review on #70 by Adisankar Manoj Thanuja (Kerala finance educator). Refs #70 — this is the first increment; it does not close the tracking issue.

Unlike zh-cn/fa, which fully translate technical terms, Malayalam uses a keep-English-dominant policy. Kerala has no Malayalam-medium STEM/finance education infrastructure and a globally-mobile, English-fluent workforce; learners use English technical terminology natively, so translating or transliterating it reads archaic ("government gazette," not classroom). Technical terms stay in English with Malayalam grammar wrapping around them; only everyday connective words are translated.

Policy

| Category | Treatment |
| --- | --- |
| Technical terms (economics, finance, statistics, programming) | keep English |
| Acronyms (GDP, RBI), institutions (Federal Reserve) | keep English |
| Proper names (economists, researchers) | keep English/Latin |
| Everyday non-technical words (country, year, before) | translate to Malayalam |
| Transliteration | effectively unused |

The characteristic pattern is Malayalam grammar attached to English roots inline — economy-യിലെ, bond-ന്റെ, asset classes-ൽ — and English verbs with a Malayalam auxiliary (process ചെയ്ത്, return ചെയ്യുന്നു). This English-root-plus-Malayalam-inflection behaviour is an LLM strength a sentence-level NMT system can't replicate.
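The inline pattern can be made concrete with a small detector. This is only a sketch: `findInflectedRoots` is a hypothetical helper, not something in this PR, and it assumes the English root and the Malayalam suffix are always joined by a hyphen, as in the examples above. It does not cover the verb-plus-auxiliary form (process ചെയ്ത്), which is space-separated.

```typescript
// Hypothetical helper (not part of this PR): finds English roots that carry
// a hyphen-attached Malayalam inflection, e.g. "economy-യിലെ".
// \u0D00-\u0D7F is the Unicode Malayalam block.
const INFLECTED_ROOT = /([A-Za-z]+(?: [A-Za-z]+)*)-([\u0D00-\u0D7F]+)/;

interface InflectedRoot {
  root: string;   // English part, possibly multi-word ("asset classes")
  suffix: string; // Malayalam inflection attached after the hyphen
}

function findInflectedRoots(text: string): InflectedRoot[] {
  const re = new RegExp(INFLECTED_ROOT.source, "g");
  const found: InflectedRoot[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push({ root: m[1], suffix: m[2] });
  }
  return found;
}

const hits = findInflectedRoots("economy-യിലെ bond-ന്റെ asset classes-ൽ");
const roots = hits.map((h) => h.root);
```

One caveat with this sketch: a multi-word root absorbs any run of English words directly before the hyphen, so it over-captures when ordinary English words precede the term.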

What's in this PR

  • src/language-config.ts: ml entry with six keep-English rules (keep technical terms English, attach Malayalam morphology to English roots, keep proper names in Latin, enforce per-document consistency, classroom register with a sparing parenthetical on first use).
  • glossary/ml.json — 60-term seed: 47 technical terms pinned to English (ml == en) + 13 everyday words translated, every Malayalam value taken directly from the reviewer's worked examples (nothing machine-guessed).
  • dist-action/ — rebuilt bundle (npm run build), including dist-action/glossary/ml.json.
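To illustrate what "pinned to English (ml == en)" means in practice, here is a minimal sketch of splitting a glossary into pinned and translated terms. It assumes a flat English-to-Malayalam map, which may not match the real schema of glossary/ml.json. `partitionGlossary` and the sample entries are hypothetical; the terms are borrowed from examples in this description, not copied from the glossary file.

```typescript
// Assumed shape: a flat map from the English term to its Malayalam rendering.
// The real glossary/ml.json schema may differ.
type Glossary = Record<string, string>;

interface GlossarySplit {
  keepEnglish: string[]; // entries where ml == en (pinned to English)
  translated: string[];  // entries with a distinct Malayalam value
}

function partitionGlossary(glossary: Glossary): GlossarySplit {
  const split: GlossarySplit = { keepEnglish: [], translated: [] };
  for (const en of Object.keys(glossary)) {
    if (glossary[en] === en) {
      split.keepEnglish.push(en);
    } else {
      split.translated.push(en);
    }
  }
  return split;
}

// Illustrative entries only, not the contents of the seed file.
const sampleGlossary: Glossary = {
  GDP: "GDP",
  bond: "bond",
  economy: "economy",
  relationship: "ബന്ധം",
};

const sampleSplit = partitionGlossary(sampleGlossary);
```

Run against the real seed, a check like this would be expected to report 47 pinned and 13 translated terms, matching the counts stated above.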

Status: 🟡 draft / seed pending native-speaker validation

This is intentionally a seed for a calibration batch: run real lectures on this config (Opus, ml) and have Adisankar flag each over-translation so we tune the glossary/rules from real output rather than theory. The primary risk for ml is consistency (a term must be handled identically on every occurrence), not over-translation.
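The consistency risk could be screened mechanically before the native-speaker pass. The sketch below is one possible check, not the project's method. `findMixedRenderings` is a hypothetical name, and it assumes translated terms come as a flat English-to-Malayalam map. It flags a term when a document contains both its English and its Malayalam form, after discounting the sanctioned "Malayalam (English)" gloss on first use.

```typescript
// Counts non-overlapping occurrences of a substring.
function countOccurrences(text: string, needle: string): number {
  if (needle.length === 0) return 0;
  return text.split(needle).length - 1;
}

// Hypothetical check: returns the English keys of translated terms that
// appear in BOTH forms within one document. Matching is a crude,
// case-sensitive substring test, so "relationship" also hits "relationships".
function findMixedRenderings(
  text: string,
  translated: Record<string, string>
): string[] {
  const flagged: string[] = [];
  for (const en of Object.keys(translated)) {
    const ml = translated[en];
    // Collapse the parenthetical gloss "ml (en)" to just "ml" so that it
    // does not count as an English occurrence.
    const stripped = text.split(ml + " (" + en + ")").join(ml);
    const hasEnglish = countOccurrences(stripped, en) > 0;
    const hasMalayalam = countOccurrences(stripped, ml) > 0;
    if (hasEnglish && hasMalayalam) {
      flagged.push(en);
    }
  }
  return flagged;
}

const translatedTerms = { relationship: "ബന്ധം" };
const consistentDoc = "GDP ബന്ധം (relationship) bond-ന്റെ ബന്ധം";
const mixedDoc = "GDP ബന്ധം, bond relationship";

const consistentFlags = findMixedRenderings(consistentDoc, translatedTerms);
const mixedFlags = findMixedRenderings(mixedDoc, translatedTerms);
```

A check like this only catches mixed renderings of translated terms. A pinned technical term that was translated on some occurrences would need the reviewer's list of offending Malayalam forms, which does not exist until the calibration batch has run.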

Decisions still open (see #70)

  • Proper-names policy — defaulted to keep-English/Latin; reviewer to confirm.
  • Parenthetical-on-first-use convention (ബന്ധം (relationship)) — keep sparing.
  • v1 stays zero-schema-change (ml == en); the per-term treatment field is deferred unless calibration shows we need it.

Follow-ups (not in this PR)

  • Per-language Opus default for ml + add claude-opus-4-8 to the model allowlist in src/inputs.ts.
  • Tests for the ml config/glossary.
  • Docs: docs/user/language-config.md, docs/user/glossary.md, glossary/README.md.

Verification

  • npm run build — clean; dist-action/ bundle regenerated and committed (CI freshness guard).
  • npm run lint — 0 errors (pre-existing warnings only).
  • npx jest on language-config, inputs, sync-orchestrator, translator, integration, cli-smoke — 166 tests pass.
  • glossary/ml.json — valid JSON, 60 terms (47 keep-English, 13 translated).

🤖 Generated with Claude Code

Adds Malayalam as a target language following native-speaker review
(issue #70). Unlike zh-cn/fa, which fully translate technical terms,
Malayalam uses a keep-English-dominant policy: Kerala STEM/finance
learners use English technical terminology natively, so translating or
transliterating it reads archaic. Technical terms stay in English with
Malayalam grammar wrapping around them; only everyday connective words
are translated.

- src/language-config.ts: add `ml` entry with keep-English rules
  (keep technical terms English, attach Malayalam morphology to English
  roots, keep proper names in Latin, enforce per-document consistency).
- glossary/ml.json: 60-term seed draft — 47 technical terms pinned to
  English (ml == en), 13 everyday words translated with Malayalam values
  taken from the reviewer's worked examples.
- dist-action/: rebuilt bundle (npm run build) incl. glossary/ml.json.

Seed draft pending native-speaker validation via a calibration batch.
Follow-ups (see #70): per-language Opus default + allowlist entry,
tests, docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>