feat: add Malayalam (ml) as a target language — research, glossary strategy & model recommendation

## Summary

A native Malayalam-speaking contributor — a financial-literacy educator in Kerala doing non-commercial, pro-bono educational work — has asked to adapt QuantEcon programming and economics material for Malayalam learners. This issue captures the research we did into whether and how we can add **Malayalam (`ml`)** as a target language in `action-translation`, and recommends how to best support it (model tier, glossary strategy, and the role of Indic-specialist models). Mechanically, adding the language is the same path we already used for French (#68) and Japanese (#69); the substance below is about getting the quality right for a genuinely under-resourced language pair (en→ml).

**TL;DR: 🟡 feasible and worthwhile, with guardrails.** en→ml hits a real resource cliff in exactly the direction this tool runs, and the scarcity is in the *training data and tooling*, not the language. Ship it as a human-reviewed draft pipeline, default `ml` to **Opus** rather than the global Sonnet, and **redesign the glossary** around a three-way translate/transliterate/keep-English policy instead of cloning the zh-cn/fa pattern.

> **A note on terminology.** Throughout, "under-resourced" refers narrowly to the availability of *digital and NLP resources* — parallel corpora, benchmarks, term banks, and pretrained models — **not** to the language itself. Malayalam is a classical language of India with ~38 million speakers and a deep literary and scholarly tradition. The gap this issue addresses is in the **tooling and training data the AI industry has invested in it so far** — and the point of this work is to help close that gap.

---

## Research findings (frontloaded)

### 1. Translation-quality landscape

The dominant signal across every Indic benchmark is the well-resourced vs under-resourced gap, and Malayalam sits well below the best-resourced Indic languages. Crucially, the weak direction is exactly ours: generating Malayalam script and morphology (en→ml) is far harder than reading it (ml→en).

- **The en→ml generation direction collapses for general LLMs.** On IndicGenBench (FLORES-IN, chrF, one-shot), GPT-4's en→ml score was **28.4 vs PaLM-2-L's 59.7 and BLOOMZ's 66.2**, less than half of either. ml→en is much closer (GPT-4 58.6 vs PaLM-2 68.0). This tool runs en→ml, so this is the relevant cliff.
- **Understanding does not imply generation.** GPT-4o leads Malayalam *comprehension* on MILU at ~71%, but that is still ~10 points below English and ~6 below Hindi — and comprehension ≠ writing fluent, correct Malayalam.
- **Dedicated NMT beats general LLMs on generation by a wide margin.** On FLORES-200, GPT-3.5 ml→en chrF was 50.10 vs IndicTrans2 66.20 and Google 66.40 — a ~16 chrF gap even in the *easier* direction.
- **Claude-on-Malayalam is unmeasured.** Anthropic publishes no per-language Malayalam translation scores. Its only public multilingual table covers 14 languages and reports MMLU *understanding* relative to English; Malayalam is absent, and the only Indic entries are the two best-resourced (Hindi 96.7%, Bengali 95.4%, on Sonnet 4.5). Claude's strong MT reputation (WMT24 wins, blind professional ratings) is built entirely on high-resource European/East-Asian pairs. **Confidence that this transfers to Malayalam is low** and should be communicated that way to the contributor.
- **Failure modes are intrinsic, not transient:** agglutinative morphology, complex sandhi (word-joining), long compounds, native-script generation, and English code-mixing. Our content adds a second axis of difficulty — dense quantitative-economics and Python prose, a domain Malayalam corpora barely cover.
- **Scaling helps the data-scarce floor more than the high-resource ceiling, but does not close the gap.** Relative gains from larger models are largest exactly where training data is thinnest, but no frontier model exceeds ~58% average on the hardest Indic languages. Bigger models reduce post-edit burden; they do not eliminate it.

### 2. Glossary strategy — the single most important finding

Real Kerala STEM and finance practice is a **three-way split** driven by how established each concept is — not the uniform full-translation pattern our zh-cn and fa glossaries use (those translate even acronyms, e.g. GDP → 国内生产总值 / تولید ناخالص داخلی). Forcing a native coinage for every imported term would produce technically-correct-but-unread Malayalam — the exact register gap that exists between official term banks and actual classroom/Wikipedia usage. Kerala STEM education genuinely code-mixes English terms ("Manglish" is a real, prevalent register, not sloppiness).

| Category (by `context`) | Treatment | Example |
|---|---|---|
| Established economics / finance / macro-micro concepts | **translate** | inflation → പണപ്പെരുപ്പം, demand → ചോദനം, interest rate → പലിശനിരക്ക് |
| Established core-math concepts | **translate** | determinant → സാരണികം, transpose → പക്ഷാന്തരിതം, row/column → നിര/വരി |
| Imported / computing / named-method terms | **transliterate** (Malayalam script) | matrix → മാട്രിക്സ്, function → ഫങ്ഷൻ, algorithm → അൽഗൊരിതം |
| Proper names (economists, mathematicians; ~40 entries) | **transliterate** | Lionel Robbins → ലൊയ്നൽ റോബിൻസ് (same policy fa/zh already apply) |
| Acronyms | **native expansion** as the `ml` value + transliterated short form | GDP → മൊത്ത ആഭ്യന്തര ഉത്പാദനം (ജി.ഡി.പി) |
| Code tokens, library/package names, file formats, keywords | **keep-english** (Latin) | numpy, pandas, def, for, import, AR(1) |

### 3. Indic-specialist models (IndicTrans2 / Sarvam) — auxiliary, never in the pipeline

The tool is Claude-only and structure-preserving by design, and that constraint is correct. IndicTrans2 (AI4Bharat, MIT-licensed, covers Malayalam) and similar systems are plain-sentence MT engines with no notion of MyST structure, code fences, math, or directives; wiring them into the pipeline would break the document-preservation guarantees that are the whole point of the tool. They do add value **offline, at authoring time**: seeding the `ml` glossary draft, cross-checking each term against a Claude candidate (agreement → high confidence, divergence → flag for the human reviewer), and optionally back-translating ml→en as a QA spot-check. Treat them as a way to raise the draft floor and route reviewer attention, not as ground truth.
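
The agreement/divergence routing described above could be sketched roughly as follows. This is a minimal illustration, not part of the tool: `TermCandidates`, `normalizeMl`, and `crossCheck` are hypothetical names, and it assumes both candidates have already been collected offline as plain strings. Treating only exact matches (after NFC normalization and whitespace cleanup) as agreement is a deliberately strict assumption.

```typescript
// Hypothetical shape: one English term with two independently produced Malayalam candidates.
interface TermCandidates {
  en: string;
  claude: string;      // candidate from the Claude draft
  indicTrans2: string; // candidate from an offline IndicTrans2 run
}

interface CrossCheckResult {
  en: string;
  confidence: "high" | "needs-review";
  candidates: string[]; // distinct candidates to show the human reviewer
}

// Normalize so visually identical strings compare equal (NFC, trimmed, single spaces).
function normalizeMl(s: string): string {
  return s.normalize("NFC").trim().replace(/\s+/g, " ");
}

// Agreement -> "high"; divergence -> "needs-review" with both candidates kept.
function crossCheck(terms: TermCandidates[]): CrossCheckResult[] {
  return terms.map((t): CrossCheckResult => {
    const a = normalizeMl(t.claude);
    const b = normalizeMl(t.indicTrans2);
    const agree = a === b;
    return {
      en: t.en,
      confidence: agree ? "high" : "needs-review",
      candidates: agree ? [a] : [a, b],
    };
  });
}
```

Neither system is treated as ground truth here: a "high" result only means the two drafts coincide, so the native reviewer still has the final say.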

### 4. Community Malayalam LLMs and datasets (assessed, not recommended as the translator)

- **MalayaLLM** (LLaMA-2 7B / Gemma-2 9B community fine-tune): not viable as the translator — too small for faithful technical translation, no structure handling, no published benchmarks, and self-hosting would require re-architecting the Claude-only pipeline. At most a reviewer-side second opinion.
- **Parallel corpora** (Samanantar ~49.7M pairs, BPCC ~230M): mainly useful as glossary-mining sources; their value is largely already baked into IndicTrans2. We are not training a model, so raw scale is low-value to us.
- **University of Surrey / TechLiebe en→ml dataset** (~8k segments, finance/legal/news, with human scores, post-edits, and an error taxonomy): the best-matched single resource for the contributor's finance focus — useful for glossary seeding, few-shot exemplars, and as a ready-made review rubric. Public release was slated for April 2026; confirm availability and license before use.

### Verification note

The translation-quality claims above were independently fact-checked against primary sources (Anthropic docs, MILU/IndicGenBench/FLORES papers). One MILU framing was downgraded from "supported" to "overstated" (it over-attributed Malayalam's score to agglutinative morphology, which the paper names only for Tamil/Telugu). The Claude-on-Malayalam gap and the en→ml weakness are well-supported; the absolute Claude-ml quality remains an inference, not a measurement.

### Sources

- Anthropic multilingual support docs (14-language MMLU-understanding table; Malayalam absent)
- MILU — Multi-task Indic Language Understanding, NAACL 2025 (arXiv 2411.02538)
- IndicGenBench, ACL 2024 (arXiv 2404.16816)
- "Assessing Translation capabilities of LLMs…", EAMT 2024 (arXiv 2311.09216)
- IndicTrans2 (arXiv 2305.16307; https://github.com/AI4Bharat/IndicTrans2)
- Analysis of Indic capabilities / IndicParam scaling (arXiv 2501.13912, 2512.00333)
- Malayalam Wikipedia term conventions; Wiktionary "About Malayalam"; Kerala Bhasha Institute / CSTT term banks; fincash Malayalam financial-literacy material
- MalayaLLM (https://github.com/VishnuPJ/MalayaLLM); Surrey/TechLiebe en→ml dataset (TechXplore, Jan 2026)

---

## How we might best support Malayalam in this tool

### Model: per-language Opus override

Default `ml` to **Opus** (`claude-opus-4-8`) rather than inheriting the global `claude-sonnet-4-6`. The cost premium (~1.7×) is negligible at one contributor's volume, scaling buys the most where training data is thinnest, and the dominant cost here is the contributor's review time — which better raw output directly reduces. Keep Sonnet for the proven zh-cn/fa. This needs:

- A **per-language model override** (model is currently a single global input in `src/inputs.ts` / `action.yml`), and
- Adding `claude-opus-4-8` to the `VALID_MODEL_PATTERNS` allowlist in `src/inputs.ts` (today any model newer than `claude-opus-4-6` triggers a warning; it is non-blocking but should be cleaned up).
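
One possible shape for the override, as a sketch only. The real contents of `src/inputs.ts` are not shown in this issue, so the regexes below are an illustrative stand-in for `VALID_MODEL_PATTERNS`, and `LANGUAGE_MODEL_OVERRIDES`, `isAllowedModel`, and `resolveModel` are hypothetical names. The sketch keeps the allowlist check non-blocking, matching the current warn-only behavior.

```typescript
// Illustrative stand-in for the allowlist in src/inputs.ts (actual patterns are an assumption).
const VALID_MODEL_PATTERNS: RegExp[] = [
  /^claude-sonnet-4-\d+$/,
  /^claude-opus-4-\d+$/,
];

// Hypothetical per-language override table; unlisted languages inherit the global model.
const LANGUAGE_MODEL_OVERRIDES: Readonly<Record<string, string | undefined>> = {
  ml: "claude-opus-4-8",
};

interface ResolvedModel {
  model: string;
  warning?: string; // set when the model is not on the allowlist (non-blocking)
}

function isAllowedModel(model: string): boolean {
  return VALID_MODEL_PATTERNS.some((pattern) => pattern.test(model));
}

// Pick the per-language model if one is configured, otherwise the global input.
function resolveModel(targetLanguage: string, globalModel: string): ResolvedModel {
  const model = LANGUAGE_MODEL_OVERRIDES[targetLanguage] ?? globalModel;
  return isAllowedModel(model)
    ? { model }
    : { model, warning: `Model "${model}" is not in the allowlist` };
}
```

With this shape, zh-cn and fa keep whatever the global input says, and only `ml` is redirected.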

### Glossary: redesign, don't clone

Add an ml-specific per-term `treatment` field (`translate | transliterate | keep-english`) to the glossary objects, defaulted by `context`, applying the three-way split in the table above. This is an additive change to the existing `{en, <lang>, context}` schema; zh-cn/fa keep their current shape. Land `glossary/ml.json` as a 🟡 machine draft via the established draft-then-native-review workflow, then route to the contributor for review before promoting to ✅.
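
A rough sketch of what the additive field and its context-based default might look like. The `context` labels in the lookup table are hypothetical (the real glossary values may differ), as are the names `MlGlossaryEntry`, `DEFAULT_TREATMENT_BY_CONTEXT`, and `resolveTreatment`. Falling back to `translate` for unknown contexts is an assumption a reviewer would need to confirm.

```typescript
type Treatment = "translate" | "transliterate" | "keep-english";

// Existing schema per this issue is {en, <lang>, context}; `treatment` is the proposed addition.
interface MlGlossaryEntry {
  en: string;
  ml: string;
  context: string;
  treatment?: Treatment; // optional: defaults from `context` when omitted
}

// Hypothetical context labels mapped to the three-way policy.
const DEFAULT_TREATMENT_BY_CONTEXT: Readonly<Record<string, Treatment | undefined>> = {
  economics: "translate",
  mathematics: "translate",
  computing: "transliterate",
  "proper-name": "transliterate",
  code: "keep-english",
};

// An explicit per-term value wins; otherwise use the context default;
// unknown contexts fall back to "translate" (assumed default).
function resolveTreatment(entry: MlGlossaryEntry): Treatment {
  return entry.treatment ?? DEFAULT_TREATMENT_BY_CONTEXT[entry.context] ?? "translate";
}

// Example entries taken from the table in this issue.
const sampleEntries: MlGlossaryEntry[] = [
  { en: "inflation", ml: "പണപ്പെരുപ്പം", context: "economics" },
  { en: "matrix", ml: "മാട്രിക്സ്", context: "computing" },
  { en: "numpy", ml: "numpy", context: "code" },
];
```

Because `treatment` is optional, existing zh-cn/fa entries would still type-check against a shared base shape without modification.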

### Process: native review is the quality gate

For a language this under-served by existing MT resources, with zero published Claude baseline, native-speaker review is not a nicety — it is the quality bar. Wire the contributor's review in as a required step, treat the first lectures as a calibration batch, and use reviewer edit-effort as our de-facto quality metric (since none is published). Feed their corrections back into the glossary.
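
One way to turn reviewer edit-effort into a number is a normalized edit distance between the machine draft and the reviewer's corrected text. The sketch below is an assumption about how that could be measured, not an agreed metric: `levenshtein` and `editEffort` are hypothetical helpers, and the comparison runs over Unicode code points, which is only a rough proxy for Malayalam because one visible character often spans several code points.

```typescript
// Classic two-row Levenshtein distance over arrays of code points.
function levenshtein(a: string[], b: string[]): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// 0 means the reviewer changed nothing; 1 means the text was effectively rewritten.
function editEffort(draft: string, reviewed: string): number {
  const a = Array.from(draft.normalize("NFC"));
  const b = Array.from(reviewed.normalize("NFC"));
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}
```

Tracked per lecture across the calibration batch, a falling score would suggest the glossary feedback loop is working; it says nothing about errors the reviewer missed.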

### Keep the pipeline Claude-only and structure-safe

Use IndicTrans2 offline to seed/cross-check the glossary; never in the live path. Add a lint check that Latin code tokens, package names, and file formats survive verbatim: the pressure to produce morphologically correct Malayalam makes it likelier than for fa/zh that a token gets transliterated or altered.
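
A minimal sketch of such a lint, under the assumption that protected tokens appear as inline code spans or fenced blocks in both the source and the translation. `extractCodeSpans`, `countOccurrences`, and `lintCodeSurvival` are hypothetical names; tokens that appear as bare prose (not marked as code) would need a separate check that this sketch does not attempt.

```typescript
const TICK = "`";
const FENCE = TICK.repeat(3);

// Matches fenced blocks first, then single-backtick inline spans.
const CODE_PATTERN = new RegExp(
  FENCE + "[\\s\\S]*?" + FENCE + "|" + TICK + "[^" + TICK + "\\n]+" + TICK,
  "g",
);

function extractCodeSpans(text: string): string[] {
  return text.match(CODE_PATTERN) ?? [];
}

function countOccurrences(items: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  items.forEach((item) => counts.set(item, (counts.get(item) ?? 0) + 1));
  return counts;
}

interface LintFinding {
  token: string;    // the code span as written in the source
  expected: number; // occurrences in the source
  found: number;    // verbatim occurrences in the translation
}

// Report every source code span that appears fewer times, verbatim, in the translation.
function lintCodeSurvival(source: string, translated: string): LintFinding[] {
  const expected = countOccurrences(extractCodeSpans(source));
  const found = countOccurrences(extractCodeSpans(translated));
  const findings: LintFinding[] = [];
  expected.forEach((n, token) => {
    const m = found.get(token) ?? 0;
    if (m < n) findings.push({ token, expected: n, found: m });
  });
  return findings;
}
```

An empty result means every code span from the source survived; any finding points the reviewer at a token that was transliterated, inflected, or dropped.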

---

## Proposed implementation checklist

- [ ] Add an `'ml'` entry to `LANGUAGE_CONFIGS` in `src/language-config.ts` (Malayalam punctuation/register rules)
- [ ] Add `treatment` field support for ml glossary terms (schema + translator prompt wiring)
- [ ] Generate `glossary/ml.json` as a machine draft (357-term set) using the three-way `treatment` policy; seed/cross-check with IndicTrans2
- [ ] Add a per-language model override and default `ml` → `claude-opus-4-8`; add it to the model allowlist
- [ ] `npm run build` to bundle `dist-action/glossary/ml.json` (so the dist-freshness CI guard passes and the loader auto-resolves it)
- [ ] Add a code/structure lint (Latin code tokens survive verbatim)
- [ ] Tests + docs (`docs/user/language-config.md`, `docs/user/glossary.md`, `glossary/README.md`)
- [ ] Route the draft glossary + first lectures to the native-speaker contributor for review (required gate)

## Open questions for maintainers

- Is a per-language model override worth adding now, or do we ship `ml` on a globally-overridden Opus run for the first batch and generalize later?
- Do we want the `treatment` field to be ml-only, or is it worth generalizing the glossary schema across languages?
- Should we wait on the Surrey en→ml dataset (license TBD) before drafting the glossary, or draft now and reconcile later?


