Summary
A native Malayalam-speaking contributor — a financial-literacy educator in Kerala doing non-commercial, pro-bono educational work — has asked to adapt QuantEcon programming and economics material for Malayalam learners. This issue captures the research we did into whether and how we can add Malayalam (ml) as a target language in action-translation, and recommends how to best support it (model tier, glossary strategy, and the role of Indic-specialist models). Mechanically, adding the language is the same path we already used for French (#68) and Japanese (#69); the substance below is about getting the quality right for a genuinely under-resourced language pair (en→ml).
TL;DR — 🟡 feasible and worthwhile, with guardrails. en→ml hits a real resource cliff in the direction this tool runs in — and the scarcity is in the training data and tooling, not the language. Ship it as a human-reviewed draft pipeline, default ml to Opus rather than the global Sonnet, and redesign the glossary with a three-way translate/transliterate/keep-English policy instead of cloning the zh-cn/fa pattern.
A note on terminology. Throughout, "under-resourced" refers narrowly to the availability of digital and NLP resources — parallel corpora, benchmarks, term banks, and pretrained models — not to the language itself. Malayalam is a classical language of India with ~38 million speakers and a deep literary and scholarly tradition. The gap this issue addresses is in the tooling and training data the AI industry has invested in it so far — and the point of this work is to help close that gap.
Research findings (frontloaded)
1. Translation-quality landscape
The dominant signal across every Indic benchmark is the well-resourced vs under-resourced gap, and Malayalam sits well below the best-resourced Indic languages. Crucially, the weak direction is exactly ours: generating Malayalam script and morphology (en→ml) is far harder than reading it (ml→en).
- The en→ml generation direction collapses for general LLMs. On IndicGenBench (FLORES-IN, chrF, one-shot), GPT-4's en→ml score was 28.4 vs PaLM-2-L's 59.7 and BLOOMZ's 66.2 — roughly half. ml→en is much closer (GPT-4 58.6 vs PaLM-2 68.0). This tool runs en→ml, so this is the relevant cliff.
- Understanding does not imply generation. GPT-4o leads Malayalam comprehension on MILU at ~71%, but that is still ~10 points below English and ~6 below Hindi — and comprehension ≠ writing fluent, correct Malayalam.
- Dedicated NMT beats general LLMs on generation by a wide margin. On FLORES-200, GPT-3.5 ml→en chrF was 50.10 vs IndicTrans2 66.20 and Google 66.40 — a ~16 chrF gap even in the easier direction.
- Claude-on-Malayalam is unmeasured. Anthropic publishes no per-language Malayalam translation scores. Its only public multilingual table is 14 languages, MMLU understanding relative to English, and Malayalam is absent — the only Indic entries are the two best-resourced (Hindi 96.7%, Bengali 95.4%, on Sonnet 4.5). Claude's strong MT reputation (WMT24 wins, blind professional ratings) is built entirely on high-resource European/East-Asian pairs. Confidence that this transfers to Malayalam is low and should be communicated that way to the contributor.
- Failure modes are intrinsic, not transient: agglutinative morphology, complex sandhi (word-joining), long compounds, native-script generation, and English code-mixing. Our content adds a second axis of difficulty — dense quantitative-economics and Python prose, a domain Malayalam corpora barely cover.
- Scaling helps the data-scarce floor more than the high-resource ceiling, but does not close the gap. Relative gains from larger models are largest exactly where training data is thinnest, but no frontier model exceeds ~58% average on the hardest Indic languages. Bigger models reduce post-edit burden; they do not eliminate it.
2. Glossary strategy — the single most important finding
Real Kerala STEM and finance practice is a three-way split driven by how established each concept is — not the uniform full-translation pattern our zh-cn and fa glossaries use (those translate even acronyms, e.g. GDP → 国内生产总值 / تولید ناخالص داخلی). Forcing a native coinage for every imported term would produce technically-correct-but-unread Malayalam — the exact register gap that exists between official term banks and actual classroom/Wikipedia usage. Kerala STEM education genuinely code-mixes English terms ("Manglish" is a real, prevalent register, not sloppiness).
| Category (by context) | Treatment | Example |
|---|---|---|
| Established economics / finance / macro-micro concepts | translate | inflation → പണപ്പെരുപ്പം, demand → ചോദനം, interest rate → പലിശനിരക്ക് |
| Established core-math concepts | translate | determinant → സാരണികം, transpose → പക്ഷാന്തരിതം, row/column → നിര/വരി |
| Imported / computing / named-method terms | transliterate (Malayalam script) | matrix → മാട്രിക്സ്, function → ഫങ്ഷൻ, algorithm → അൽഗൊരിതം |
| Proper names (economists, mathematicians; ~40 entries) | transliterate | Lionel Robbins → ലൊയ്നൽ റോബിൻസ് (same policy fa/zh already apply) |
| Acronyms | native expansion as the ml value + transliterated short form | GDP → മൊത്ത ആഭ്യന്തര ഉത്പാദനം (ജി.ഡി.പി) |
| Code tokens, library/package names, file formats, keywords | keep-english (Latin) | numpy, pandas, def, for, import, AR(1) |
3. Indic-specialist models (IndicTrans2 / Sarvam) — auxiliary, never in the pipeline
The tool is Claude-only and structure-preserving by design, and that constraint is correct. IndicTrans2 (AI4Bharat, MIT-licensed, covers Malayalam) and similar are plain-sentence MT systems with no notion of MyST structure, code fences, math, or directives — wiring them into the pipeline would break the document-preservation guarantees that are the whole point of the tool. They do add value offline, at authoring time: seeding the ml glossary draft, cross-checking each term against a Claude candidate (agreement → high confidence, divergence → flag for the human reviewer), and optionally back-translating ml→en as a QA spot-check. Treat them as a way to raise the draft floor and route reviewer attention — not as ground truth.
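The offline term cross-check described above could be sketched roughly as follows. This is a minimal illustration, not part of the tool: the type and function names are hypothetical, and exact string equality after Unicode NFC and whitespace normalization is only a crude proxy for "agreement" (spelling variants of the same Malayalam term would still be flagged for the reviewer).

```typescript
// Hypothetical names throughout: this is an authoring-time helper sketch,
// not code from action-translation.
type TermConfidence = "high" | "needs-review";

interface TermCrossCheck {
  en: string;
  claudeCandidate: string;
  nmtCandidate: string;
  confidence: TermConfidence;
}

// Normalize to NFC, trim, and collapse internal whitespace so that trivial
// encoding or spacing differences do not count as disagreement.
function normalizeTerm(term: string): string {
  return term.normalize("NFC").trim().replace(/\s+/g, " ");
}

// Agreement between the two candidates -> "high"; divergence -> "needs-review".
function crossCheckTerm(
  en: string,
  claudeCandidate: string,
  nmtCandidate: string,
): TermCrossCheck {
  const agree = normalizeTerm(claudeCandidate) === normalizeTerm(nmtCandidate);
  return {
    en,
    claudeCandidate,
    nmtCandidate,
    confidence: agree ? "high" : "needs-review",
  };
}

// Both candidates match after normalization, so this entry is "high".
console.log(crossCheckTerm("inflation", "പണപ്പെരുപ്പം", " പണപ്പെരുപ്പം ").confidence);
```

Neither candidate is treated as ground truth here; the flag only decides where the native reviewer's attention goes first.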
4. Community Malayalam LLMs and datasets (assessed, not recommended as the translator)
- MalayaLLM (LLaMA-2 7B / Gemma-2 9B community fine-tune): not viable as the translator — too small for faithful technical translation, no structure handling, no published benchmarks, and self-hosting would require re-architecting the Claude-only pipeline. At most a reviewer-side second opinion.
- Parallel corpora (Samanantar ~49.7M pairs, BPCC ~230M): mainly useful as glossary-mining sources; their value is largely already baked into IndicTrans2. We are not training a model, so raw scale is low-value to us.
- University of Surrey / TechLiebe en→ml dataset (~8k segments, finance/legal/news, with human scores, post-edits, and an error taxonomy): the best-matched single resource for the contributor's finance focus — useful for glossary seeding, few-shot exemplars, and as a ready-made review rubric. Public release was slated for April 2026; confirm availability and license before use.
Verification note
The translation-quality claims above were independently fact-checked against primary sources (Anthropic docs, MILU/IndicGenBench/FLORES papers). One MILU framing was downgraded from "supported" to "overstated" (it over-attributed Malayalam's score to agglutinative morphology, which the paper names only for Tamil/Telugu). The Claude-on-Malayalam gap and the en→ml weakness are well-supported; the absolute Claude-ml quality remains an inference, not a measurement.
Sources
- Anthropic multilingual support docs (14-language MMLU-understanding table; Malayalam absent)
- MILU — Multi-task Indic Language Understanding, NAACL 2025 (arXiv 2411.02538)
- IndicGenBench, ACL 2024 (arXiv 2404.16816)
- "Assessing Translation capabilities of LLMs…", EAMT 2024 (arXiv 2311.09216)
- IndicTrans2 (arXiv 2305.16307; https://github.com/AI4Bharat/IndicTrans2)
- Analysis of Indic capabilities / IndicParam scaling (arXiv 2501.13912, 2512.00333)
- Malayalam Wikipedia term conventions; Wiktionary "About Malayalam"; Kerala Bhasha Institute / CSTT term banks; fincash Malayalam financial-literacy material
- MalayaLLM (https://github.com/VishnuPJ/MalayaLLM); Surrey/TechLiebe en→ml dataset (TechXplore, Jan 2026)
How we might best support Malayalam in this tool
Model: per-language Opus override
Default ml to Opus (claude-opus-4-8) rather than inheriting the global claude-sonnet-4-6. The cost premium (~1.7×) is negligible at one contributor's volume, scaling buys the most where training data is thinnest, and the dominant cost here is the contributor's review time — which better raw output directly reduces. Keep Sonnet for the proven zh-cn/fa. This needs:
- A per-language model override (model is currently a single global input in src/inputs.ts / action.yml), and
- Adding claude-opus-4-8 to the VALID_MODEL_PATTERNS allowlist in src/inputs.ts (today it only warns past claude-opus-4-6; the warning is non-blocking but should be cleaned up).
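One possible shape for the per-language override is sketched below. Everything in it is an assumption for illustration: the map name, the function name, and the idea of resolving the model at call time are not taken from src/inputs.ts or action.yml, and the VALID_MODEL_PATTERNS allowlist change is not covered because its current shape is not shown in this issue.

```typescript
// Hypothetical sketch: these names are illustrative and are not taken from
// src/inputs.ts or action.yml.
const MODEL_OVERRIDES_BY_LANGUAGE: { [language: string]: string } = {
  ml: "claude-opus-4-8",
};

// Return the per-language model if one is configured, else the global model.
function resolveModelForLanguage(
  targetLanguage: string,
  globalModel: string,
  overrides: { [language: string]: string } = MODEL_OVERRIDES_BY_LANGUAGE,
): string {
  // Own-property check so keys such as "constructor" never match by accident.
  if (Object.prototype.hasOwnProperty.call(overrides, targetLanguage)) {
    return overrides[targetLanguage];
  }
  return globalModel;
}

console.log(resolveModelForLanguage("ml", "claude-sonnet-4-6"));    // per-language override applies
console.log(resolveModelForLanguage("zh-cn", "claude-sonnet-4-6")); // falls back to the global model
```

Keeping the override as a plain lookup with a global fallback means zh-cn and fa behave exactly as they do today unless an entry is added for them.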
Glossary: redesign, don't clone
Add an ml-specific per-term treatment field (translate | transliterate | keep-english) to the glossary objects, defaulted by context, applying the three-way split in the table above. This is an additive change to the existing {en, <lang>, context} schema; zh-cn/fa keep their current shape. Land glossary/ml.json as a 🟡 machine draft via the established draft-then-native-review workflow, then route to the contributor for review before promoting to ✅.
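As a rough sketch of the additive schema change, the treatment field could be optional and defaulted from context. Only the {en, ml, context} fields come from the existing schema; the type names, the context labels, and the fallback to translate are assumptions made for this example, and the real context values in the glossary may differ.

```typescript
// Hypothetical types: only {en, <lang>, context} comes from the existing schema.
type Treatment = "translate" | "transliterate" | "keep-english";

interface MlGlossaryEntry {
  en: string;
  ml: string;
  context: string;
  treatment?: Treatment; // optional, so entries without it remain valid
}

// Assumed context labels; the glossary's real context values may differ.
const DEFAULT_TREATMENT_BY_CONTEXT: { [context: string]: Treatment } = {
  economics: "translate",
  mathematics: "translate",
  acronym: "translate", // native expansion is the ml value
  computing: "transliterate",
  "proper-name": "transliterate",
  code: "keep-english",
};

// An explicit per-term treatment wins; otherwise default by context;
// unknown contexts fall back to "translate" (an assumption of this sketch).
function resolveTreatment(entry: MlGlossaryEntry): Treatment {
  if (entry.treatment !== undefined) {
    return entry.treatment;
  }
  if (Object.prototype.hasOwnProperty.call(DEFAULT_TREATMENT_BY_CONTEXT, entry.context)) {
    return DEFAULT_TREATMENT_BY_CONTEXT[entry.context];
  }
  return "translate";
}

const sampleEntries: MlGlossaryEntry[] = [
  { en: "inflation", ml: "പണപ്പെരുപ്പം", context: "economics" },
  { en: "matrix", ml: "മാട്രിക്സ്", context: "mathematics", treatment: "transliterate" },
  { en: "numpy", ml: "numpy", context: "code" },
];
for (const entry of sampleEntries) {
  console.log(entry.en, "->", resolveTreatment(entry));
}
```

The matrix entry shows why a per-term field is needed in addition to the context default: it sits in a mathematics context but follows the transliterate policy.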
Process: native review is the quality gate
For a language this under-served by existing MT resources, with zero published Claude baseline, native-speaker review is not a nicety — it is the quality bar. Wire the contributor's review in as a required step, treat the first lectures as a calibration batch, and use reviewer edit-effort as our de-facto quality metric (since none is published). Feed their corrections back into the glossary.
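Reviewer edit-effort could be approximated as a normalized edit distance between the machine draft and the reviewed text. The sketch below is one possible definition, not an established metric for this project: the function names are hypothetical, and it counts Unicode code points, which overweights changes inside Malayalam conjuncts (a single visible cluster can span several code points).

```typescript
// Levenshtein distance over two sequences of units (here: code points).
function editDistance(a: string[], b: string[]): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const insertion = current[j - 1] + 1;
      const deletion = previous[j] + 1;
      current.push(Math.min(substitution, insertion, deletion));
    }
    previous = current;
  }
  return previous[b.length];
}

// Hypothetical metric: 0 means the reviewer changed nothing, 1 means the
// draft was effectively rewritten. Both inputs are NFC-normalized first.
function editEffort(machineDraft: string, reviewed: string): number {
  const draftUnits = Array.from(machineDraft.normalize("NFC"));
  const reviewedUnits = Array.from(reviewed.normalize("NFC"));
  const longest = Math.max(draftUnits.length, reviewedUnits.length);
  return longest === 0 ? 0 : editDistance(draftUnits, reviewedUnits) / longest;
}

console.log(editEffort("പലിശനിരക്ക്", "പലിശനിരക്ക്")); // identical draft and review
```

Tracking this per lecture across the calibration batch would show whether glossary corrections are actually reducing the post-edit burden; a word-level or grapheme-level variant may turn out to be a better fit once real review data exists.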
Keep the pipeline Claude-only and structure-safe
Use IndicTrans2 offline to seed and cross-check the glossary, never in the live path. Add a lint check that Latin code tokens, package names, and file formats survive verbatim: the pressure to produce morphologically correct Malayalam pushes the model to inflect or transliterate such tokens, so leakage is likelier than for fa/zh.
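A lint check of this kind might start as small as the sketch below. It is an assumption-laden illustration: the function names are hypothetical, it only inspects inline code spans delimited by single backticks, and it does not look inside fenced code blocks, math, or MyST directives, which the real check would also need to cover.

```typescript
// Collect the contents of inline code spans (`...`) from a Markdown/MyST string.
// Fence markers (three backticks) do not match this pattern.
function extractInlineCodeTokens(text: string): string[] {
  const tokens: string[] = [];
  const pattern = /`([^`\n]+)`/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push(match[1]);
  }
  return tokens;
}

// Return each source token (once) that has no verbatim counterpart among the
// translated text's inline code spans.
function findMissingCodeTokens(source: string, translated: string): string[] {
  const translatedTokens = extractInlineCodeTokens(translated);
  const missing: string[] = [];
  for (const token of extractInlineCodeTokens(source)) {
    if (translatedTokens.indexOf(token) === -1 && missing.indexOf(token) === -1) {
      missing.push(token);
    }
  }
  return missing;
}

const sourceLine = "Use `numpy` and `pandas` to load `data.csv`.";
const translatedLine = "`numpy`, `pandas` ഉപയോഗിച്ച് `ഡാറ്റ.csv` ലോഡ് ചെയ്യുക.";
console.log(findMissingCodeTokens(sourceLine, translatedLine)); // reports the altered file name
```

A non-empty result would fail the lint step and point the reviewer at the exact tokens that were transliterated or inflected.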
Open questions for maintainers
- Is a per-language model override worth adding now, or do we ship ml on a globally-overridden Opus run for the first batch and generalize later?
- Do we want the treatment field to be ml-only, or is it worth generalizing the glossary schema across languages?
- Should we wait on the Surrey en→ml dataset (license TBD) before drafting the glossary, or draft now and reconcile later?
Proposed implementation checklist
- Add 'ml' entry to LANGUAGE_CONFIGS in src/language-config.ts (Malayalam punctuation/register rules)
- Add treatment field support for ml glossary terms (schema + translator prompt wiring)
- Land glossary/ml.json as a machine draft (357-term set) using the three-way treatment policy; seed/cross-check with IndicTrans2
- Default ml → claude-opus-4-8; add it to the model allowlist
- Run npm run build to bundle dist-action/glossary/ml.json (so the dist-freshness CI guard passes and the loader auto-resolves it)
- Update docs (docs/user/language-config.md, docs/user/glossary.md, glossary/README.md)