feat: add Malayalam (ml) as a target language — research, glossary strategy & model recommendation #70

@mmcky

Summary

A native Malayalam-speaking contributor — a financial-literacy educator in Kerala doing non-commercial, pro-bono educational work — has asked to adapt QuantEcon programming and economics material for Malayalam learners. This issue captures the research we did into whether and how we can add Malayalam (ml) as a target language in action-translation, and recommends how to best support it (model tier, glossary strategy, and the role of Indic-specialist models). Mechanically, adding the language is the same path we already used for French (#68) and Japanese (#69); the substance below is about getting the quality right for a genuinely under-resourced language pair (en→ml).

TL;DR: 🟡 feasible and worthwhile, with guardrails. en→ml, the direction this tool runs in, hits a real resource cliff, and the scarcity is in the training data and tooling, not the language. Ship it as a human-reviewed draft pipeline, default ml to Opus rather than the global Sonnet, and redesign the glossary with a three-way translate/transliterate/keep-English policy instead of cloning the zh-cn/fa pattern.

A note on terminology. Throughout, "under-resourced" refers narrowly to the availability of digital and NLP resources — parallel corpora, benchmarks, term banks, and pretrained models — not to the language itself. Malayalam is a classical language of India with ~38 million speakers and a deep literary and scholarly tradition. The gap this issue addresses is in the tooling and training data the AI industry has invested in it so far — and the point of this work is to help close that gap.


Research findings (frontloaded)

1. Translation-quality landscape

The dominant signal across every Indic benchmark is the well-resourced vs under-resourced gap, and Malayalam sits well below the best-resourced Indic languages. Crucially, the weak direction is exactly ours: generating Malayalam script and morphology (en→ml) is far harder than reading it (ml→en).

  • The en→ml generation direction collapses for general LLMs. On IndicGenBench (FLORES-IN, chrF, one-shot), GPT-4's en→ml score was 28.4 vs PaLM-2-L's 59.7 and BLOOMZ's 66.2 — roughly half. ml→en is much closer (GPT-4 58.6 vs PaLM-2 68.0). This tool runs en→ml, so this is the relevant cliff.
  • Understanding does not imply generation. GPT-4o leads Malayalam comprehension on MILU at ~71%, but that is still ~10 points below English and ~6 below Hindi — and comprehension ≠ writing fluent, correct Malayalam.
  • Dedicated NMT beats general LLMs on generation by a wide margin. On FLORES-200, GPT-3.5 ml→en chrF was 50.10 vs IndicTrans2 66.20 and Google 66.40 — a ~16 chrF gap even in the easier direction.
  • Claude-on-Malayalam is unmeasured. Anthropic publishes no per-language Malayalam translation scores. Its only public multilingual table covers 14 languages and reports MMLU understanding relative to English; Malayalam is absent, and the only Indic entries are the two best-resourced (Hindi 96.7%, Bengali 95.4%, on Sonnet 4.5). Claude's strong MT reputation (WMT24 wins, blind professional ratings) is built entirely on high-resource European/East-Asian pairs. Confidence that this transfers to Malayalam is low and should be communicated that way to the contributor.
  • Failure modes are intrinsic, not transient: agglutinative morphology, complex sandhi (word-joining), long compounds, native-script generation, and English code-mixing. Our content adds a second axis of difficulty — dense quantitative-economics and Python prose, a domain Malayalam corpora barely cover.
  • Scaling helps the data-scarce floor more than the high-resource ceiling, but does not close the gap. Relative gains from larger models are largest exactly where training data is thinnest, but no frontier model exceeds ~58% average on the hardest Indic languages. Bigger models reduce post-edit burden; they do not eliminate it.

2. Glossary strategy — the single most important finding

Real Kerala STEM and finance practice is a three-way split driven by how established each concept is — not the uniform full-translation pattern our zh-cn and fa glossaries use (those translate even acronyms, e.g. GDP → 国内生产总值 / تولید ناخالص داخلی). Forcing a native coinage for every imported term would produce technically-correct-but-unread Malayalam — the exact register gap that exists between official term banks and actual classroom/Wikipedia usage. Kerala STEM education genuinely code-mixes English terms ("Manglish" is a real, prevalent register, not sloppiness).

| Category (by context) | Treatment | Example |
| --- | --- | --- |
| Established economics / finance / macro-micro concepts | translate | inflation → പണപ്പെരുപ്പം, demand → ചോദനം, interest rate → പലിശനിരക്ക് |
| Established core-math concepts | translate | determinant → സാരണികം, transpose → പക്ഷാന്തരിതം, row/column → നിര/വരി |
| Imported / computing / named-method terms | transliterate (Malayalam script) | matrix → മാട്രിക്സ്, function → ഫങ്ഷൻ, algorithm → അൽഗൊരിതം |
| Proper names (economists, mathematicians; ~40 entries) | transliterate | Lionel Robbins → ലൊയ്നൽ റോബിൻസ് (same policy fa/zh already apply) |
| Acronyms | native expansion as the ml value + transliterated short form | GDP → മൊത്ത ആഭ്യന്തര ഉത്പാദനം (ജി.ഡി.പി) |
| Code tokens, library/package names, file formats, keywords | keep-english (Latin) | numpy, pandas, def, for, import, AR(1) |

3. Indic-specialist models (IndicTrans2 / Sarvam) — auxiliary, never in the pipeline

The tool is Claude-only and structure-preserving by design, and that constraint is correct. IndicTrans2 (AI4Bharat, MIT-licensed, covers Malayalam) and similar are plain-sentence MT systems with no notion of MyST structure, code fences, math, or directives — wiring them into the pipeline would break the document-preservation guarantees that are the whole point of the tool. They do add value offline, at authoring time: seeding the ml glossary draft, cross-checking each term against a Claude candidate (agreement → high confidence, divergence → flag for the human reviewer), and optionally back-translating ml→en as a QA spot-check. Treat them as a way to raise the draft floor and route reviewer attention — not as ground truth.
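The offline cross-check described above could look roughly like the sketch below. Every name in it (`TermCandidate`, `crossCheck`, `normalizeMl`) is hypothetical and not part of action-translation today, and the comparison is deliberately crude: it only applies Unicode NFC and whitespace normalization before testing for equality.

```typescript
// Hypothetical types: nothing here exists in the repository yet.
type Confidence = "high" | "needs-review";

interface TermCandidate {
  en: string;
  claude: string;     // candidate proposed by Claude
  indicTrans: string; // candidate produced offline by IndicTrans2
}

interface CrossCheckResult {
  en: string;
  ml: string;
  confidence: Confidence;
}

// Crude normalization: NFC composition plus whitespace collapsing.
// It does not reconcile alternative Malayalam spellings or encodings.
function normalizeMl(s: string): string {
  return s.normalize("NFC").replace(/\s+/g, " ").trim();
}

// Agreement between the two systems raises confidence;
// divergence routes the term to the human reviewer.
function crossCheck(c: TermCandidate): CrossCheckResult {
  const claude = normalizeMl(c.claude);
  const agree = claude === normalizeMl(c.indicTrans);
  return { en: c.en, ml: claude, confidence: agree ? "high" : "needs-review" };
}
```

Even a "high" result is only a signal for where to spend review time; the native reviewer still decides.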

4. Community Malayalam LLMs and datasets (assessed, not recommended as the translator)

  • MalayaLLM (LLaMA-2 7B / Gemma-2 9B community fine-tune): not viable as the translator — too small for faithful technical translation, no structure handling, no published benchmarks, and self-hosting would require re-architecting the Claude-only pipeline. At most a reviewer-side second opinion.
  • Parallel corpora (Samanantar ~49.7M pairs, BPCC ~230M): mainly useful as glossary-mining sources; their value is largely already baked into IndicTrans2. We are not training a model, so raw scale is low-value to us.
  • University of Surrey / TechLiebe en→ml dataset (~8k segments, finance/legal/news, with human scores, post-edits, and an error taxonomy): the best-matched single resource for the contributor's finance focus — useful for glossary seeding, few-shot exemplars, and as a ready-made review rubric. Public release was slated for April 2026; confirm availability and license before use.

Verification note

The translation-quality claims above were independently fact-checked against primary sources (Anthropic docs, MILU/IndicGenBench/FLORES papers). One MILU framing was downgraded from "supported" to "overstated" (it over-attributed Malayalam's score to agglutinative morphology, which the paper names only for Tamil/Telugu). The Claude-on-Malayalam gap and the en→ml weakness are well-supported; the absolute Claude-ml quality remains an inference, not a measurement.

Sources

  • Anthropic multilingual support docs (14-language MMLU-understanding table; Malayalam absent)
  • MILU — Multi-task Indic Language Understanding, NAACL 2025 (arXiv 2411.02538)
  • IndicGenBench, ACL 2024 (arXiv 2404.16816)
  • "Assessing Translation capabilities of LLMs…", EAMT 2024 (arXiv 2311.09216)
  • IndicTrans2 (arXiv 2305.16307; https://github.com/AI4Bharat/IndicTrans2)
  • Analysis of Indic capabilities / IndicParam scaling (arXiv 2501.13912, 2512.00333)
  • Malayalam Wikipedia term conventions; Wiktionary "About Malayalam"; Kerala Bhasha Institute / CSTT term banks; fincash Malayalam financial-literacy material
  • MalayaLLM (https://github.com/VishnuPJ/MalayaLLM); Surrey/TechLiebe en→ml dataset (TechXplore, Jan 2026)

How we might best support Malayalam in this tool

Model: per-language Opus override

Default ml to Opus (claude-opus-4-8) rather than inheriting the global claude-sonnet-4-6. The cost premium (~1.7×) is negligible at one contributor's volume, scaling buys the most where training data is thinnest, and the dominant cost here is the contributor's review time — which better raw output directly reduces. Keep Sonnet for the proven zh-cn/fa. This needs:

  • A per-language model override (model is currently a single global input in src/inputs.ts / action.yml), and
  • Adding claude-opus-4-8 to the VALID_MODEL_PATTERNS allowlist in src/inputs.ts (today it only warns past claude-opus-4-6; the warning is non-blocking but should be cleaned up).
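As a rough illustration of those two changes, the sketch below resolves a model per target language and checks it against an allowlist. The real shape of `VALID_MODEL_PATTERNS` in src/inputs.ts may well differ; the regexes, the `MODEL_OVERRIDES` table, and both function names are assumptions made for this example.

```typescript
// Assumed shape: a list of regexes. The actual allowlist in src/inputs.ts may differ.
const VALID_MODEL_PATTERNS: RegExp[] = [
  /^claude-sonnet-4-\d+$/,
  /^claude-opus-4-\d+$/,
];

// Hypothetical per-language override table.
// Languages that are not listed inherit the global model input.
const MODEL_OVERRIDES = new Map<string, string>([
  ["ml", "claude-opus-4-8"],
]);

// Pick the override for this language if one exists, else the global model.
function resolveModel(targetLanguage: string, globalModel: string): string {
  return MODEL_OVERRIDES.get(targetLanguage) ?? globalModel;
}

// True when the model id matches at least one allowlist pattern.
function isKnownModel(model: string): boolean {
  return VALID_MODEL_PATTERNS.some((pattern) => pattern.test(model));
}
```

With this shape, `resolveModel("ml", "claude-sonnet-4-6")` yields the Opus id while zh-cn and fa keep inheriting the global Sonnet.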

Glossary: redesign, don't clone

Add an ml-specific per-term treatment field (translate | transliterate | keep-english) to the glossary objects, defaulted by context, applying the three-way split in the table above. This is an additive change to the existing {en, <lang>, context} schema; zh-cn/fa keep their current shape. Land glossary/ml.json as a 🟡 machine draft via the established draft-then-native-review workflow, then route to the contributor for review before promoting to ✅.
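A minimal sketch of what the additive schema change might look like. The `{en, ml, context}` fields follow the existing shape described above; the optional `treatment` field is the proposal, while the context labels, the default table, and the fallback to "translate" are assumptions for illustration only.

```typescript
type Treatment = "translate" | "transliterate" | "keep-english";

// Existing shape is {en, <lang>, context}; `treatment` is the proposed optional addition.
interface MlGlossaryEntry {
  en: string;
  ml: string;
  context: string;
  treatment?: Treatment;
}

// Hypothetical context labels; the real glossary contexts may be named differently.
const DEFAULT_TREATMENT_BY_CONTEXT = new Map<string, Treatment>([
  ["economics", "translate"],
  ["mathematics", "translate"],
  ["computing", "transliterate"],
  ["proper-name", "transliterate"],
  ["acronym", "translate"], // native expansion, with the short form kept inside the ml value
  ["code", "keep-english"],
]);

// An explicit per-term treatment wins; otherwise default by context.
// Falling back to "translate" for unknown contexts is an assumption of this sketch.
function effectiveTreatment(entry: MlGlossaryEntry): Treatment {
  return entry.treatment ?? DEFAULT_TREATMENT_BY_CONTEXT.get(entry.context) ?? "translate";
}

const examples: MlGlossaryEntry[] = [
  { en: "inflation", ml: "പണപ്പെരുപ്പം", context: "economics" },
  // A math-context term that is transliterated needs the explicit override.
  { en: "matrix", ml: "മാട്രിക്സ്", context: "mathematics", treatment: "transliterate" },
  { en: "numpy", ml: "numpy", context: "code" },
];
```

Because `treatment` is optional, entries without it keep validating against the current schema.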

Process: native review is the quality gate

For a language this under-served by existing MT resources, with zero published Claude baseline, native-speaker review is not a nicety — it is the quality bar. Wire the contributor's review in as a required step, treat the first lectures as a calibration batch, and use reviewer edit-effort as our de-facto quality metric (since none is published). Feed their corrections back into the glossary.
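One possible way to quantify reviewer edit-effort is a normalized edit distance between the machine draft and the reviewed text, sketched below. This is an assumed proxy, not an established metric for this project: it counts Unicode code points, so an edit to a single Malayalam conjunct (which spans several code points) is weighted more heavily than a reader might expect.

```typescript
// Levenshtein distance over arrays of code points, using two rolling rows.
function editDistance(a: string[], b: string[]): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Edit effort in [0, 1]: 0 means the reviewer changed nothing,
// 1 means the reviewed text shares no aligned code points with the draft.
function editEffort(draft: string, reviewed: string): number {
  const a = Array.from(draft.normalize("NFC"));
  const b = Array.from(reviewed.normalize("NFC"));
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : editDistance(a, b) / longest;
}
```

Tracked per lecture across the calibration batch, a falling score would suggest that the glossary corrections are being picked up.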

Keep the pipeline Claude-only and structure-safe

Use IndicTrans2 offline to seed/cross-check the glossary; never in the live path. Add a lint check that Latin code tokens, package names, and file formats survive verbatim: the pressure to produce morphologically correct Malayalam, where suffixes attach directly to the preceding word, makes it likelier than for fa/zh that such tokens get inflected or transliterated.
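The lint could start as something like the sketch below, which compares inline code spans between the English source and the Malayalam output. It is an assumption-laden starting point: both function names are hypothetical, only single-backtick inline spans are handled, and fenced blocks, directives, and math would need their own checks.

```typescript
// Collect the contents of inline code spans (`...`) from a Markdown/MyST string.
// Fenced code blocks are intentionally not handled here.
function inlineCodeTokens(text: string): string[] {
  const tokens: string[] = [];
  const re = /`([^`\n]+)`/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    tokens.push(match[1]);
  }
  return tokens;
}

// Return each source token (deduplicated) that does not appear
// verbatim as an inline code span in the translated text.
function missingCodeTokens(source: string, translated: string): string[] {
  const translatedTokens = new Set(inlineCodeTokens(translated));
  const missing = inlineCodeTokens(source).filter((t) => !translatedTokens.has(t));
  return Array.from(new Set(missing));
}
```

A non-empty result would fail the check and name the tokens that were altered, for example a package name that came back transliterated into Malayalam script.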


Proposed implementation checklist

  • Add an 'ml' entry to LANGUAGE_CONFIGS in src/language-config.ts (Malayalam punctuation/register rules)
  • Add treatment field support for ml glossary terms (schema + translator prompt wiring)
  • Generate glossary/ml.json as a machine draft (357-term set) using the three-way treatment policy; seed/cross-check with IndicTrans2
  • Add a per-language model override and default ml to claude-opus-4-8; add it to the model allowlist
  • npm run build to bundle dist-action/glossary/ml.json (so the dist-freshness CI guard passes and the loader auto-resolves it)
  • Add a code/structure lint (Latin code tokens survive verbatim)
  • Tests + docs (docs/user/language-config.md, docs/user/glossary.md, glossary/README.md)
  • Route the draft glossary + first lectures to the native-speaker contributor for review (required gate)

Open questions for maintainers

  • Is a per-language model override worth adding now, or do we ship ml on a globally-overridden Opus run for the first batch and generalize later?
  • Do we want the treatment field to be ml-only, or is it worth generalizing the glossary schema across languages?
  • Should we wait on the Surrey en→ml dataset (license TBD) before drafting the glossary, or draft now and reconcile later?
