Voynich Manuscript — Decipherment Repository (V3 current)

This repository contains three papers, all preserved and clearly labelled:

| Paper | Status | Where to read it |
|---|---|---|
| V1 (Feb 2026) | preserved unchanged for v1-statistic reproducibility | paper_v1_archived.md |
| V2 (May 2026, DOI 10.5281/zenodo.20023733) | preserved for v2 reproducibility | paper_v2.md / paper_v2.tex / paper_v2.pdf |
| V3 (May 2026) | current canonical paper | paper_v3.md |

For the full layout breakdown — which scripts belong to v1, which to v2/v3, what's shared — see PAPER_V1_VS_V2_LAYOUT.md.

Author: Kameldip Singh Basra (kameldipbasra@gmail.com)
Repository: https://github.com/kamb-code/Voynich
Zenodo V3 (this release): 10.5281/zenodo.20072618 (concept DOI 10.5281/zenodo.18598229)
Zenodo V2 (previous): 10.5281/zenodo.20023733
V3 release date: 2026-05-07

TL;DR

The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404–1438) is identified as a 15th-century Sri Lankan Elu-Sinhala pharmaceutical text — a working pharmacist's compressed reference recording Ayurvedic preparations in a bespoke abugida.

| Claim | Confidence |
|---|---|
| South Asian Indic substrate | ~97% |
| Sri Lankan provenance specifically | ~94% |
| Sinhala/Elu specifically (vs Pali sister) | ~91% |
| Pre-12c Elu chronolect | ~82% |
| Working-pharmacist register (vs literary canon) | ~98% |
| P(Sinhala identification wrong) | ~5-8% |

Decoder: V17 (scripts/v17_decoder.py), with daiiin → gena patch applied.

Strongest single evidence streams:

  1. Cross-corpus hostile-reviewer test: Sri Lankan medical 33.9-36.3% vs pan-Indic Sanskrit medical 10.2-10.7% under 50,000-token size-matched control (~3.4× ratio at the mean, about 2.2× the pre-registered ≥1.5× criterion)
  2. Parallel-recipe template matching: f75r line 38 V17 output q-keda q-keda q-keda q-keda q-keda lada is structurally identical to Bodleian MS Sinh.a.2(R) kalandayi kalandayi… enumeration; Bonferroni-corrected P ≈ 10⁻⁷ over ~26,000 line × template comparisons
  3. External state-marker grounding: leda ("disease") attested 60× in K.D. Somadasa 1996 Wellcome catalogue (33 distinct disease-stems); seda ("fomentation") in DPD Pali + Caraka 126× sveda compounds
  4. Gaskell classifier replication: V17-decoded Voynich crosses to meaningful (P=58.9%); authentic Sinhala recipe text classifies as gibberish (P=43.7%); raw Voynich classifies as gibberish (P=24-34%, replicating Gaskell's published result)
  5. Vinaya / Samantapāsādikā falsification probes: 0 of 460,000+ Pali canonical/commentary tokens match VPNS state-markers; <3% type overlap on the medicines-specific subsection
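
The arithmetic behind items 1 and 2 can be checked from the quoted figures alone. The sketch below is not part of the repository's scripts. It assumes the headline ratio is the mean of the two Sri Lankan figures divided by the mean of the two pan-Indic figures, and that the Bonferroni correction is the standard one (raw p multiplied by the number of comparisons, capped at 1). All variable names are illustrative.

```python
# Figures quoted in the evidence list above (percent overlap, size-matched).
sl_overlap = (33.9, 36.3)         # Sri Lankan medical corpora
pan_indic_overlap = (10.2, 10.7)  # pan-Indic Sanskrit medical corpora
criterion = 1.5                   # pre-registered minimum ratio


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Assumption: the headline ratio compares the two means.
ratio = mean(sl_overlap) / mean(pan_indic_overlap)
margin = ratio / criterion  # how many times over the criterion the ratio is


def bonferroni(raw_p, n_comparisons):
    """Standard Bonferroni correction: scale raw p by the number of tests."""
    return min(1.0, raw_p * n_comparisons)


# A corrected P of about 1e-7 over about 26,000 comparisons implies this raw p.
n_comparisons = 26_000
corrected_p = 1e-7
implied_raw_p = corrected_p / n_comparisons

print(f"ratio = {ratio:.2f}x, margin over criterion = {margin:.2f}x")
print(f"implied raw p = {implied_raw_p:.2e}")
```

The ratio and raw p printed here depend only on the rounded figures quoted in this README, so they are approximate.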

What's in this folder (matches existing GitHub Paper/ structure)

release_v2/                                ~63 MB total
├── README.md                              ← you are here
├── MANIFEST.md, REPRODUCTION.md, UPLOAD_INSTRUCTIONS.md  ← release docs
├── LICENSE (CC-BY-4.0), CITATION.cff, .zenodo.json       ← academic metadata
├── AUDIT_NOTES.md                         ← v1 audit (preserved)
├── smoke_test.py                          ← end-to-end validation
├── run_all.sh                             ← full validation rebuild
│
├── main.tex                               ← ★ CURRENT PAPER (LaTeX source, paper v2)
├── main.pdf                               ← ★ CURRENT PAPER (PDF, 22 pages, A4)
├── paper.md                               ← ★ CURRENT PAPER (markdown source)
├── paper_v2.tex / paper_v2.pdf            ← same as main.tex/pdf, alternate names
├── paper_v1_archived.md                   ← Feb 2026 paper, preserved for v1-statistic reproducibility
├── references.bib                         ← bibliography (corrected Gaskell-Bowern citation)
│
├── scripts/                               ← 32 Python scripts (decoders, analysis, tests, translation)
├── data/                                  ← EVA transcription, vocabularies, dictionaries (9 files)
├── translation/                           ← V17 corpus DB + translation outputs (4 files)
├── supplementary/                         ← 39 substantive analysis writeups + reviewer packages
├── references/medical_corpus/             ← cleaned comparison corpora (Sarartha, BM, Vinaya, Samantapāsādikā, chronicles, Niganduwa, 8 pan-Indic)
└── results/                               ← validation outputs from run_all.sh

Top-level layout matches the existing Paper/ directory in https://github.com/kamb-code/Voynich for drop-in replacement.

Total: ~63 MB on disk; 142 files. See MANIFEST.md for complete inventory.
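
A downloaded copy can be compared against the file count and size stated above with a short inventory walk. This is a minimal sketch, not a script shipped in the release; the function name `inventory` is hypothetical, and the directory name passed to it is whatever the release was unpacked into.

```python
import os


def inventory(root):
    """Return (file_count, total_bytes) for every regular file under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                count += 1
                total += os.path.getsize(path)
    return count, total


# Example call (assumes the release was unpacked as release_v2/):
#   files, size = inventory("release_v2")
#   print(f"{files} files, {size / 1e6:.1f} MB")
```

Counts may differ slightly from MANIFEST.md if local build products (for example LaTeX auxiliary files or `__pycache__` directories) are present.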

Quick reproduction

# Clone or download release
cd release_v2/

# Validate end-to-end (5 sec)
python3 smoke_test.py
# Expected: ✓ ALL SMOKE TESTS PASSED

# Run the canonical decoder
python3 -c "
import sys; sys.path.insert(0, 'scripts')
from v17_decoder import decode_v17
print(decode_v17('qokeedy'))  # → q-keda
"

# Reproduce the hostile-reviewer cross-corpus test (size-matched figures)
python3 scripts/hostile_reviewer/cross_corpus_analysis.py

# Reproduce the Bowern-suite metrics
python3 scripts/bowern_suite_metrics.py

# Generate the V17 translation (~30 sec)
python3 scripts/translate_book_v17.py

# Compile the paper (xelatex required)
xelatex main.tex && xelatex main.tex

Full replication instructions in REPRODUCTION.md.
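
The Python steps above can also be chained so that a failure stops the run early. The sketch below is an illustration rather than a replacement for run_all.sh, whose contents are not shown here. It uses only the script paths listed above; the names `STEPS` and `run_steps` are hypothetical.

```python
import subprocess
import sys

# Script paths are the ones listed in the quick-reproduction commands;
# run from inside the release directory.
STEPS = [
    [sys.executable, "smoke_test.py"],
    [sys.executable, "scripts/hostile_reviewer/cross_corpus_analysis.py"],
    [sys.executable, "scripts/bowern_suite_metrics.py"],
    [sys.executable, "scripts/translate_book_v17.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; return the first failing command, or None."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_steps(STEPS)` from the release directory returns `None` when every script exits with status 0, and otherwise returns the command that failed. The `runner` parameter exists only so the function can be exercised without executing the real scripts.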

What was added in v2 (since paper v1, February 2026)

  • V17 decoder canonical (resolves paper v1 §15 "primary open question": u-prefix anomaly 13.8% → 5.0%)
  • Bowern-Gaskell engagement (§4.13): 6-metric Bowern-suite + replicated random-forest classifier; recipe-register confound demonstrated
  • Hostile-reviewer cross-corpus test: 3.4× SL/pan-Indic ratio under size-matched control
  • Two falsification probes passed: Vinaya Bhesajjakkhandhaka 0/195K + Samantapāsādikā 0/265K VPNS markers
  • External state-marker grounding: leda (Somadasa 60×, 33 disease-stems) + seda (DPD + Caraka 126×)
  • Compositional grounding: V17 X-leda 14 types matches Somadasa 33 disease-stems; V17 X-seda 9 types matches Caraka 9 sveda compounds
  • Cūḷavaṃsa 37.146 primary-source documentation of Buddhadāsa's medical compendium (Sinhalese tradition identifies this with the Sārārtha Saṃgrahaya, our 36.3% size-matched-overlap source)
  • Elu chronolect revalidation: 81% Elu-native, 0% post-12c Sanskrit loans in LOCKED vocabulary
  • Bhesajjamañjūsā re-OCR: 27.5% (corrupted Devanagari OCR) → 33.9% (clean pdftotext, size-matched)
  • q-/ch- as phonologically-conditioned allomorphs of one deictic morpheme (revised 2026-05-04 from "two morphemes — definite article + demonstrative")
  • Register-specific grammar: formal BNF, slot-occupancy statistics, ~14% residue
  • Polysemy disambiguation: section/collocation rules for ura/gara/kara/meda/etc.
  • Scribal normalization: ~5,200 tokens get clearer reading; line-initial a-/g- strip eliminates ~30 phantom lexemes
  • Parallel-recipe template matching: 10-15 V17 lines map at ≥70% slot-fill to attested templates; Match #1 (f75r L38) is structurally identical to Bodleian enumeration
  • Botanist's review packet: all 112 herbal folios with Beinecke IIIF image URLs + tentative IDs + tailored questions
  • Specialist outreach package: 17 named scholars across 3 disciplines + email templates
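
The size-matched control behind the cross-corpus figures in the list above can be illustrated with a small sketch. This is not the code in scripts/hostile_reviewer/cross_corpus_analysis.py. It assumes that "overlap" means the share of decoded vocabulary types that occur in a fixed-size random sample of each comparison corpus; the function name, the random sampling strategy, and the seed are all illustrative assumptions.

```python
import random


def size_matched_overlap(vocab, corpus_tokens, sample_size=50_000, seed=0):
    """Percent of vocab types found in a fixed-size sample of corpus tokens.

    Assumption: sampling every corpus down to the same token count stops a
    larger corpus from scoring higher simply because it contains more words.
    """
    vocab_types = set(vocab)
    if not vocab_types:
        raise ValueError("vocab is empty")
    rng = random.Random(seed)
    if len(corpus_tokens) > sample_size:
        sample = rng.sample(corpus_tokens, sample_size)
    else:
        sample = corpus_tokens
    sample_types = set(sample)
    hits = sum(1 for word in vocab_types if word in sample_types)
    return 100.0 * hits / len(vocab_types)


# Toy data only; the real corpora are under references/medical_corpus/.
toy_vocab = ["leda", "seda", "xyz", "abc"]
toy_corpus = ["leda", "seda", "kara"] * 10
print(size_matched_overlap(toy_vocab, toy_corpus))
```

With real data, the same function would be applied to each corpus with an identical `sample_size`, and the resulting percentages compared as a ratio.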

Honest limitations

The paper does not yet establish the following:

  1. No Sinhala/Elu specialist linguistic validation of the decoded prose. Materials are prepared (SPECIALIST_OUTREACH_PACKAGE.md); the author's decision on outreach is pending.
  2. No trained-eye botanical verification. Materials prepared (BOTANIST_DOSSIER.md).
  3. The VPNS-specific kalpana taxonomy is only partially validated: 2 of 18 state-markers (leda, seda) are externally grounded, and the symmetric 12×2 extension across all base classes is the project's own extrapolation.
  4. Sister-language indistinguishability: Pali, Maharashtri, Konkani at typological level cannot be ruled out by lexical evidence alone; the Bhesajjamañjūsā 33.9% size-matched figure is comparable to Sārārtha 36.3%, and the substrate-labeling question (Sinhala/Elu working notation vs Sri Lankan Sinhala-Pali medical register) remains the genuine open question.
  5. The decoded text is recipe-register, not narrative prose. Lines read as compressed pharmacy notation (operators + preparation classes + state markers + targets), not English-translatable sentences. See REGISTER_SAMPLES.md for the correctly-framed examples.

Citation

@misc{basra2026voynich,
  title  = {A Candidate Decipherment of the Voynich Manuscript: Evidence for a Phonetic Transcription of Spoken Elu-Sinhala (V2)},
  author = {Basra, Kameldip Singh},
  year   = {2026},
  month  = {May},
  doi    = {10.5281/zenodo.20023733},
  url    = {https://doi.org/10.5281/zenodo.20023733},
  note   = {Concept DOI 10.5281/zenodo.18598229 resolves to most recent version}
}

Acknowledgments

Beinecke Rare Book and Manuscript Library (digital access to MS 408); Stolfi, Takahashi, and the Voynich research community (foundational EVA transcription); Daniel Gaskell (open-source release of his random-forest classifier code, github.com/danielgaskell/voynich, made the §4.13 replication possible). The Buddhist temples of Sri Lanka, whose inscriptions provided the visual spark for this investigation. Anthropic Claude Opus as AI coding assistant.