A living, open registry of AI Data — built by and for Latin America.
Open source. Open data. Anyone can add a record.
The "database" is a single file: data.json. It catalogs datasets — corpora used to train or evaluate AI systems, especially LATAM-language data.
Records are indexed by the task they support (transcribe, …), the medium fed in (audio, text, image, …), the medium produced (text, audio, …), the domain (general, medical, legal, finance, …), and the language. The frontend at datahub.html is a searchable explorer with an animated hero and a machine-readable signal view.
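As a rough illustration of how these axes can narrow the catalog, the sketch below filters an in-memory list of records by any combination of task, input type, output type, domain, and language. The helper name `filterRecords` and the sample records are hypothetical and are not taken from datahub.html; only the field names follow the record shape in data.json.

```javascript
// Hypothetical helper: not the actual explorer code in datahub.html.
// Keeps the records that match every axis present in `query`.
function filterRecords(records, query = {}) {
  const scalarAxes = ["task", "input_type", "output_type", "domain"];
  return records.filter((record) => {
    // Scalar axes must match exactly when the query specifies them.
    for (const axis of scalarAxes) {
      if (query[axis] !== undefined && record[axis] !== query[axis]) return false;
    }
    // `languages` is an array on each record; match if it contains the requested code.
    if (query.language !== undefined && !record.languages.includes(query.language)) {
      return false;
    }
    return true;
  });
}

// Made-up sample records, for illustration only.
const sampleRecords = [
  { id: 1, task: "transcribe", input_type: "audio", output_type: "text", domain: "general", languages: ["es-AR"] },
  { id: 2, task: "transcribe", input_type: "audio", output_type: "text", domain: "medical", languages: ["pt-BR"] },
];

const argentinianSpeech = filterRecords(sampleRecords, { task: "transcribe", language: "es-AR" });
console.log(argentinianSpeech.map((r) => r.id)); // [ 1 ]
```

An empty query returns every record, so each axis only ever narrows the result.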
The languages vocabulary is the most visible signal of this stance:
- One Spanish variant per LATAM country (`es-AR`, `es-BO`, `es-CL`, `es-CO`, `es-MX`, `es-PE`, `es-UY`, `es-VE`, …)
- Brazilian Portuguese (`pt-BR`)
- Coarse fallbacks (`es`, `pt`) for datasets where the source page doesn't resolve sub-variants: a record reading `["es"]` means "Spanish, breakdown not stated"; `["es-AR", "es-MX"]` means "specifically Argentinian and Mexican Spanish"
- The major indigenous languages of the region (`qu` Quechua, `gn` Guarani, `ay` Aymara)
- Utility values: `en`, `Multilingual` (multiple different languages, not multiple Spanish variants), `N/A`
European Spanish and European Portuguese are intentionally out of scope. If a system was built for or evaluated on a LATAM variant, it belongs here. If you want to add an indigenous language we missed, open a PR — the vocabulary is meant to grow.
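To make the vocabulary groups above concrete, here is a minimal sketch that sorts a language code into one of them. The function `classifyLanguage` and its category labels are hypothetical; only the codes themselves come from this README. The pattern for Spanish variants checks the shape of the code only, so whether a code is actually allowed is still decided by the `languages` list in data.json.

```javascript
// Hypothetical classifier for entries of a record's `languages` array.
// Category labels are illustrative; only the codes come from the README.
const INDIGENOUS = new Set(["qu", "gn", "ay"]);
const COARSE_FALLBACKS = new Set(["es", "pt"]);
const UTILITY = new Set(["en", "Multilingual", "N/A"]);

function classifyLanguage(code) {
  if (INDIGENOUS.has(code)) return "indigenous";
  if (COARSE_FALLBACKS.has(code)) return "coarse-fallback"; // variant breakdown not stated
  if (UTILITY.has(code)) return "utility";
  if (code === "pt-BR") return "brazilian-portuguese";
  // Shape check only (es-AR, es-MX, ...); it does not decide scope.
  if (/^es-[A-Z]{2}$/.test(code)) return "spanish-variant";
  return "unknown";
}

console.log(classifyLanguage("es")); // coarse-fallback
console.log(["es-AR", "es-MX"].map((code) => classifyLanguage(code)));
// [ 'spanish-variant', 'spanish-variant' ]
```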
python3 -m http.server 8000

Open http://localhost:8000/datahub.html.
No build step, no node_modules, no bundler. The page bootstraps React, Babel, and GSAP from CDN and renders directly.
data.json is the source of truth. The shape is the ontology: top-level vocabulary lists declare what's allowed, and records reference values from those lists.
Every record is a dataset; that's the only kind. `task` is the verb the dataset trains or evaluates (the action). `input_type` is what the trained system consumes; `output_type` is what it produces. `domain` is a knowledge grouping, not a technical bucket. Derived values come from the top-level lists, not from records: the hero stats are `domains.length` / `languages.length` / `contributing_organizations.length`. Add a record whose `task` is not yet in `tasks_supported` and the validator will tell you to add the verb to the list first.
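The derived hero stats described above can be sketched as follows. The helper name `heroStats`, the keys of the returned object, and the sample data are assumptions made for illustration; the real computation lives in the frontend and may be written differently.

```javascript
// Hypothetical helper: derives the three hero numbers from the top-level lists only.
function heroStats(data) {
  return {
    domains: data.domains.length,
    languages: data.languages.length,
    organizations: data.contributing_organizations.length,
  };
}

// Made-up sample shaped like data.json.
const sampleData = {
  domains: ["general", "medical", "legal", "finance"],
  languages: ["es-AR", "pt-BR", "qu"],
  contributing_organizations: [{ name: "Mozilla Foundation", logo: null }],
  records: [], // intentionally empty: the stats do not depend on records
};

console.log(heroStats(sampleData)); // { domains: 4, languages: 3, organizations: 1 }
```

Because the numbers come from the vocabulary lists, adding a value to a list changes the hero even before any record uses it.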
We especially welcome:
- LATAM datasets — corpora in any es-XX variant, pt-BR, or indigenous languages
- Domain-specific corpora in health, law, finance, and related fields from the region
Steps:
- Open `data.json`.
- If your new record uses a task, input_type, output_type, domain, language, license, or organization that doesn't exist yet, add it to the corresponding top-level list first.
- Append the record to `records` with a new `id`.
- Run the validator:

      node scripts/validate-data.mjs

  It exits 0 on success. On any vocabulary miss it prints exactly which record, which field, and which vocabulary is involved.
- Open a PR following `CONTRIBUTING.md`.
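For intuition about what the validator step checks, here is a minimal sketch of a vocabulary check. It is not the code in scripts/validate-data.mjs, which may be structured differently. The names `FIELD_TO_VOCABULARY` and `findVocabularyMisses` are hypothetical, and the field-to-list mapping is an assumption inferred from the record shape in data.json.

```javascript
// Illustrative only: scripts/validate-data.mjs may work differently.
// Assumed mapping from each record field to the top-level list it must reference.
const FIELD_TO_VOCABULARY = {
  task: "tasks_supported",
  input_type: "input_type",
  output_type: "output_type",
  domain: "domains",
  languages: "languages",
  license: "licenses",
  organization: "contributing_organizations",
};

function findVocabularyMisses(data) {
  const misses = [];
  for (const record of data.records) {
    for (const [field, vocabulary] of Object.entries(FIELD_TO_VOCABULARY)) {
      // Organizations are objects with a `name`; the other vocabularies are plain strings.
      const allowed = data[vocabulary].map((entry) =>
        typeof entry === "string" ? entry : entry.name
      );
      // `languages` is an array on the record; wrap scalars so both cases share one loop.
      const values = Array.isArray(record[field]) ? record[field] : [record[field]];
      for (const value of values) {
        if (!allowed.includes(value)) {
          misses.push({ id: record.id, field, value, vocabulary });
        }
      }
    }
  }
  return misses;
}

// Made-up data: record 2 uses a task that is missing from tasks_supported.
const draftData = {
  tasks_supported: ["transcribe"],
  input_type: ["audio", "text"],
  output_type: ["text", "audio"],
  domains: ["general"],
  languages: ["es-AR", "pt-BR"],
  licenses: ["CC0"],
  contributing_organizations: [{ name: "Mozilla Foundation", logo: null }],
  records: [
    { id: 1, task: "transcribe", input_type: "audio", output_type: "text", domain: "general",
      languages: ["es-AR"], license: "CC0", organization: "Mozilla Foundation" },
    { id: 2, task: "translate", input_type: "text", output_type: "text", domain: "general",
      languages: ["pt-BR"], license: "CC0", organization: "Mozilla Foundation" },
  ],
};

for (const miss of findVocabularyMisses(draftData)) {
  console.log(`record ${miss.id}: field "${miss.field}" has value "${miss.value}", not in "${miss.vocabulary}"`);
}
```

A command-line wrapper around a check like this would exit non-zero whenever the list of misses is non-empty, which is the behavior the pre-commit hook and build rely on.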
If you're adding a HuggingFace dataset or a Mozilla Data Collective entry and use Claude Code, the url-to-dataset-record skill will fetch the dataset card and draft the record for you.
bash scripts/install-hooks.sh

This points git at the versioned hooks/ directory. From then on, every commit runs the validator and rejects inconsistent data. The validator also runs as the first step of build.sh, so Cloudflare Pages builds fail loudly on broken data.
The project deploys to Cloudflare Pages via the workflow in .github/workflows/. build.sh validates data.json and assembles dist/; any other static host (Netlify, GitHub Pages, S3 + CloudFront) works the same way — just serve the directory.
| File | Role |
|---|---|
| `datahub.html` | Main shell: React app, scroll choreography, dataset explorer, signal view |
| `hero.jsx` | Hero overlay: title, lat/long, stat cluster, typewriter terminal |
| `Living Layers.html` | Animated background, loaded as a full-bleed iframe |
| `tweaks-panel.jsx` | Design-token utilities used by the panel |
| `uploads/hero-bg.png` | Background photograph (2048×1153) |
| `data.json` | The "db": top-level vocabularies + records |
| `scripts/validate-data.mjs` | Consistency validator (run on commit + build) |
| `scripts/install-hooks.sh` | One-time `git config core.hooksPath hooks` |
| `hooks/pre-commit` | Runs the validator before each commit |
| `build.sh` | Validates `data.json`, populates `dist/` for Cloudflare Pages |
Deep docs live in docs/, organized by reader mode (Diataxis):
| Doc | When to read |
|---|---|
| Tutorial — your first contribution | You're new and want a hands-on walkthrough from clone to PR |
| How to add a record | You know the basics and want a task-focused reference for contributing |
| Schema reference | You want the exact rules: every field, every vocabulary, every validator check |
| Why the schema looks this way | You want the design rationale — LATAM-first, ontology-in-data, validator at commit + build |
Open project. Open registry. PRs welcome from anywhere, with a strong preference for work that increases LATAM visibility. If you maintain a model, dataset, or system that fits and isn't here yet — add it.
    {
      // primary navigational axes
      "tasks_supported": ["transcribe", "..."],
      "input_type": ["audio", "text", "..."],        // medium fed in
      "output_type": ["text", "audio", "..."],       // medium produced
      "domains": ["general", "medical", "legal", "finance"],
      "languages": ["es-AR", "es-BO", "es-CL", "...", "pt-BR", "qu", "gn", "ay", "en", "Multilingual", "N/A"],

      // descriptive attribute vocabularies
      "licenses": ["CC0", "CC-BY-SA 4.0", "GPL-3.0", "..."],
      "contributing_organizations": [
        { "name": "Mozilla Foundation", "logo": null }
      ],

      "records": [
        {
          "id": 1,
          "task": "transcribe",                      // must ∈ tasks_supported
          "input_type": "audio",                     // must ∈ input_type
          "output_type": "text",                     // must ∈ output_type
          "domain": "general",                       // must ∈ domains
          "languages": ["es-AR"],                    // array — every entry must ∈ languages
          "organization": "Universidad Nacional de La Plata", // must match contributing_organizations
          "license": "CC-BY-NC-SA 4.0",
          "model": "CordeBA",                        // dataset's display name
          "year": 2024,
          "source_url": "https://huggingface.co/datasets/marianbasti/cordeba",
          "description": "Spontaneous-speech corpus of informal conversations…"
        }
      ]
    }