lachlanchen/ImagizedLanguageModel

English · العربية · Español · Français · 日本語 · 한국어 · Tiếng Việt · 中文 (简体) · 中文(繁體) · Deutsch · Русский

Imagized Language Model (ILM)

ILM is a research codebase exploring text-as-image generation: it encodes language into compact, image-like tensors and generates text with diffusion-style iterative refinement. The representation factors sentences into meta-elements (grammar, semantics, tone, emotion) and hierarchical, memory-like codes for words and characters. This unifies ideas from discrete diffusion, superposition/disentanglement, structured embeddings, and glyph-aware character modeling.

The repository intentionally keeps a practical etymology pipeline and long-horizon ILM experimentation side-by-side.
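As a toy illustration of the text-as-image idea (not the actual ILM encoder, whose representation is learned), the sketch below packs a sentence into a fixed-size, image-like uint8 tensor by writing each character's codepoint bytes into grid cells; the grid size and byte layout are invented for this example.

```python
import numpy as np

def imagize(text: str, grid: int = 8) -> np.ndarray:
    """Pack a sentence into a (grid, grid, 3) uint8 'image' of codepoints.

    Each cell stores one character's Unicode codepoint split across three
    bytes (big-endian), so CJK characters round-trip too. This is a toy
    stand-in for ILM's learned glyph encoding.
    """
    img = np.zeros((grid, grid, 3), dtype=np.uint8)
    for i, ch in enumerate(text[: grid * grid]):
        cp = ord(ch)
        img[i // grid, i % grid] = [(cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF]
    return img

def deimagize(img: np.ndarray) -> str:
    """Invert imagize(): reassemble codepoints and drop empty cells."""
    cells = img.reshape(-1, 3).astype(np.uint32)
    cps = (cells[:, 0] << 16) | (cells[:, 1] << 8) | cells[:, 2]
    return "".join(chr(c) for c in cps if c)

tensor = imagize("hello 中")
print(tensor.shape)       # (8, 8, 3)
print(deimagize(tensor))  # hello 中
```

The real codebase replaces this lossless byte packing with rendered glyphs and learned codes, but the round-trip property is the same design goal.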

📌 Overview

This repository has two active tracks:

  1. Historic Chinese glyph etymology ingestion (scraping/parsing/storage/preview).
  2. ILM glyph/image modeling experiments (token glyph rendering, codebooks, frame packing, diffusion/inpainting, evaluation/reporting).

This README documents both tracks, keeping the etymology workflow as a first-class, reproducible path.

🔗 Key Links

  • Conceptual write-up: docs/imagized-language-model.md
  • Code plan and metrics: docs/ilm-visual-diffusion-code-plan.md
  • Embedding "color" plan: docs/embedding-color-plan.md
  • Development notes/plan: docs/development-plan.md
  • Etymology module readme: ilm/etymology/README.md

✨ Features

  • 🏺 Etymology ingestion from hanziyuan and chineseetymology-style sources.
  • 🌐 Robust AJAX + HTML ingestion path with retries, throttling, and caching.
  • 🧩 Stage-labeled glyph extraction including <img> and CSS background-image data URIs.
  • 🗃️ SQLite-backed storage for chars/glyph metadata plus filesystem asset layout.
  • 🖥️ Tornado web UI for ad-hoc ingest + gallery preview.
  • 🔤 Glyph rendering utilities for multilingual token images.
  • 🧠 Product-code style embedding/codebook modules.
  • 🧱 Sentence frame packing and diffusion/inpainting training/evaluation scripts.
  • 📊 Reporting and visualization scripts for embedding and pipeline inspection.
  • 📄 Publication artifacts in LaTeX/PDF under publication/.
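The data-URI extraction feature above can be sketched with only the standard library; the regex and helper name here are illustrative, not the module's actual internals.

```python
import base64
import re

# Matches base64 data URIs wherever they appear: <img src="..."> attributes
# or CSS background-image:url(...) declarations.
DATA_URI_RE = re.compile(
    r"data:image/(?P<ext>png|gif|jpe?g|svg\+xml);base64,(?P<b64>[A-Za-z0-9+/=]+)"
)

def extract_data_uri_images(html: str) -> list[tuple[str, bytes]]:
    """Return (extension, raw bytes) pairs for every base64 data URI found."""
    out = []
    for m in DATA_URI_RE.finditer(html):
        ext = m.group("ext").replace("+xml", "")
        out.append((ext, base64.b64decode(m.group("b64"))))
    return out

html = '<div style="background-image:url(data:image/png;base64,iVBORw0KGgo=)"></div>'
for ext, blob in extract_data_uri_images(html):
    print(ext, len(blob))  # png 8  (the 8-byte PNG magic header)
```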

🧱 Project Structure

.
├── README.md
├── AGENTS.md
├── configs/
│   ├── color.yaml
│   └── diffusion.yaml
├── docs/
├── i18n/
├── ilm/
│   ├── code/
│   ├── data/
│   ├── datasets/
│   ├── db/
│   ├── diffusion/
│   ├── encoders/
│   ├── english_tiles/
│   ├── etymology/
│   ├── frames/
│   ├── models/
│   └── utils/
├── scripts/
├── publication/
├── assets/
├── logs/
└── *.ipynb

🧰 Prerequisites

  • Python 3.10+: core runtime
  • pip: package installation
  • GPU (optional): helpful for PyTorch CUDA training scripts
  • LaTeX toolchain (optional): needed for publication builds

Assumption note: there is currently no single root dependency lock/spec file (pyproject.toml, requirements.txt, etc.), so dependencies are inferred from imports and script usage.

⚙️ Installation

Minimal (etymology toolkit)

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install requests beautifulsoup4 tornado

Extended (modeling/training workflows)

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install requests beautifulsoup4 tornado pyyaml numpy pillow matplotlib torch

If a specific script needs additional packages, install them from the import error shown by that script.

🚀 Usage

Quick Start: Historic Glyph Ingestion (CLI)

  1. Hanziyuan (recommended): char-only AJAX flow
PYTHONPATH=. python scripts/ingest_etymology.py --site hanziyuan --char 中
  2. ChineseEtymology (direct URL)
PYTHONPATH=. python scripts/ingest_etymology.py --site chineseetymology --url "https://www.chineseetymology.org/CharacterEtymology.aspx?characterInput=%E4%B8%AD"
  3. Batch file ingestion (lines can be char\turl, url, or char url)
PYTHONPATH=. python scripts/ingest_etymology.py --from-file urls.txt

Outputs

  • Files: data/historic/glyphs/<char>/<stage>/<label>.<ext>
  • Cache: data/historic/cache/*.html
  • DB: data/historic/etymology.sqlite3
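The glyph asset layout can be reproduced with pathlib; the placeholders come straight from the table above, while the helper name and the "oracle" stage value are invented for illustration.

```python
from pathlib import Path

def glyph_path(root: str, char: str, stage: str, label: str, ext: str) -> Path:
    """Build data-root-relative path glyphs/<char>/<stage>/<label>.<ext>."""
    return Path(root) / "historic" / "glyphs" / char / stage / f"{label}.{ext}"

# "oracle" is a hypothetical stage name for this example.
p = glyph_path("data", "中", "oracle", "O0001", "png")
print(p.as_posix())  # data/historic/glyphs/中/oracle/O0001.png
```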

Web Demo (optional)

PYTHONPATH=. python scripts/serve_etymology.py

Open http://127.0.0.1:8888, choose a site, and enter a character (for example 中).

Polite Crawling and Site Respect

  • The fetcher uses per-host throttling, retries with backoff, and caching.
  • Keep delays >= 0.5s, avoid bursts, and honor site terms/robots/licensing.
  • Do not bypass paywalls or interactive protections.
  • If you see 403/429, slow down and retry later.
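The per-host throttling described above can be sketched as a tiny helper; this is an illustrative stand-in, not the repository's actual fetcher.

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float = 0.5):
        self.min_delay = min_delay
        self._last: dict[str, float] = {}  # host -> monotonic time of last call

    def wait(self, url: str) -> None:
        """Sleep just long enough to respect min_delay for this URL's host."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self._last.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last[host] = time.monotonic()

throttle = HostThrottle(min_delay=0.5)
for url in ["https://hanziyuan.net/a", "https://hanziyuan.net/b"]:
    throttle.wait(url)  # second iteration sleeps roughly 0.5 s
    # the actual HTTP fetch (with retries/backoff) would go here
```

A real fetcher would pair this with retry-with-backoff on 403/429 and an on-disk HTML cache, as the bullets above describe.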

Additional ILM Workflows

The following scripts are active parts of the repository, but they are research workflows and may require locally prepared datasets and checkpoints.

  1. Data download/prep
python scripts/download_alpaca.py --outdir data/raw
python scripts/download_corpora.py --out data/raw
python scripts/sample_paragraphs.py --out data/processed/test_100.jsonl
python scripts/build_images_common_freq.py --out data/processed/images_common_freq --size 128 --en 5000 --zh 5000
  2. Glyph DB lifecycle
python scripts/glyphdb_init.py --db data/glyphdb/glyphs.sqlite3
python scripts/glyphdb_ingest_index.py --db data/glyphdb/glyphs.sqlite3 --index data/processed/images_common_freq/index.tsv
  3. Code/color model training
python scripts/train_color_codes.py --config configs/color.yaml
python scripts/train_codes_from_qa.py --en-json data/raw/alpaca_en.json --zh-json data/raw/alpaca_zh.json --epochs 1
python scripts/train_ilmglyph_codes.py --en data/raw/alpaca_en.json --zh data/raw/alpaca_zh.json --out artifacts/ilm_glyph_train
  4. Diffusion/inpainting
python scripts/train_diffusion.py --config configs/diffusion.yaml
python scripts/train_inpaint_frames.py --ckpt-code artifacts/ilm_glyph_train/ckpt_epoch1.pt --out artifacts/inpaint
  5. Evaluation/reporting
python scripts/eval_color_codes.py --checkpoint artifacts/color_codes_e1.pt
python scripts/eval_diffusion.py --checkpoint artifacts/diffusion_unet.pt
python scripts/eval_qa_retrieval.py --checkpoint artifacts/color_codes_qa.pt
python scripts/report_ilmglyph_pipeline.py --ckpt artifacts/ilm_glyph_train/ckpt_epoch1.pt --lang en --text "hello world"

🧩 Configuration

Primary YAML configs:

  • configs/color.yaml

    • data path: data/processed/images_common_freq/index.tsv
    • model/code params: d_glyph, d_code, K, C, temperature/anneal
    • optimizer/log settings
  • configs/diffusion.yaml

    • input JSONL: data/processed/test_100.jsonl
    • frame/grid + model size settings
    • train mask ratio range and checkpoint settings

Override settings via CLI flags where supported (--epochs, --batch-size, --lr, etc.).
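The override pattern (YAML defaults, CLI flags win when passed) can be sketched with argparse; the keys and values below are examples, not the configs' exact schema, and the dict stands in for the result of loading a YAML file.

```python
import argparse

# Stand-in for yaml.safe_load(open("configs/color.yaml")); keys illustrative.
config = {"epochs": 10, "batch_size": 64, "lr": 1e-3}

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int)
parser.add_argument("--batch-size", type=int, dest="batch_size")
parser.add_argument("--lr", type=float)
args = parser.parse_args(["--lr", "5e-4"])  # as if only --lr were passed

# Flags that were actually supplied (not None) take precedence over YAML.
merged = dict(config)
merged.update({k: v for k, v in vars(args).items() if v is not None})
print(merged)  # {'epochs': 10, 'batch_size': 64, 'lr': 0.0005}
```

Leaving the argparse defaults as None is what makes "only override what the user passed" work cleanly.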

🧪 Examples

  • Build a single English tile glyph:
python scripts/build_english_tile_glyph.py "language" artifacts/language_tile --save-tensor
  • Run inpainting demo with trained checkpoints:
python scripts/inpaint_demo.py \
  --ckpt-code artifacts/ilm_glyph_train/ckpt_epoch1.pt \
  --ckpt-inpaint artifacts/inpaint/ckpt_epoch1.pt \
  --lang en \
  --text "the quick brown fox jumps" \
  --mode infill \
  --out artifacts/inpaint_demo
  • Bulk ingest common characters from Hanziyuan:
PYTHONPATH=. python scripts/bulk_ingest_hanziyuan.py --limit 200 --resume

📝 Development Notes

  • This is a research repository with both robust CLIs and exploratory artifacts (including notebooks and prototype scripts).
  • Generated large files are intended for data/ and artifacts/ (both ignored in .gitignore).
  • Publication source and PDFs are under publication/; helper build script: scripts/latex_build.sh.
  • Collaboration/process conventions are documented in AGENTS.md.

🛠️ Troubleshooting

  • ModuleNotFoundError: ilm...

    • Run scripts from repo root.
    • Use PYTHONPATH=. for scripts that expect local package resolution.
  • FileNotFoundError for data/index/checkpoints

    • Run prerequisite data/build scripts first.
    • Confirm defaults such as data/processed/images_common_freq/index.tsv and data/processed/test_100.jsonl exist.
  • CUDA/device issues

    • Switch to CPU with script flags/config (device: cpu or --device cpu).
  • Missing package errors

    • Install required dependency from the specific script import path (torch, pyyaml, Pillow, etc.).
  • HTTP 403 / 429 while scraping

    • Increase --delay, retry later, and keep requests polite.

🗺️ Roadmap

  • Continue maturing the text-as-image ILM training/eval runbooks beyond the etymology-first quick start.
  • Improve environment reproducibility (single authoritative dependency spec).
  • Expand tests/CI coverage for research scripts and pipeline glue.
  • Iterate on hierarchical codebooks, diffusion objectives, and controllability channels.
  • Consolidate docs across docs/, script help text, and publication artifacts.

For deeper conceptual and staged planning details, see:

  • docs/imagized-language-model.md
  • docs/ilm-visual-diffusion-code-plan.md
  • docs/development-plan.md

🤝 Contributing

  • Follow AGENTS.md for conventions (atomic commits, push after change, no credentials in code).
  • Group related edits in focused commits with conventional messages.
  • Prefer reproducible script invocations with explicit flags and input paths.
  • For scraping-related changes, preserve throttling/cache behavior and site-respect constraints.

❤️ Support

Donate PayPal Stripe

📄 License

No top-level license file is currently present in this repository.

Assumption note: treat the project as research code with unspecified licensing until a LICENSE file is added by maintainers.