A reference implementation of an OCR technique for recognising Chinese characters from "obfuscated" web fonts — fonts where standard glyphs are remapped to arbitrary Unicode codepoints so that the HTML source contains gibberish while the rendered output looks normal.
This is a research / educational project. All examples are synthetic: fonts are generated locally from public sources (Noto Sans SC), obfuscation is simulated with random codepoint remapping. The repository does not target any specific website and ships no scraping tooling.
A few CJK-language platforms protect their text by shipping a custom font
where standard characters are remapped to arbitrary private codepoints. When
the page is rendered, the font draws the correct glyph for each codepoint,
so a human reader sees normal text. But the raw HTML contains a different
set of Unicode codepoints — so anything reading innerText gets noise.
Example:
- HTML contains codepoint U+E001
- The custom font draws the glyph for 中 when given U+E001
- A reader sees 中; anything reading the raw HTML sees only the private-use codepoint U+E001
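The example above can be reproduced in a few lines. This is a minimal sketch with a made-up two-entry table; the codepoints U+E001 and U+E002 and the helper `is_private_use` are illustrative assumptions, not part of the package.

```python
# Hypothetical obfuscation table: private-use codepoint -> real character.
mapping = {0xE001: "中", 0xE002: "国"}

# What a scraper or translator would read from the HTML source.
html_text = "\ue001\ue002"


def is_private_use(ch: str) -> bool:
    """True if ch lies in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF


# Every character in the raw text is a private-use codepoint with no meaning
# of its own; only the custom font gives it a shape.
all_private = all(is_private_use(c) for c in html_text)

# Once the table is known, recovering the text is a plain lookup.
decoded = "".join(mapping.get(ord(c), c) for c in html_text)
```

The hard part is obtaining `mapping` in the first place, which is what the recognition cascade is for.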
General-purpose OCR (Tesseract, EasyOCR, …) can solve this in principle, but:
- It's heavyweight for what is structurally a simpler problem
- It can be wrong on visually-similar characters
- Accuracy on small / web-rendered text varies
The technique in this repo exploits a useful observation: the obfuscated glyph is usually a remapped copy of a standard reference font (Noto Sans SC, Source Han Sans, etc.). So the problem isn't really "what character is this?" — it's "which of the ~30 000 reference glyphs does this glyph match?". That's a nearest-neighbour problem in a small feature space, not full OCR.
A common real-world use case for Chinese-language web content is machine translation — feeding the page through Google Translate, DeepL, browser-integrated translators, or LLM-based pipelines. All of these expect their input to be real text: they read the HTML, extract the visible string, and send it to the translation engine.
That pipeline works perfectly on a normal page. It breaks completely on a page with font obfuscation: the HTML contains gibberish codepoints, so the translator sees gibberish and either passes it through unchanged or produces nonsense. The reader sees correctly-rendered Chinese on screen, but the translator next to them shows random Unicode boxes — there is no path from "what is on the screen" to "translated English text" without first solving the obfuscation.
flowchart LR
H1[Obfuscated HTML] --> T1[Translator]
T1 --> Bad[Gibberish output]
H2[Obfuscated HTML] --> R[Rendered pixels<br/>in the browser]
R --> O[OCR cascade<br/>this repo]
O --> Real[Recovered real text<br/>中国大山木水...]
Real --> T2[Translator]
T2 --> Good[Correct translation]
style Bad fill:#fde2e2,stroke:#c66,color:#622
style Good fill:#dff0d8,stroke:#6a6,color:#262
style O fill:#dde7f7,stroke:#46a,stroke-width:2px,color:#234
The cascade recovers the real Unicode text from the rendered glyphs. After that, any translator works as it would on a normal page. That is the end-to-end shape of the problem this repo addresses.
The pipeline has two parts: build the reference tables once, offline, then recognise any query glyph against them in a few milliseconds.
A reference table is a serialised array of feature vectors, one per CJK character in the reference font. It is built once, then loaded by the recogniser at runtime.
For each character:
- Render the glyph onto a high-resolution canvas using the reference font.
- Tight-crop the resulting bitmap to the actual inked region. This normalises shape across glyphs of very different proportions (一, 中, 麗).
- Down-sample the cropped bitmap to two grids: a binary 12×12 occupancy grid (fast filter) and a continuous 16×16 density grid (precise ranking).
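The crop and down-sample steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the package's own code: the function names are hypothetical (the package's `density_grid` and `occupancy_grid` may differ in signature and behaviour), the input is assumed to be a 2-D greyscale bitmap with 0 as background, and the occupancy threshold of 0.25 is an arbitrary choice.

```python
import numpy as np


def tight_crop(bitmap: np.ndarray) -> np.ndarray:
    """Crop a 2-D greyscale bitmap (0 = background) to its inked bounding box."""
    rows = np.flatnonzero(bitmap.any(axis=1))
    cols = np.flatnonzero(bitmap.any(axis=0))
    if rows.size == 0:  # blank glyph: nothing to crop
        return bitmap
    return bitmap[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def sketch_density_grid(bitmap: np.ndarray, size: int = 16) -> np.ndarray:
    """Mean ink coverage per cell of a size x size grid, scaled to [0, 1]."""
    cropped = tight_crop(bitmap).astype(np.float64)
    peak = cropped.max()
    if peak == 0:
        return np.zeros((size, size))
    cropped /= peak
    height, width = cropped.shape
    row_edges = np.linspace(0, height, size + 1).astype(int)
    col_edges = np.linspace(0, width, size + 1).astype(int)
    grid = np.empty((size, size))
    for i in range(size):
        r0 = row_edges[i]
        r1 = max(row_edges[i + 1], r0 + 1)  # at least one pixel row per cell
        for j in range(size):
            c0 = col_edges[j]
            c1 = max(col_edges[j + 1], c0 + 1)  # at least one pixel column
            grid[i, j] = cropped[r0:r1, c0:c1].mean()
    return grid


def sketch_occupancy_grid(
    bitmap: np.ndarray, size: int = 12, threshold: float = 0.25
) -> np.ndarray:
    """Binary grid: True where a cell's mean coverage exceeds the threshold."""
    return sketch_density_grid(bitmap, size) > threshold


# Usage: a filled square stands in for a rendered glyph.
glyph = np.zeros((64, 64), dtype=np.uint8)
glyph[10:40, 20:50] = 255
density = sketch_density_grid(glyph)      # shape (16, 16), floats in [0, 1]
occupancy = sketch_occupancy_grid(glyph)  # shape (12, 12), booleans
```

Because both grids are computed after the tight crop, a wide glyph such as 一 and a square one such as 中 are stretched to fill the same grid, which is what makes their features comparable.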
The build step is implemented by build_reference_tables in obfuscated_cjk_ocr/build_tables.py and takes about a minute for ~30 000 glyphs on a modern laptop.
The recogniser runs as a two-stage cascade with an optional third correction step. The cascade structure is what makes this fast: stage 1 is cheap but coarse, stage 2 is precise but only runs on ~40 candidates, stage 3 only fires on known-ambiguous pairs.
Stage 1 — Occupancy filter. Compute the binary 12×12 occupancy grid of the query. Hamming-distance against every reference entry. Keep the 40 closest. This is O(N × 144 bits) — sub-millisecond against 30 000 references.
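A minimal sketch of the stage 1 filter, assuming the occupancy grids are stored flattened as boolean NumPy arrays of length 144 and that the reference set is non-empty. The function name `hamming_shortlist` and the array layout are assumptions for illustration; the package may pack bits or order results differently.

```python
import numpy as np


def hamming_shortlist(
    query_bits: np.ndarray, ref_bits: np.ndarray, keep: int = 40
) -> np.ndarray:
    """Indices of the `keep` references nearest to the query, closest first.

    query_bits: bool array of shape (144,)
    ref_bits:   bool array of shape (N, 144), one row per reference glyph
    """
    # Hamming distance = number of grid cells where the two grids disagree.
    distances = np.count_nonzero(ref_bits != query_bits, axis=1)
    keep = min(keep, len(distances))
    # argpartition selects the `keep` smallest distances without a full sort.
    nearest = np.argpartition(distances, keep - 1)[:keep]
    # Sort only the survivors so the closest candidate comes first.
    return nearest[np.argsort(distances[nearest], kind="stable")]


# Usage with random stand-in data: the query is reference 123 with 3 bits flipped.
rng = np.random.default_rng(0)
refs = rng.random((500, 144)) > 0.5
query = refs[123].copy()
query[:3] = ~query[:3]
shortlist = hamming_shortlist(query, refs)
```

Selecting with `argpartition` rather than a full sort keeps the cost linear in the number of references, which matters more as the table grows.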
Stage 2 — Density refinement. For each of the 40 survivors, compute the L2 distance between the query's 16×16 density grid and the reference's. The smallest L2 wins. This catches everything the coarse binary feature can't distinguish.
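Stage 2 can be sketched in the same style. The function name `rerank_by_density` and the (N, 16, 16) array layout are assumptions made for this example, not the package's API.

```python
import numpy as np


def rerank_by_density(
    query_density: np.ndarray, ref_density: np.ndarray, shortlist
) -> tuple:
    """Return (reference index, L2 distance) of the best shortlist entry.

    query_density: float array of shape (16, 16)
    ref_density:   float array of shape (N, 16, 16)
    shortlist:     indices that survived the occupancy filter
    """
    shortlist = np.asarray(shortlist)
    # Compare the query only against the shortlisted references.
    diffs = ref_density[shortlist] - query_density
    distances = np.linalg.norm(diffs.reshape(len(shortlist), -1), axis=1)
    best = int(np.argmin(distances))
    return int(shortlist[best]), float(distances[best])


# Usage with random stand-in data: the query is reference 7 plus a small offset.
rng = np.random.default_rng(1)
refs = rng.random((100, 16, 16))
query = refs[7] + 0.01
winner, distance = rerank_by_density(query, refs, [3, 7, 42])
```

Returning the distance alongside the index gives callers a simple confidence signal: a large winning distance suggests the reference font is a poor match for the query.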
Stage 3 — Confusables correction (optional). A handful of CJK pairs
are visually almost identical at these resolutions (日/曰,
万/方, 个/介, …). A small ruleset says "if stage 2 returns A,
examine density in region R; if it falls on the alternative's side of
threshold T, swap A → B". Recovers the last ~0.2% of accuracy.
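One way to express such a rule is a small record holding A, B, R and T. The schema below is hypothetical and is not the format of confusables_example.json; the placeholder characters "A" and "B", the region and the threshold are made up for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ConfusableRule:
    predicted: str       # A: the character stage 2 returned
    alternative: str     # B: the look-alike to swap to
    region: tuple        # R: (row_start, row_end, col_start, col_end) in the density grid
    threshold: float     # T: mean density that separates A from B in region R
    swap_if_above: bool  # True if B is the one with more ink in R


def apply_confusables(char: str, density: np.ndarray, rules) -> str:
    """Swap char for its look-alike when the region test points to the alternative."""
    for rule in rules:
        if rule.predicted != char:
            continue
        r0, r1, c0, c1 = rule.region
        mean_ink = float(density[r0:r1, c0:c1].mean())
        if (mean_ink > rule.threshold) == rule.swap_if_above:
            return rule.alternative
    return char


# Usage: a made-up rule saying "B has ink in the top four rows, A does not".
rule = ConfusableRule("A", "B", (0, 4, 0, 16), 0.5, True)
inked_top = np.zeros((16, 16))
inked_top[0:4, :] = 0.9
corrected = apply_confusables("A", inked_top, [rule])
unchanged = apply_confusables("A", np.zeros((16, 16)), [rule])
```

Keeping the rules as data rather than code means new confusable pairs can be added without touching the recogniser.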
For full algorithmic detail (design rationale, complexity, where the
technique generalises) see docs/method.md. For
practical engineering pitfalls (CID-keyed font codepoint bug,
anti-aliasing artefacts, confusables discovery workflow, abstract
DOM-level patterns to be aware of) see
docs/engineering_notes.md.
| Path | What's inside |
|---|---|
| obfuscated_cjk_ocr/ | The Python package. find_closest_font, build_reference_tables, decode_font, plus lower-level primitives (recognize, render_glyph, density_grid, occupancy_grid) and the confusables submodule |
| examples/ | CLI demos for each step of the pipeline, plus a synthetic benchmark and a sample confusables_example.json |
| chrome_extension/ | Minimal browser-extension demo. Runs the cascade inside its own popup on canvas-rendered glyphs |
| docs/ | method.md (deep methodology) and engineering_notes.md (hard-won pitfalls) |
pip install -e .
# .woff2 obfuscated fonts also need brotli:
pip install -e ".[woff2]"

Then either use the package programmatically (next section) or run the example scripts.
from obfuscated_cjk_ocr import (
    find_closest_font,
    build_reference_tables,
    decode_font,
    save_tables,
    load_tables,
)

# Step 1: Identify which reference font visually matches the obfuscated one.
# Pure-shape comparison; no labels needed.
best, scores = find_closest_font(
    "obfuscated.woff2",
    candidate_font_paths=[
        "fonts/NotoSansSC-Regular.otf",
        "fonts/SourceHanSans-Regular.otf",
        "fonts/NotoSerifSC-Regular.otf",
    ],
)

# Step 2: Build reference tables from the winning candidate.
tables = build_reference_tables(best)
save_tables(tables, "ref_tables.npz")  # optional: cache for reuse

# Step 3: Recognise every glyph in the obfuscated font.
# Returns {obfuscated_codepoint: real_unicode_char}.
mapping = decode_font("obfuscated.woff2", tables)

# Apply the mapping to recover real text from anything in the
# obfuscated codepoint space:
gibberish_from_html = "..."
real_text = "".join(mapping.get(ord(c), c) for c in gibberish_from_html)

# Download Noto Sans SC (free, SIL OFL):
# https://fonts.google.com/noto/specimen/Noto+Sans+SC
# Step 1 — identify the closest reference font:
python examples/01_identify_font.py \
--obfuscated path/to/obfuscated.woff2 \
--candidates fonts/NotoSansSC-Regular.otf fonts/NotoSerifSC-Regular.otf
# Step 2 — build reference tables from a chosen font:
python examples/02_build_tables.py \
--font fonts/NotoSansSC-Regular.otf \
--out ref_tables.npz
# Step 3 — decode an obfuscated font end-to-end:
python examples/04_decode_font.py \
--obfuscated path/to/obfuscated.woff2 \
--tables ref_tables.npz \
--confusables examples/confusables_example.json \
--out mapping.json
# Single-glyph debug recognition:
python examples/03_recognize_glyph.py \
--tables ref_tables.npz \
--font fonts/NotoSansSC-Regular.otf \
--char 中
# Synthetic accuracy benchmark (all three stages):
python examples/benchmark.py \
--ref-font fonts/NotoSansSC-Regular.otf \
--query-font fonts/NotoSansSC-Regular.otf \
--confusables examples/confusables_example.json \
--n 2000

All experiments use:
- Reference font: Noto Sans SC (SIL Open Font License)
- Synthetic obfuscation: random permutation of a 2 000-character subset, same font re-rendered under the permuted codepoints
- Test set: held-out characters not used to build the reference table
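The codepoint side of the synthetic obfuscation can be sketched with the standard library alone. This covers only the random permutation of codepoints within a character subset; re-rendering the font under the permuted codepoints is not shown. The function names are hypothetical and the six-character subset is a stand-in for the 2 000-character sample.

```python
import random


def make_synthetic_obfuscation(chars: str, seed: int = 0) -> dict:
    """Return {obfuscated_codepoint: real_char} for a random permutation of chars.

    Assumes every character in `chars` is distinct.
    """
    codepoints = [ord(c) for c in chars]
    shuffled = codepoints[:]
    random.Random(seed).shuffle(shuffled)
    # After the shuffle, codepoint shuffled[i] is drawn with the glyph of chars[i].
    return dict(zip(shuffled, chars))


def obfuscate(text: str, table: dict) -> str:
    """Rewrite text the way the obfuscated HTML would carry it."""
    inverse = {real: codepoint for codepoint, real in table.items()}
    return "".join(chr(inverse[c]) if c in inverse else c for c in text)


# Usage: the table doubles as ground truth for scoring a decoder.
subset = "中国大山木水"
table = make_synthetic_obfuscation(subset, seed=42)
noisy = obfuscate("大山", table)
recovered = "".join(table.get(ord(c), c) for c in noisy)
```

Fixing the seed keeps a benchmark run reproducible, since the same permutation is generated each time.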
Typical numbers on a 2 000-character random sample:
| Stage | Top-1 accuracy |
|---|---|
| 1 — occupancy 12×12 only | ~97% |
| 2 — cascade through density 16×16 | ~99.9% |
| 3 — + confusables correction | ~99.96% |
End-to-end latency: 2–4 ms per query on a modern laptop, dominated by glyph rendering. The cascade itself takes ~1 ms.
The benchmark script in
examples/benchmark.py prints exact numbers
for your environment.
The cascade is not specific to CJK obfuscated fonts. The pattern — cheap binary filter → expensive continuous re-rank → narrow targeted correction — applies to any closed-vocabulary shape recognition problem.
Examples where the same scaffold can be reused:
- Other obfuscated scripts. Cyrillic, Greek, or Arabic obfuscated fonts use the same trick of remapping codepoints to standard shapes; rebuilding the reference table from a script-appropriate font is the only change.
- Icon / logo classification against a known reference set. Both features (occupancy + density) are agnostic to what the shape is.
- Vehicle plate character recognition in cleanroom conditions (rendered or printed plates, not photographed) where the font is known.
- CAPTCHA-style glyph recognition when the rendering is deterministic and noise-free.
For each new script, the things to tune are coarse_gs (coarser for
simpler shapes, finer for intricate ones), the confusables ruleset
(every script has its own confusable pairs), and possibly the font
matching step (find_closest_font). Everything else stays the same.
This technique works when:
- The obfuscated glyph is shape-identical (or near-identical) to a glyph in the reference font
- The query glyph is available either as a vector outline or as a high-resolution clean raster
It does not solve:
- General-purpose OCR on photographs, scans, or hand-drawn text
- Stylistically distinct fonts with no near-matching reference
- Document layout / paragraph reconstruction / character segmentation
End-to-end use against a real-world obfuscated source has two layers this package deliberately does not cover:
Upstream — DOM extraction. Pages frequently combine font obfuscation
with DOM-level tricks: real characters in CSS pseudo-elements
(::before { content: "X" }), display:none decoy children, scrambled
paragraph order rebuilt via absolute positioning, etc. The right text
to feed into decode_font is what a human reader sees on screen,
which the browser's layout engine produces — not raw innerText. See
docs/engineering_notes.md
for the patterns to watch for.
Downstream: variant-canonical normalisation. Some CJK characters
are variant forms of one another, visually near-identical but encoded
under distinct Unicode codepoints
(両 U+4E21 vs 兩 U+5169, 靣 U+9763 vs 面 U+9762, …).
The cascade picks the closest shape match in the reference font,
which can be the variant rather than the canonical form. NFKD (the
package applies it internally for the CJK Compatibility Ideographs
block U+F900..U+FAFF) covers part of this, but variants in CJK
Extension A and similar ranges need a per-project remap.
recognize and decode_font accept an optional variant_map of
the form {"variant_char": "canonical_char"}; see
examples/variant_map_example.json
for the schema and a worked starter set. A complete table can be
derived from the Unihan database's kCompatibilityVariant and
kSemanticVariant fields.
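A minimal sketch of this post-processing step, using only the standard library. The function name `canonicalise` is hypothetical and is not the package's internal implementation; the one-entry `variant_map` follows the `{"variant_char": "canonical_char"}` form described above.

```python
import unicodedata


def canonicalise(text: str, variant_map: dict) -> str:
    """Fold compatibility ideographs via NFKD, then apply a per-project variant map."""
    out = []
    for ch in text:
        # Only the CJK Compatibility Ideographs block is normalised here, so
        # characters outside it are not touched by NFKD.
        if 0xF900 <= ord(ch) <= 0xFAFF:
            ch = unicodedata.normalize("NFKD", ch)
        out.append(variant_map.get(ch, ch))
    return "".join(out)


# Usage: U+F900 folds to U+8C48 through NFKD; 靣 needs the explicit map entry.
variant_map = {"靣": "面"}
folded = canonicalise("\uf900", variant_map)
remapped = canonicalise("靣", variant_map)
```

Restricting NFKD to one block is a deliberate choice in this sketch: applied to a whole string, NFKD would also rewrite full-width punctuation and other compatibility characters, which may not be wanted.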
The package does the visual recognition step well; the wrappers around it are the user's responsibility.
MIT. See LICENSE. Noto Sans SC, referenced for demos, is under SIL OFL — not included in this repository.
This code exists to document a computer-vision technique. It is not a scraping tool, does not target any specific website, and does not include code or data tied to any particular service. If anyone wants to apply this technique to real-world content, please make sure you have the right to do so under the terms of the source you are reading.