albond/obfuscated-cjk-ocr

Obfuscated CJK Font OCR

License: MIT · Python 3.10+ · Status: research · Install: pip install -e .

A reference implementation of an OCR technique for recognising Chinese characters from "obfuscated" web fonts — fonts where standard glyphs are remapped to arbitrary Unicode codepoints so that the HTML source contains gibberish while the rendered output looks normal.

This is a research / educational project. All examples are synthetic: fonts are generated locally from public sources (Noto Sans SC), and obfuscation is simulated with random codepoint remapping. The repository does not target any specific website and ships no scraping tooling.


What is "font obfuscation"?

A few CJK-language platforms protect their text by shipping a custom font where standard characters are remapped to arbitrary private codepoints. When the page is rendered, the font draws the correct glyph for each codepoint, so a human reader sees normal text. But the raw HTML contains a different set of Unicode codepoints — so anything reading innerText gets noise.

Example:

  • HTML contains codepoint U+E001
  • The custom font draws the glyph of a real Chinese character when given U+E001
  • A reader sees that character; anything reading the raw HTML sees only the private-use codepoint U+E001
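
The remapping itself is easy to simulate. The sketch below is a minimal illustration in the spirit of the random codepoint remapping used for this repo's synthetic examples; the helper name make_obfuscation and the choice to start at U+E000 in the Private Use Area are assumptions for the example, not part of the package.

```python
import random


def make_obfuscation(chars, seed=0, pua_start=0xE000):
    """Assign each real character a shuffled Private Use Area codepoint.

    Hypothetical helper for illustration only.
    Returns (encode, decode) dicts keyed by single-character strings.
    """
    unique = sorted(set(chars))
    codepoints = [chr(pua_start + i) for i in range(len(unique))]
    random.Random(seed).shuffle(codepoints)
    encode = dict(zip(unique, codepoints))
    decode = {fake: real for real, fake in encode.items()}
    return encode, decode


real = "中国大山木水"
encode, decode = make_obfuscation(real)
obfuscated = "".join(encode[c] for c in real)        # what the HTML would contain
recovered = "".join(decode[c] for c in obfuscated)   # what the custom font renders
```

In a real obfuscated font the decode table is not available; recovering it from glyph shapes is the job of the cascade described below.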

General-purpose OCR (Tesseract, EasyOCR, …) can solve this in principle, but:

  • It's heavyweight for what is structurally a simpler problem
  • It can be wrong on visually-similar characters
  • Accuracy on small / web-rendered text varies

The technique in this repo exploits a useful observation: the obfuscated glyph is usually a remapped copy of a standard reference font (Noto Sans SC, Source Han Sans, etc.). So the problem isn't really "what character is this?" — it's "which of the ~30 000 reference glyphs does this glyph match?". That's a nearest-neighbour problem in a small feature space, not full OCR.


Why this matters: translation

A common real-world use case for Chinese-language web content is machine translation — feeding the page through Google Translate, DeepL, browser-integrated translators, or LLM-based pipelines. All of these expect their input to be real text: they read the HTML, extract the visible string, and send it to the translation engine.

That pipeline works perfectly on a normal page. It breaks completely on a page with font obfuscation: the HTML contains gibberish codepoints, so the translator sees gibberish and either passes it through unchanged or produces nonsense. The reader sees correctly-rendered Chinese on screen, but the translator next to them shows random Unicode boxes — there is no path from "what is on the screen" to "translated English text" without first solving the obfuscation.

flowchart LR
    H1[Obfuscated HTML] --> T1[Translator]
    T1 --> Bad[Gibberish output]

    H2[Obfuscated HTML] --> R[Rendered pixels<br/>in the browser]
    R --> O[OCR cascade<br/>this repo]
    O --> Real[Recovered real text<br/>中国大山木水...]
    Real --> T2[Translator]
    T2 --> Good[Correct translation]

    style Bad fill:#fde2e2,stroke:#c66,color:#622
    style Good fill:#dff0d8,stroke:#6a6,color:#262
    style O fill:#dde7f7,stroke:#46a,stroke-width:2px,color:#234

The cascade recovers the real Unicode text from the rendered glyphs. After that, any translator works as it would on a normal page. That is the end-to-end shape of the problem this repo addresses.


Method

The pipeline has two parts: build the reference tables once, offline, then recognise any query glyph against them in a few milliseconds.

Step 1 — Building the reference tables

A reference table is a serialised array of feature vectors, one per CJK character in the reference font. It is built once, then loaded by the recogniser at runtime.

Figure: reference table build flow (font → iterate codepoints → render and grid extraction → archive)

For each character:

  • Render the glyph onto a high-resolution canvas using the reference font.
  • Tight-crop the resulting bitmap to the actual inked region. This normalises shape across glyphs of very different proportions.
  • Down-sample the cropped bitmap to two grids: a binary 12×12 occupancy grid (fast filter) and a continuous 16×16 density grid (precise ranking).
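
The crop and down-sample steps above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the helper names (tight_crop, to_density_grid, to_occupancy_bits), the list-of-lists bitmap with ink values in 0.0..1.0, the packing of the occupancy grid into a single integer, and the 0.5 occupancy threshold are all assumptions made for the example.

```python
def tight_crop(bitmap):
    """Crop a 2-D list of ink values to the bounding box of non-zero ink."""
    rows = [y for y, row in enumerate(bitmap) if any(v > 0 for v in row)]
    cols = [x for x in range(len(bitmap[0])) if any(row[x] > 0 for row in bitmap)]
    if not rows:                      # blank glyph: nothing to crop
        return bitmap
    return [row[cols[0]:cols[-1] + 1] for row in bitmap[rows[0]:rows[-1] + 1]]


def to_density_grid(bitmap, size=16):
    """Mean ink per cell after splitting the bitmap into size x size cells."""
    h, w = len(bitmap), len(bitmap[0])
    grid = []
    for gy in range(size):
        y0 = gy * h // size
        y1 = max(y0 + 1, (gy + 1) * h // size)   # every cell covers >= 1 pixel
        row = []
        for gx in range(size):
            x0 = gx * w // size
            x1 = max(x0 + 1, (gx + 1) * w // size)
            cell = [bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        grid.append(row)
    return grid


def to_occupancy_bits(bitmap, size=12, threshold=0.5):
    """Binary occupancy grid packed row-major into one int (assumed threshold)."""
    bits = 0
    for row in to_density_grid(bitmap, size):
        for cell in row:
            bits = (bits << 1) | (cell > threshold)
    return bits


# Demo: a solid 24x24 block of ink somewhere on a 40x40 canvas.
canvas = [[0.0] * 40 for _ in range(40)]
for y in range(5, 29):
    for x in range(8, 32):
        canvas[y][x] = 1.0
cropped = tight_crop(canvas)
occupancy = to_occupancy_bits(cropped)      # 144 bits
density = to_density_grid(cropped)          # 16 x 16 floats
```

Packing the 12×12 grid into one integer is a design choice of this sketch: it turns the later Hamming comparison into a single XOR and a bit count.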

The build step is implemented by the build_reference_tables function in obfuscated_cjk_ocr/build_tables.py and takes about a minute for ~30 000 glyphs on a modern laptop.

Step 2 — Recognising a query glyph

The recogniser runs as a two-stage cascade with an optional third correction step. The cascade structure is what makes this fast: stage 1 is cheap but coarse, stage 2 is precise but only runs on ~40 candidates, stage 3 only fires on known-ambiguous pairs.

Figure: recognition cascade (obfuscated input glyph → feature extraction (occupancy + density grids) → ranked candidates with best match highlighted)

Stage 1 — Occupancy filter. Compute the binary 12×12 occupancy grid of the query, take its Hamming distance to every reference entry, and keep the 40 closest. This is O(N × 144 bits), which is sub-millisecond against 30 000 references.
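
A minimal sketch of the stage 1 filter, assuming each occupancy grid has been packed into a Python integer (one bit per cell). The function names and that packing are assumptions of this example; the package may well store the grids differently, for instance as NumPy arrays.

```python
import heapq


def hamming(a, b):
    """Number of differing bits between two bit-packed occupancy grids."""
    return bin(a ^ b).count("1")


def occupancy_filter(query_bits, reference_bits, keep=40):
    """Indices of the `keep` references closest to the query, nearest first."""
    return heapq.nsmallest(
        keep,
        range(len(reference_bits)),
        key=lambda i: hamming(query_bits, reference_bits[i]),
    )


# Toy 4-bit grids standing in for the real 144-bit ones.
references = [0b1111, 0b0000, 0b1110, 0b0110]
shortlist = occupancy_filter(0b1111, references, keep=2)
```

Here the shortlist holds indices 0 and 2, the two references that differ from the query in the fewest cells.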

Stage 2 — Density refinement. For each of the 40 survivors, compute the L2 distance between the query's 16×16 density grid and the reference's. The smallest L2 wins. This catches everything the coarse binary feature can't distinguish.
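
Stage 2 can be sketched the same way. The example assumes each density grid has been flattened to a list of floats and that stage 1 supplies a list of candidate indices; both the function names and that data layout are assumptions for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened density grids."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_refine(query_density, reference_densities, candidates):
    """Return (index, distance) of the candidate with the smallest L2 distance."""
    best = min(
        candidates,
        key=lambda i: l2_distance(query_density, reference_densities[i]),
    )
    return best, l2_distance(query_density, reference_densities[best])


# Toy 4-cell grids standing in for the real 256-cell ones.
reference_densities = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
query_density = [0.9, 1.0, 1.0, 0.8]
best_index, best_distance = density_refine(query_density, reference_densities, [0, 1, 2])
```

Only the survivors of stage 1 are scored, so the cost of this step is fixed by the shortlist size rather than by the size of the reference table.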

Stage 3 — Confusables correction (optional). A handful of CJK pairs are visually almost identical at these resolutions. A small ruleset says "if stage 2 returns A, examine density in region R; if it falls on the alternative's side of threshold T, swap A → B". Recovers the last ~0.2% of accuracy.
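
One way such a rule could be represented and applied is sketched below. The ConfusableRule fields, the end-exclusive region convention, and the placeholder labels "A" and "B" are hypothetical; the actual schema used by the package is the one in examples/confusables_example.json, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusableRule:
    """Hypothetical rule shape: 'if stage 2 says A, check region R against T'."""
    predicted: str        # character returned by stage 2 (A)
    alternative: str      # character to swap to (B)
    region: tuple         # (row0, row1, col0, col1) in grid cells, end-exclusive
    threshold: float      # T
    swap_if_above: bool   # True if density above T indicates the alternative


def region_density(grid, region):
    """Mean density over a rectangular block of grid cells."""
    row0, row1, col0, col1 = region
    cells = [grid[y][x] for y in range(row0, row1) for x in range(col0, col1)]
    return sum(cells) / len(cells)


def apply_confusables(predicted, grid, rules):
    """Swap the prediction if a matching rule says the region favours the alternative."""
    for rule in rules:
        if rule.predicted != predicted:
            continue
        above = region_density(grid, rule.region) > rule.threshold
        if above == rule.swap_if_above:
            return rule.alternative
    return predicted


# Toy 4x4 grids: the rule looks at the top row only.
rules = [ConfusableRule("A", "B", (0, 1, 0, 4), 0.5, True)]
dense_top = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]
empty = [[0.0] * 4 for _ in range(4)]
swapped = apply_confusables("A", dense_top, rules)
kept = apply_confusables("A", empty, rules)
```

Because rules only fire for the character they name, predictions outside the known-ambiguous pairs pass through untouched.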

For full algorithmic detail (design rationale, complexity, where the technique generalises) see docs/method.md. For practical engineering pitfalls (CID-keyed font codepoint bug, anti-aliasing artefacts, confusables discovery workflow, abstract DOM-level patterns to be aware of) see docs/engineering_notes.md.


Contents

| Path | What's inside |
| --- | --- |
| obfuscated_cjk_ocr/ | The Python package. find_closest_font, build_reference_tables, decode_font, plus lower-level primitives (recognize, render_glyph, density_grid, occupancy_grid) and the confusables submodule |
| examples/ | CLI demos for each step of the pipeline, plus a synthetic benchmark and a sample confusables_example.json |
| chrome_extension/ | Minimal browser-extension demo. Runs the cascade inside its own popup on canvas-rendered glyphs |
| docs/ | method.md (deep methodology) and engineering_notes.md (hard-won pitfalls) |

Install

pip install -e .
# .woff2 obfuscated fonts also need brotli:
pip install -e ".[woff2]"

Then either use the package programmatically (next section) or run the example scripts.

Library API — the three core functions

from obfuscated_cjk_ocr import (
    find_closest_font,
    build_reference_tables,
    decode_font,
    save_tables,
    load_tables,
)

# Step 1 — Identify which reference font visually matches the obfuscated one.
#          Pure-shape comparison; no labels needed.
best, scores = find_closest_font(
    "obfuscated.woff2",
    candidate_font_paths=[
        "fonts/NotoSansSC-Regular.otf",
        "fonts/SourceHanSans-Regular.otf",
        "fonts/NotoSerifSC-Regular.otf",
    ],
)

# Step 2 — Build reference tables from the winning candidate.
tables = build_reference_tables(best)
save_tables(tables, "ref_tables.npz")     # optional: cache for reuse

# Step 3 — Recognise every glyph in the obfuscated font.
#          Returns {obfuscated_codepoint: real_unicode_char}.
mapping = decode_font("obfuscated.woff2", tables)

# Apply the mapping to recover real text from anything in the
# obfuscated codepoint space:
gibberish_from_html = "..."
real_text = "".join(mapping.get(ord(c), c) for c in gibberish_from_html)
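
If the mapping is keyed by integer codepoints, as the mapping.get(ord(c), c) call above suggests, Python's built-in str.translate performs the same substitution in one call. The mapping below is a hand-written stand-in for the output of decode_font, used only to make the example runnable.

```python
# Stand-in for the dict returned by decode_font (assumed shape:
# {obfuscated_codepoint_int: real_unicode_char}).
example_mapping = {0xE001: "中", 0xE002: "国"}

gibberish = "\ue001\ue002!"

# str.translate accepts a dict from codepoint ints to replacement strings;
# characters without an entry (here "!") pass through unchanged.
translated = gibberish.translate(example_mapping)

# Equivalent to the join-based form shown above.
joined = "".join(example_mapping.get(ord(c), c) for c in gibberish)
```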

Quick start (CLI examples)

# Download Noto Sans SC (free, SIL OFL):
#   https://fonts.google.com/noto/specimen/Noto+Sans+SC

# Step 1 — identify the closest reference font:
python examples/01_identify_font.py \
    --obfuscated path/to/obfuscated.woff2 \
    --candidates fonts/NotoSansSC-Regular.otf fonts/NotoSerifSC-Regular.otf

# Step 2 — build reference tables from a chosen font:
python examples/02_build_tables.py \
    --font fonts/NotoSansSC-Regular.otf \
    --out ref_tables.npz

# Step 3 — decode an obfuscated font end-to-end:
python examples/04_decode_font.py \
    --obfuscated path/to/obfuscated.woff2 \
    --tables ref_tables.npz \
    --confusables examples/confusables_example.json \
    --out mapping.json

# Single-glyph debug recognition:
python examples/03_recognize_glyph.py \
    --tables ref_tables.npz \
    --font fonts/NotoSansSC-Regular.otf \
    --char 中

# Synthetic accuracy benchmark (all three stages):
python examples/benchmark.py \
    --ref-font fonts/NotoSansSC-Regular.otf \
    --query-font fonts/NotoSansSC-Regular.otf \
    --confusables examples/confusables_example.json \
    --n 2000

Reproducibility

All experiments use:

  • Reference font: Noto Sans SC (SIL Open Font License)
  • Synthetic obfuscation: random permutation of a 2 000-character subset, same font re-rendered under the permuted codepoints
  • Test set: held-out characters not used to build the reference table

Typical numbers on a 2 000-character random sample:

| Stage | Top-1 accuracy |
| --- | --- |
| 1 — occupancy 12×12 only | ~97% |
| 2 — cascade through density 16×16 | ~99.9% |
| 3 — + confusables correction | ~99.96% |

End-to-end latency: 2–4 ms per query on a modern laptop, dominated by glyph rendering. The cascade itself takes ~1 ms.

The benchmark script in examples/benchmark.py prints exact numbers for your environment.
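
For readers who want to measure their own recogniser, the two reported quantities can be computed with a few lines. This is a generic sketch, not the contents of examples/benchmark.py: the recognise argument is any callable from query to predicted character, and the toy lookup below merely stands in for a real recogniser.

```python
import time


def top1_accuracy(recognise, samples):
    """Fraction of (query, expected) pairs where recognise(query) == expected."""
    hits = sum(1 for query, expected in samples if recognise(query) == expected)
    return hits / len(samples)


def mean_latency_ms(recognise, queries):
    """Average wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for query in queries:
        recognise(query)
    return (time.perf_counter() - start) * 1000 / len(queries)


# Toy stand-in for a recogniser: a lookup that gets one answer wrong.
toy_answers = {"q1": "中", "q2": "国", "q3": "大", "q4": "木"}
samples = [("q1", "中"), ("q2", "国"), ("q3", "大"), ("q4", "水")]

accuracy = top1_accuracy(toy_answers.get, samples)      # 3 of 4 correct
latency = mean_latency_ms(toy_answers.get, [q for q, _ in samples])
```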


Extending the technique

The cascade is not specific to CJK obfuscated fonts. The pattern — cheap binary filter → expensive continuous re-rank → narrow targeted correction — applies to any closed-vocabulary shape recognition problem.
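
The pattern can be written as a small generic scaffold in which the two distance functions and the correction step are pluggable. The function name, the (label, coarse_feature, fine_feature) tuple layout, and the toy features are assumptions of this sketch rather than the package's API.

```python
import heapq


def cascade(query, references, coarse, fine, keep=40, correct=None):
    """Cheap filter, then precise re-rank, then optional targeted correction.

    query      -- (coarse_feature, fine_feature)
    references -- list of (label, coarse_feature, fine_feature)
    coarse/fine -- distance functions on the respective features
    correct    -- optional callable (label, query) -> label
    """
    q_coarse, q_fine = query
    shortlist = heapq.nsmallest(
        keep, references, key=lambda ref: coarse(q_coarse, ref[1])
    )
    label = min(shortlist, key=lambda ref: fine(q_fine, ref[2]))[0]
    return correct(label, query) if correct else label


def bit_distance(a, b):
    return bin(a ^ b).count("1")


def scalar_distance(a, b):
    return abs(a - b)


# Toy closed vocabulary: 4-bit coarse feature, scalar fine feature.
references = [("a", 0b1100, 0.20), ("b", 0b1101, 0.80), ("c", 0b0011, 0.79)]
query = (0b1101, 0.78)

filtered = cascade(query, references, bit_distance, scalar_distance, keep=2)
unfiltered = cascade(query, references, bit_distance, scalar_distance, keep=3)
```

With keep=2 the coarse filter drops "c" before the fine comparison, so the result is "b"; with keep=3 nothing is filtered and the fine feature alone picks "c". The shortlist size is therefore a tuning parameter, not just a speed knob.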

Examples where the same scaffold can be reused:

  • Other obfuscated scripts. Cyrillic, Greek, or Arabic obfuscated fonts use the same trick of remapping codepoints to standard shapes; rebuilding the reference table from a script-appropriate font is the only change.
  • Icon / logo classification against a known reference set. Both features (occupancy + density) are agnostic to what the shape is.
  • Vehicle plate character recognition in cleanroom conditions (rendered or printed plates, not photographed) where the font is known.
  • CAPTCHA-style glyph recognition when the rendering is deterministic and noise-free.

For each new script, the things to tune are coarse_gs (coarser for simpler shapes, finer for intricate ones), the confusables ruleset (every script has its own confusable pairs), and possibly the font matching step (find_closest_font). Everything else stays the same.


Scope and limitations

This technique works when:

  • The obfuscated glyph is shape-identical (or near-identical) to a glyph in the reference font
  • The query glyph is available either as a vector outline or as a high-resolution clean raster

It does not solve:

  • General-purpose OCR on photographs, scans, or hand-drawn text
  • Stylistically distinct fonts with no near-matching reference
  • Document layout / paragraph reconstruction / character segmentation

What lives outside the cascade (you wire it up)

End-to-end use against a real-world obfuscated source has two layers this package deliberately does not cover:

Upstream — DOM extraction. Pages frequently combine font obfuscation with DOM-level tricks: real characters in CSS pseudo-elements (::before { content: "X" }), display:none decoy children, scrambled paragraph order rebuilt via absolute positioning, etc. The right text to run through the mapping returned by decode_font is what a human reader sees on screen, which the browser's layout engine produces, not raw innerText. See docs/engineering_notes.md for the patterns to watch for.

Downstream — variant-canonical normalisation. Some CJK characters are visually identical but encoded under distinct Unicode codepoints (U+4E21 vs U+96E8, U+9763 vs U+9762, …). The cascade picks the closest shape match in the reference font, which can be the variant rather than the canonical form. NFKD normalisation, which the package applies internally for the CJK Compatibility Ideographs block (U+F900..U+FAFF), covers part of this, but variants in CJK Extension A and similar ranges need a per-project remap. recognize and decode_font accept an optional variant_map of the form {"variant_char": "canonical_char"}; see examples/variant_map_example.json for the schema and a worked starter set. A complete table can be derived from the Unihan database's kCompatibilityVariant and kSemanticVariant fields.
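
The downstream normalisation step can be sketched with the standard library. The function name canonicalise and the order of operations (NFKD for the compatibility block first, then the variant map) are assumptions of this example; the one-entry variant map reuses a codepoint pair quoted above.

```python
import unicodedata


def canonicalise(text, variant_map=None):
    """Fold compatibility ideographs via NFKD, then apply a per-project variant map.

    variant_map has the form {"variant_char": "canonical_char"}.
    """
    variant_map = variant_map or {}
    out = []
    for ch in text:
        # Only touch the CJK Compatibility Ideographs block, U+F900..U+FAFF.
        if 0xF900 <= ord(ch) <= 0xFAFF:
            ch = unicodedata.normalize("NFKD", ch)
        out.append(variant_map.get(ch, ch))
    return "".join(out)


# U+F900 is a compatibility ideograph that decomposes to U+8C48.
folded = canonicalise("\uf900")

# U+9763 -> U+9762, the variant pair mentioned in the text.
remapped = canonicalise("\u9763", {"\u9763": "\u9762"})
```

Restricting NFKD to the compatibility block keeps the rest of the text byte-for-byte unchanged, which matters when the recovered string contains punctuation or full-width forms that should survive as they are.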

The package does the visual recognition step well; the wrappers around it are the user's responsibility.


License

MIT. See LICENSE. Noto Sans SC, referenced for demos, is under SIL OFL — not included in this repository.

A note on intent

This code exists to document a computer-vision technique. It is not a scraping tool, does not target any specific website, and does not include code or data tied to any particular service. If anyone wants to apply this technique to real-world content, please make sure you have the right to do so under the terms of the source you are reading.
