GLYPH

GLYPH is a byte-exact substring retrieval engine over raw bytes.

It is designed for high-speed exact matching without tokenization or scoring.

It is NOT a search engine:

no ranking
no fuzzy matching
no scoring

It performs deterministic exact matches at scale.

⚡ Try it in 10 seconds

git clone https://github.com/yasha1971-coder/glyph-engine
cd glyph-engine
./examples/mini/build_mini.sh

Expected output:

count:    2

This runs a full sentinel-safe pipeline:

prepares a corpus with a real 0x00 sentinel appended
builds suffix array (SA)
builds BWT
builds FM-index
runs a real query
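
The steps above can be sketched in a few lines of Python. This is only a toy illustration of the general sentinel, SA, BWT, FM-index technique, not GLYPH's implementation: the function names are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed.

```python
# Toy sketch of the pipeline: sentinel -> SA -> BWT -> FM-index -> count query.
# All names here are illustrative assumptions, not GLYPH's API.

def build_suffix_array(text: bytes) -> list:
    # Naive construction by sorting suffixes; fine for a demo, far too slow for real corpora.
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_bwt(text: bytes, sa: list) -> bytes:
    # BWT[i] is the byte preceding suffix sa[i]; index -1 wraps around to the sentinel.
    return bytes(text[i - 1] for i in sa)

def build_fm(bwt: bytes):
    # C[c] = number of bytes in the text strictly smaller than c.
    counts = [0] * 256
    for b in bwt:
        counts[b] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of byte c in bwt[:i] (uncompressed, memory-hungry).
    occ = {c: [0] * (len(bwt) + 1) for c in set(bwt)}
    for i, b in enumerate(bwt):
        for c, row in occ.items():
            row[i + 1] = row[i] + (1 if c == b else 0)
    return C, occ

def count_occurrences(corpus: bytes, pattern: bytes) -> int:
    if b"\x00" in corpus:
        raise ValueError("corpus must not contain 0x00")
    text = corpus + b"\x00"            # append the terminal sentinel
    sa = build_suffix_array(text)
    bwt = build_bwt(text, sa)
    C, occ = build_fm(bwt)
    lo, hi = 0, len(text)              # current SA interval [lo, hi)
    for c in reversed(pattern):        # backward search, last byte first
        if c not in occ:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(b"abracadabra", b"abra")` returns 2, because the pattern occurs at offsets 0 and 7.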

No large datasets required.

Index your own file

Build an FM-index for any small file:

tools/build_glyph_index_v1.sh /path/to/your/file /tmp/glyph-index

Query an exact pattern:

./build/query_fm_v1 /tmp/glyph-index/fm.bin /tmp/glyph-index/bwt.bin "$(printf 'your pattern' | xxd -p -c 999999)"
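
The query binary takes the pattern as a plain hex string, which is what the `printf | xxd -p` pipeline produces. If you drive queries from Python, the same encoding is available through `bytes.hex()`. The helper names below are hypothetical, and the argument order simply mirrors the command shown above.

```python
# Minimal sketch: hex-encode a pattern and assemble the query_fm_v1 argument list.
# Helper names are illustrative; paths follow the example command in this README.

def pattern_to_hex(pattern: bytes) -> str:
    # Lowercase hex with no separators, matching `xxd -p` output for the same bytes.
    return pattern.hex()

def query_command(index_dir: str, pattern: bytes) -> list:
    return [
        "./build/query_fm_v1",
        f"{index_dir}/fm.bin",
        f"{index_dir}/bwt.bin",
        pattern_to_hex(pattern),
    ]
```

The resulting list can be passed to `subprocess.run` once the binaries and index exist.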

Important:

input corpus must not contain 0x00
GLYPH v0.x appends a real terminal 0x00 sentinel
current indexes are optimized for static corpora
current RAM overhead is high
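
Because the corpus must not contain 0x00, it can help to check a file before indexing it. The following is a minimal sketch of such a pre-check; the function name is hypothetical and the script is not part of the GLYPH tooling.

```python
# Minimal sketch: find the first 0x00 byte in a file, reading in chunks.

def first_nul_offset(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the byte offset of the first 0x00 in the file, or -1 if there is none."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            idx = chunk.find(b"\x00")
            if idx != -1:
                return offset + idx
            offset += len(chunk)
```

A result of -1 means the file satisfies the no-0x00 requirement; any other value is the offset of the first offending byte.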

Build

Before running direct FM queries, build the C++ binaries:

cmake -S . -B build
cmake --build build -j2

Required tools:

CMake
C++17 compiler
Python 3
xxd

Documentation

Architecture:

docs/architecture/ENGINE_OVERVIEW.md

Specifications:

docs/specs/INDEX_FORMAT_V1.md
docs/specs/SENTINEL_INVARIANT.md
docs/specs/KNOWN_LIMITATIONS.md

Benchmarks:

benchmarks/HDFS_1GB_BENCHMARK.md
benchmarks/SEGMENTED_FIXED_CORRECTNESS.md

Business / Contact:

docs/business/CONTACT.md

Project boundaries:

WHAT_GLYPH_IS_NOT.md

Entry points

| Path | Purpose |
| --- | --- |
| `examples/mini/` | Start here. Self-contained demo. |
| `tools/build_glyph_index_v1.sh` | Canonical sentinel-safe index builder. |
| `build/query_fm_v1` | Direct FM query binary. |
| `glyph_cli.py` | HTTP client for a running local GLYPH server. |
| `glyph_http_server.py` | Experimental persistent HTTP backend. |
| `glyph_segmented_query_v1.py` | Experimental segmented query path. |

Advanced: HTTP server mode

This mode is experimental and expects prepared index artifacts plus a running local HTTP server.

Note:

run.sh expects local prepared demo artifacts
large corpus/index artifacts are not included

Check service:

curl http://127.0.0.1:18080/health
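
The same check can be scripted. This sketch uses only the Python standard library and assumes that a healthy server answers the `/health` URL with HTTP 200; the README does not specify the response body, so the body is ignored. The function name is hypothetical.

```python
# Minimal sketch: probe the local health endpoint.
# Assumption: "healthy" means the endpoint answers with HTTP 200.
import urllib.error
import urllib.request

def is_healthy(url: str = "http://127.0.0.1:18080/health", timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status raised as HTTPError.
        return False
```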

Query prepared demo data:

./glyph_cli.py --hex "$(xxd -p -c 999999 /tmp/query_41905.bin)"

Expected:

JSON response with exact byte-match shortlist

Core guarantees

byte-exact substring retrieval
deterministic results
no ranking
no fuzzy matching
no tokenization
no semantic interpretation
sentinel-safe FM-index construction

What problem it solves

Most tools give up either speed or exactness:

grep → exact, but scans the whole corpus on every query (slow at scale)
Elasticsearch → tokenizes and ranks by relevance score
vector search → approximate similarity

GLYPH does the opposite:

exact byte matches
no interpretation
deterministic results

When to use

large-scale log search
binary corpus lookup
forensic / debugging analysis
RAG pre-filtering (exact stage before embedding)

Performance

~1.3–1.7 ms query latency (warm)
~4 ms p99 (4GB shard)
mmap-based index

RAM note:

Current plain index artifacts are large. The HDFS 1GB benchmark used about 9.4 GB of RAM for a 1 GB corpus. This is a known limitation; future work needs to address it with a compressed or sampled SA and better memory economics.
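
To see why a plain (unsampled) suffix array dominates memory, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not the project's measured breakdown: it assumes one fixed-width SA entry per byte of text (8 bytes by default, a hypothetical choice) plus one byte per position for the BWT, and it ignores rank tables and process overhead.

```python
# Rough size estimate for a plain SA + BWT over a corpus with one appended sentinel.
# sa_entry_bytes=8 is an assumption for illustration, not a documented GLYPH parameter.

def plain_index_bytes(corpus_bytes: int, sa_entry_bytes: int = 8) -> dict:
    n = corpus_bytes + 1                      # text length including the 0x00 sentinel
    suffix_array = n * sa_entry_bytes         # one entry per text position
    bwt = n                                   # one byte per text position
    return {"suffix_array": suffix_array, "bwt": bwt, "total": suffix_array + bwt}
```

Under these assumptions the suffix array alone costs `sa_entry_bytes` times the corpus size, which is why sampling or compressing it is the main lever for reducing memory.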

Status

Experimental prototype.

See:

RUNBOOK_4GB.md
DEMO_SECURITY.md

Contact

Website: https://glyph.rs
Email: contact@glyph.rs
