African PII detection that cites the law it enforces.
Warning
Status: pre-beta (development) - arche-core is under active development.
arche-core detects PII for African jurisdictions; government IDs, names, phone numbers, addresses, and grounds every detection in the data protection statute that governs it. NDPA, POPIA, Kenya DPA, Ghana DPA. Six closed policy actions. Composes with Presidio, GLiNER, and Splink.
Presidio detects PII. GLiNER does multilingual NER. Splink links records. None of them know that a BVN is sensitive under NDPA §30, or that "Adeyẹmí" and "Adeyemi" are the same Yoruba name with and without tonal marks, or that "behind Total filling station, Madina Junction" is a parseable Ghanaian address.
arche-coredoes that one job.
from arche import Pipeline
pipeline = Pipeline(jurisdiction="NG") # auto-loads NDPA-2023
result = pipeline.process(
"Fatima Abdullahi, NIN 12345678901, BVN 22100987654."
)
for d in result.detections:
print(f"{d.category:11} tier={d.sensitivity_tier.value:9} {d.regulatory_citation}")
# PII-2-BVN tier=high NDPA-2023 s.30, CBN BVN policy 2014
# PII-2-NIN tier=high NDPA-2023 s.30, NIMC Act s.27
# PII-1-NAME tier=moderate NDPA-2023 s.30 (×2 — given + family name)
print(result.redacted_text)
# NAME_... NAME_..., NIN [NIN], BVN [BVN].Same code works for jurisdiction="ZA" (POPIA), "KE" (Kenya DPA), "GH" (Ghana DPA). Four launch jurisdictions, four DPA-grounded statute YAML files, one composable framework.
pip install arche-core # ~310KB base — pure-Python detectors, statute policy
pip install arche-core[all] # everything (GLiNER + Presidio + Splink + docling + LLM)(Or uv add arche-core / uv add arche-core[all].) Heavy capabilities are opt-in extras:
| Extra | Adds |
|---|---|
arche-core[detect] |
GLiNER2-PII via ONNX runtime (multilingual neural soft-PII) |
arche-core[presidio] |
Microsoft Presidio recognizer plugin |
arche-core[resolve] |
Splink + DuckDB for large-scale entity resolution |
arche-core[doc] |
docling for PDF / DOCX / PPTX / XLSX ingestion |
Per-launch-jurisdiction detection coverage. Every detector validates check-digits where the underlying spec supports it.
| Jurisdiction | Statute | Detectors |
|---|---|---|
| Nigeria (NG) | NDPA-2023 | NIN (11 digits), BVN (11 digits, 22-prefix), TIN, RC, voter PVC, driver's licence |
| Kenya (KE) | Kenya DPA 2019 | National ID, KRA PIN, NHIF |
| South Africa (ZA) | POPIA | SA ID (13-digit Luhn + DOB/gender/citizenship decode), tax reference, passport |
| Ghana (GH) | Ghana DPA 2012 | Ghana Card, SSNIT, TIN |
| + 11 more African patterns | — | Egypt, Uganda, Rwanda, Tanzania, Cameroon, Senegal, ... |
Plus libphonenumber-backed normalization for 30+ African phone networks, landmark-anchored address parsing for NG and ZA, and currency detection (Naira, Cedi, Rand, CFA).
Every detection emits a category, a sensitivity tier (high / moderate / low), and the specific statute section that classifies it. The Pipeline maps each to one of six closed actions — mask, tokenize, drop, generalize, audit, retain — per the configured jurisdiction's statute YAML.
for o in result.policy_outcomes:
print(o.category, o.action, o.statute_reference)
# PII-2-BVN mask NDPA-2023 s.30, CBN BVN policy 2014
# PII-2-NIN mask NDPA-2023 s.30, NIMC Act s.27
# PII-1-NAME tokenize NDPA-2023 s.30Statute YAMLs live at arche/policy/_data/<STATUTE-ID>.yaml and are human-readable. Statute amendments are policy-file changes, not code changes.
arche-core ships a 114-group African name equivalence lexicon covering 454 name forms across 50+ ethnic traditions:
- Mohammed = Muhammad = Mamadou = Muhammadu (Pan-Islamic)
- Diallo = Jallow = Jalloh (Fulani cross-ethnic orthography)
- Fatou = Fatoumata (West African diminutive)
- Adeyemi = Adeyẹmi = Adeyẹmí (Yoruba tonal marks)
- Pierre = Peter = Pedro (colonial-era cross-linguistic)
- Irorere, Aibuedfe (Benin/Edo names with semantic meaning)
Growing via Wikidata + community curation. See datasets/ for the full dataset and contribution guide.
arche-core is designed to compose with the incumbent tools, not replace them. The three integration patterns:
# Presidio's English recognizers + arche's African recognizers
pip install arche-core[presidio]
# arche.detect.presidio surfaces both as one recognizer set.
# GLiNER's multilingual NER + arche's statute classification
pip install arche-core[detect]
# Pipeline(jurisdiction="NG", backend="gliner") routes soft-PII through GLiNER.
# Splink's record linkage + arche's jurisdiction-aware comparators
pip install arche-core[resolve]
# Statute-tagged detections feed Splink as clean inputs.arche.graph.audit ships an SQLite-backed append-only log that records every detection, every policy decision, and every action taken — queryable by compliance officers and regulators. PII values are never stored; only categories, span offsets, and document hashes. Markdown compliance report generator for regulator-ready exports.
These ship in the package but are not in the headline pitch — they support specific identity workflows on top of the detection layer:
arche.sign— Ed25519 + JWS + did:key signing forPipeline.Resultenvelopes. SD-JWT-VC issue / verify viaarche.credentials.sd_jwt. Seeexamples/02_sign_share_extract.pyandexamples/04_sd_jwt_credential.py.arche.workflow.dsar— citizen-side DSAR draft generation with per-jurisdiction statute citations. Seeexamples/03_dsar_workflow.py.arche.resolve— lightweight Fellegi-Sunter matcher with jurisdiction-specific priors.from arche import matchfor two-record comparison;from arche import linkfor cross-source resolution.arche.workflow._review— MPI review queue for human-in-the-loop match decisions. Not on the public surface; import from the canonical path.arche.resolve_places/arche.list_places— jurisdictional place lookup with verifiable audit receipts.
These are real tools we depend on internally. They are not the lead pitch.
Apache 2.0. By Unpatterned Labs.
arche-coreis under active development. APIs may change between alpha releases.