Add pii-openai module: embedded openai/privacy-filter PII redactor by luckfamousa · Pull Request #32 · UMEssen/DeidentiFHIR

luckfamousa · 2026-05-08T14:02:11Z

Summary

A new opt-in Maven module that loads the openai/privacy-filter ONNX model in-process via DJL + ONNX Runtime, and exposes a DeidentifhirHandler[StringType] for free-text FHIR fields (*.text, Observation.note.text, DiagnosticReport.conclusion, …). No Python sidecar.

Stacked on #31 (the multi-module restructure). When that merges, this PR's base is retargeted to main automatically.

Public API

val pf = PrivacyFilter()                                  // heavy, AutoCloseable
try {
  val redactor = new PrivacyFilterRedactor(pf)            // light
  registry.addHander("redact-pii", redactor.asStringTypeHandler)

  // …compose a Module that uses redact-pii on free-text paths…
} finally pf.close()

Highlights

Lazy model load

PrivacyFilter() construction is cheap — no download, no ONNX session. The 2.85 GB model + tokenizer materialize on the first detect()/redact() call via double-checked locking. close() is a no-op when nothing was ever loaded. Call prewarm() to force eager init behind a k8s readiness probe.

This means a deployment that pulls in deidentifhir-pii-openai but never invokes a redactor pays nothing at runtime — important for configurations where the redactor is one option among several and not always selected.
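The lazy-load behaviour described above can be sketched as follows. This is a minimal illustration of double-checked locking around an AutoCloseable resource, not the module's actual code: `LoadedModel` and `LazyResource` are hypothetical stand-ins for the ONNX session and `PrivacyFilter`, and the inference step is omitted.

```scala
// Hypothetical stand-in for the heavy ONNX session + tokenizer (not the module's real type).
final class LoadedModel extends AutoCloseable {
  @volatile var closed = false
  override def close(): Unit = closed = true
}

// Sketch of a lazily initialised holder using double-checked locking.
final class LazyResource(load: () => LoadedModel) extends AutoCloseable {
  @volatile private var model: LoadedModel = null // stays null until first use
  private val lock = new Object

  private def getOrLoad(): LoadedModel = {
    var m = model // first check, without taking the lock
    if (m == null) {
      lock.synchronized {
        m = model // second check, under the lock
        if (m == null) {
          m = load()
          model = m // volatile write publishes the fully built object
        }
      }
    }
    m
  }

  def isLoaded: Boolean = model != null

  // Forces eager initialisation, e.g. before reporting ready to a probe.
  def prewarm(): Unit = { getOrLoad(); () }

  def detect(text: String): Seq[String] =
    if (text.trim.isEmpty) Seq.empty // trivial input short-circuits without loading
    else { getOrLoad(); Seq.empty }  // real inference omitted in this sketch

  // No-op when nothing was ever loaded.
  override def close(): Unit = lock.synchronized {
    val m = model
    if (m != null) { m.close(); model = null }
  }
}
```

In this sketch a call to `detect` after `close()` would load the model again; whether the real class allows reuse after close is not stated in the description.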

Constrained Viterbi decoder

ViterbiDecoder over the 33 BIOES labels. Forbidden transitions (B-X → I-Y cross-category, B-X → O drop-without-close, I-X as first token, B-X as last) get -inf scores — argmax-per-token routinely produces these; Viterbi can't. Loads 6 transition biases from the model's viterbi_calibration.json. Default decoder; switchable to argmax via PrivacyFilterConfig.useViterbi=false for diagnostic comparison.

Latency cost is negligible: ~1.6M float ops for ~1500 tokens on 33 labels (sub-ms; dwarfed by model inference).
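The cost estimate follows from the inner loop of Viterbi, which visits every (previous label, next label) pair per token: 1500 × 33 × 33 = 1,633,500 additions and comparisons. A label count of 33 is consistent with O plus B/I/E/S tags for 8 categories (1 + 4 × 8). The sketch below illustrates constrained decoding under these assumptions; the `Tag` types, the `allowed` rule set, and the `bias` parameter are illustrative names and are not taken from the module.

```scala
object ViterbiSketch {
  sealed trait Tag
  case object O extends Tag
  final case class B(cat: String) extends Tag
  final case class I(cat: String) extends Tag
  final case class E(cat: String) extends Tag
  final case class S(cat: String) extends Tag

  private val NegInf = Double.NegativeInfinity

  // O plus four tags per category.
  def labels(cats: Seq[String]): Vector[Tag] =
    O +: cats.toVector.flatMap(c => Vector[Tag](B(c), I(c), E(c), S(c)))

  def canStart(t: Tag): Boolean = t match { case I(_) | E(_) => false; case _ => true }
  def canEnd(t: Tag): Boolean = t match { case B(_) | I(_) => false; case _ => true }

  def allowed(prev: Tag, next: Tag): Boolean = (prev, next) match {
    case (B(a), I(b)) => a == b
    case (B(a), E(b)) => a == b
    case (I(a), I(b)) => a == b
    case (I(a), E(b)) => a == b
    case (B(_) | I(_), _) => false // an open span must continue or close
    case (_, I(_) | E(_)) => false // cannot continue a span that is not open
    case _ => true                 // O/E/S may be followed by O/B/S
  }

  // emissions(t)(j) is the model score of tags(j) at token t; bias is an optional
  // per-transition offset (zero by default in this sketch).
  def decode(emissions: Array[Array[Double]], tags: Vector[Tag],
             bias: (Tag, Tag) => Double = (_, _) => 0.0): Vector[Tag] = {
    val n = emissions.length
    val k = tags.length
    if (n == 0) Vector.empty
    else {
      val score = Array.fill(n, k)(NegInf)
      val back = Array.fill(n, k)(-1)
      for (j <- 0 until k if canStart(tags(j))) score(0)(j) = emissions(0)(j)
      for (t <- 1 until n; j <- 0 until k) {
        var best = NegInf
        var arg = -1
        for (i <- 0 until k if allowed(tags(i), tags(j))) {
          val s = score(t - 1)(i) + bias(tags(i), tags(j))
          if (s > best) { best = s; arg = i } // forbidden or unreachable paths stay at -inf
        }
        if (arg >= 0) { score(t)(j) = best + emissions(t)(j); back(t)(j) = arg }
      }
      var last = -1
      for (j <- 0 until k if canEnd(tags(j)))
        if (last < 0 || score(n - 1)(j) > score(n - 1)(last)) last = j
      require(last >= 0 && score(n - 1)(last) > NegInf, "no legal label path")
      val path = new Array[Int](n)
      path(n - 1) = last
      for (t <- (n - 1) until 0 by -1) path(t - 1) = back(t)(path(t))
      path.toVector.map(tags)
    }
  }
}
```

With per-token scores that favour B-NAME, I-DATE, E-NAME, per-token argmax would emit that illegal sequence, while `decode` falls back to the best legal path B-NAME, I-NAME, E-NAME.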

SHA256-pinned downloads

ModelCache pins SHA256 hashes for every cached file (model shards, tokenizer, config, calibration). Streaming hash via MessageDigest, no whole-file load — important for the 2 GB ONNX shards. verifyOrFetch flow:

cached + correct hash → use
cached + wrong hash → delete + redownload once
missing → download
always verify post-state, fail loudly on persistent mismatch

Catches corrupted partial downloads and silent upstream changes; gives reproducibility (same hashes ↔ same model behavior).
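The flow above can be sketched with the JDK's `MessageDigest` and `java.nio.file` APIs. This is an illustration under assumptions, not the module's `ModelCache`: the object name `CacheSketch`, the method signatures, and the 64 KiB buffer size are made up for the example.

```scala
import java.net.URI
import java.nio.file.{Files, Path, StandardCopyOption}
import java.security.MessageDigest

object CacheSketch {
  // Streams the file through the digest in fixed-size chunks, so a multi-GB
  // shard is never held in memory as a whole.
  def sha256(path: Path): String = {
    val md = MessageDigest.getInstance("SHA-256")
    val in = Files.newInputStream(path)
    try {
      val buf = new Array[Byte](1 << 16)
      var n = in.read(buf)
      while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
    } finally in.close()
    md.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  private def download(url: URI, target: Path): Unit = {
    val in = url.toURL.openStream()
    try { Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING); () }
    finally in.close()
  }

  def verifyOrFetch(target: Path, url: URI, expected: String): Path = {
    val cachedOk = Files.exists(target) && sha256(target) == expected
    if (!cachedOk) {
      Files.deleteIfExists(target) // cached with wrong hash: drop it
      download(url, target)        // missing or just deleted: fetch once
      val actual = sha256(target)  // verify the post-download state
      if (actual != expected)
        throw new IllegalStateException(
          s"SHA-256 mismatch for $target: expected $expected, got $actual")
    }
    target
  }
}
```

Because `URI.toURL.openStream()` also handles `file://` URLs, every branch can be exercised against local fixtures without network access, which matches the testing approach the description mentions.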

Compatibility with UMEssen's engine

Targets the 4-arg DeidentifhirHandler signature ((Seq[String], T, Seq[Base], Map[String, String]) => T). Only PrivacyFilterRedactor constructs handlers; everything else is engine-independent.
ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive 1.19.0) because the model's MoE op uses an activation_alpha attribute that ORT 1.19 doesn't recognize.
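To make the 4-arg handler shape concrete, here is a self-contained sketch of adapting a plain `String => String` redaction function to that signature. `Base` and `StringType` below are simplified stand-ins for the HAPI FHIR classes, and `asStringTypeHandler` is written from the signature quoted above rather than from the module's source, so treat the details as assumptions.

```scala
object HandlerSketch {
  // Simplified stand-ins for the HAPI FHIR model classes (assumption for this sketch).
  class Base
  final class StringType(var value: String) extends Base

  // The 4-arg shape quoted above: (path, value, parents, static context) => value.
  type DeidentifhirHandler[T] = (Seq[String], T, Seq[Base], Map[String, String]) => T

  // Wraps a redaction function; path, parents and context are ignored here.
  def asStringTypeHandler(redact: String => String): DeidentifhirHandler[StringType] =
    (_, value, _, _) => {
      if (value != null && value.value != null) value.value = redact(value.value)
      value
    }
}
```

A handler built this way could be invoked as `h(Seq("Observation", "note", "text"), new HandlerSketch.StringType("Seen by Alice"), Seq.empty, Map.empty)`, where `h` wraps whatever redaction function is supplied.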

Tests

32 unit tests, no model needed (mvn -pl pii-openai test runs them in <1 s):

BioDecoderTests (11): span aggregation, whitespace trim, adjacent merge, score filter, zero-width-offset anchoring.
ViterbiDecoderTests (11): every illegal transition class rejected, span rescue when a middle token is uncertain, bias-driven path preference between equally-scoring legal paths.
ModelCacheTests (6): streaming SHA256, all four verifyOrFetch branches via file:// URLs against synthetic fixtures, sanity check that every basename referenced by materialize has a pinned hash.
PrivacyFilterLazyTests (4): construction does not load, close is a no-op when never used, trivial inputs short-circuit, factory does not load. Uses an unreachable mirror URL so any future regression that reintroduces eager initialization fails immediately with a network error.
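As an illustration of what span aggregation, whitespace trimming, and score filtering might look like, the sketch below turns a legal BIOES label sequence with character offsets into spans. The `Token` and `Span` shapes, the mean-score rule, and the `minScore` parameter are assumptions for the example; the module's own decoder and `Span` type may differ.

```scala
object SpanSketch {
  final case class Token(start: Int, end: Int, label: String, score: Double)
  final case class Span(category: String, start: Int, end: Int, score: Double)

  // Assumes labels already form a legal BIOES sequence (e.g. Viterbi output).
  def aggregate(text: String, tokens: Seq[Token], minScore: Double): Vector[Span] = {
    val out = Vector.newBuilder[Span]
    var open = Vector.empty[Token]

    def flush(cat: String): Unit = {
      var s = open.head.start
      var e = open.last.end
      // Trim whitespace that tokenizer offsets often include at span edges.
      while (s < e && text.charAt(s).isWhitespace) s += 1
      while (e > s && text.charAt(e - 1).isWhitespace) e -= 1
      val score = open.map(_.score).sum / open.size // mean token score (assumption)
      if (e > s && score >= minScore) out += Span(cat, s, e, score)
      open = Vector.empty
    }

    for (tok <- tokens) tok.label.split("-", 2) match {
      case Array("S", cat)                  => open = Vector(tok); flush(cat)
      case Array("B", _)                    => open = Vector(tok)
      case Array("I", _) if open.nonEmpty   => open :+= tok
      case Array("E", cat) if open.nonEmpty => open :+= tok; flush(cat)
      case _                                => open = Vector.empty // "O" or an orphan tag
    }
    out.result()
  }
}
```

For the text "Dr. Alice Smith saw Bob" with tokens " Alice" (B-NAME) and " Smith" (E-NAME), the resulting span covers "Alice Smith" without the leading space, and a low-scoring S-NAME on " Bob" is dropped when it falls below `minScore`.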

7 more integration tests gated by -Ddeidentifhir.pii.integration=1 — load the real 2.85 GB model and exercise it on English + German PII. Skipped by default so mvn test stays fast.

Performance (FP16, CPU, M-series Mac)

Cold start: ~600 ms tokenizer + 3-7 s model load (deferred to first use)
Warm latency: 70-130 ms for ≤ 200-char inputs, ~180 ms for ~5000-char inputs (banded attention → sub-linear scaling)
RSS after load: ~3.5 GB; near-zero before first call

Test plan

mvn test from repo root: 34 (core) + 32 (pii-openai unit) = 66 succeeded, 0 failed; the 7 gated integration tests are reported as cancelled
mvn -pl pii-openai test -Ddeidentifhir.pii.integration=1: full 39 incl. integration
No regression in core (deidentifhir artifact unchanged)

Known follow-ups

INT8/INT4 quantized variants would halve/quarter RSS but use Microsoft contrib ops (GatherBlockQuantized) requiring onnxruntime-extensions. Defer until a benchmark justifies the effort.

🤖 Generated with Claude Code

A new sibling Maven module that loads the openai/privacy-filter ONNX model in-process via DJL + ONNX Runtime, with no Python sidecar required. Provides a `DeidentifhirHandler[StringType]` for free-text FHIR fields (`*.text`, `Observation.note.text`, `DiagnosticReport.conclusion`, …).

Public API:

    PrivacyFilter()             // AutoCloseable, lazy model load
    PrivacyFilterRedactor(pf)   // adapts to DeidentifhirHandler
    PrivacyFilterConfig         // cache dir, mirror URL, score, etc.
    Span                        // detected-span result type

Highlights:

- **Lazy load**: PrivacyFilter() construction is cheap (no download, no ONNX session). The 2.85 GB model + tokenizer materialize on the first detect()/redact() call. close() is a no-op when never used. Call prewarm() to force eager init behind a k8s readiness probe.
- **Constrained Viterbi decoder** over the 33 BIOES labels. Forbidden transitions get -inf scores: argmax-per-token routinely produces illegal sequences (B-X→I-Y, B-X→O, I-X as first token); Viterbi can't. Loads 6 transition biases from the model's viterbi_calibration.json. Default decoder.
- **SHA256-pinned downloads**: every cached file is verified against hard-coded hashes on every use. Catches corrupted partial downloads and silent upstream changes; gives reproducibility.

Compatibility with UMEssen's engine:

- Targets the 4-arg DeidentifhirHandler signature (path, value, parents, static-context). Only PrivacyFilterRedactor cares; the rest of the module is engine-independent.
- ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive 1.19.0) because the openai/privacy-filter MoE op uses an activation_alpha attribute that ORT 1.19 doesn't recognize.

Tests (32, no model needed; 7 more gated by -Ddeidentifhir.pii.integration=1):

- BioDecoderTests (11): span aggregation, whitespace trim, adjacent merge, score filter, zero-width-offset anchoring.
- ViterbiDecoderTests (11): every illegal transition class rejected, span rescue when middle token is uncertain, bias-driven preference.
- ModelCacheTests (6): streaming SHA256, all four verifyOrFetch branches via file:// URLs against synthetic fixtures, sanity check that every basename has a pinned hash.
- PrivacyFilterLazyTests (4): construction does not load, close is a no-op, trivial inputs short-circuit, factory does not load. Uses an unreachable mirror URL so any regression that re-eagerly initializes fails fast with a network error.

Performance (FP16, CPU, M-series Mac):

- Cold start: ~600ms tokenizer + 3-7s model load (deferred to first use)
- Warm latency: 70-130ms for ≤200-char inputs, ~180ms for ~5000 chars
- RSS after load: ~3.5 GB; near-zero before first call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

luckfamousa mentioned this pull request May 8, 2026

Add service-http: FHIR $deidentify HTTP microservice #33 (Open, 5 tasks)

luckfamousa assigned tobiasgirardet and unassigned tobiasgirardet May 8, 2026

luckfamousa wants to merge 1 commit into ume/multi-module from ume/pii-openai
