Add pii-openai module: embedded openai/privacy-filter PII redactor #32
Open
luckfamousa wants to merge 1 commit into
A new sibling Maven module that loads the openai/privacy-filter ONNX model in-process via DJL + ONNX Runtime, with no Python sidecar required. Provides a `DeidentifhirHandler[StringType]` for free-text FHIR fields (`*.text`, `Observation.note.text`, `DiagnosticReport.conclusion`, …).

Public API:

    PrivacyFilter()            // AutoCloseable, lazy model load
    PrivacyFilterRedactor(pf)  // adapts to DeidentifhirHandler
    PrivacyFilterConfig        // cache dir, mirror URL, score, etc.
    Span                       // detected-span result type

Highlights:

- **Lazy load**: PrivacyFilter() construction is cheap (no download, no ONNX session). The 2.85 GB model + tokenizer materialize on the first detect()/redact() call. close() is a no-op when never used. Call prewarm() to force eager init behind a k8s readiness probe.
- **Constrained Viterbi decoder** over the 33 BIOES labels. Forbidden transitions get -inf scores: argmax-per-token routinely produces illegal sequences (B-X→I-Y, B-X→O, I-X as first token); Viterbi can't. Loads 6 transition biases from the model's viterbi_calibration.json. Default decoder.
- **SHA256-pinned downloads**: every cached file is verified against hard-coded hashes on every use. Catches corrupted partial downloads and silent upstream changes; gives reproducibility.

Compatibility with UMEssen's engine:

- Targets the 4-arg DeidentifhirHandler signature (path, value, parents, static-context). Only PrivacyFilterRedactor cares; the rest of the module is engine-independent.
- ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive 1.19.0) because the openai/privacy-filter MoE op uses an activation_alpha attribute that ORT 1.19 doesn't recognize.

Tests (32, no model needed; 7 more gated by -Ddeidentifhir.pii.integration=1):

- BioDecoderTests (11): span aggregation, whitespace trim, adjacent merge, score filter, zero-width-offset anchoring.
- ViterbiDecoderTests (11): every illegal transition class rejected, span rescue when middle token is uncertain, bias-driven preference.
- ModelCacheTests (6): streaming SHA256, all four verifyOrFetch branches via file:// URLs against synthetic fixtures, sanity check that every basename has a pinned hash.
- PrivacyFilterLazyTests (4): construction does not load, close is a no-op, trivial inputs short-circuit, factory does not load. Uses an unreachable mirror URL so any regression that re-eagerly initializes fails fast with a network error.

Performance (FP16, CPU, M-series Mac):

- Cold start: ~600ms tokenizer + 3-7s model load (deferred to first use)
- Warm latency: 70-130ms for ≤200-char inputs, ~180ms for ~5000 chars
- RSS after load: ~3.5 GB; near-zero before first call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
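A new opt-in Maven module that loads the `openai/privacy-filter` ONNX model in-process via DJL + ONNX Runtime, and exposes a `DeidentifhirHandler[StringType]` for free-text FHIR fields (`*.text`, `Observation.note.text`, `DiagnosticReport.conclusion`, …). No Python sidecar.

Public API

- `PrivacyFilter()`: `AutoCloseable`, lazy model load
- `PrivacyFilterRedactor(pf)`: adapts to `DeidentifhirHandler`
- `PrivacyFilterConfig`: cache dir, mirror URL, score, etc.
- `Span`: detected-span result type
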
Highlights
Lazy model load
`PrivacyFilter()` construction is cheap: no download, no ONNX session. The 2.85 GB model + tokenizer materialize on the first `detect()`/`redact()` call via double-checked locking. `close()` is a no-op when nothing was ever loaded. Call `prewarm()` to force eager init behind a k8s readiness probe.

This means a deployment that pulls in `deidentifhir-pii-openai` but never invokes a redactor pays nothing at runtime, which matters for configurations where the redactor is one option among several and not always selected.

Constrained Viterbi decoder
`ViterbiDecoder` over the 33 BIOES labels. Forbidden transitions (`B-X → I-Y` cross-category, `B-X → O` drop-without-close, `I-X` as first token, `B-X` as last) get -inf scores; argmax-per-token routinely produces these, Viterbi can't. Loads 6 transition biases from the model's `viterbi_calibration.json`. Default decoder; switchable to argmax via `PrivacyFilterConfig.useViterbi=false` for diagnostic comparison.

Latency cost is negligible: ~1.6M float ops for ~1500 tokens on 33 labels (sub-ms; dwarfed by model inference).
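
To illustrate the constrained decoding idea, here is a minimal Java sketch. It is not the module's `ViterbiDecoder`: the class name `ConstrainedViterbi`, the two-category label set (`NAME`, `DATE`) and the emission scores are all hypothetical, and the calibrated transition biases are left out. It only shows how forbidding illegal BIOES transitions changes the result compared to per-token argmax.

```java
import java.util.Arrays;

// Illustrative sketch only; names and scores are assumptions, not the PR's code.
final class ConstrainedViterbi {
    static final double NEG_INF = Double.NEGATIVE_INFINITY;

    static char prefix(String label) { return label.charAt(0); }
    static String category(String label) { return label.length() > 2 ? label.substring(2) : ""; }
    static boolean insideSpan(String l) { return prefix(l) == 'B' || prefix(l) == 'I'; }
    // A sequence may start with O, B-X or S-X and may end with O, E-X or S-X.
    static boolean canStart(String l) { char p = prefix(l); return p == 'O' || p == 'B' || p == 'S'; }
    static boolean canEnd(String l) { char p = prefix(l); return p == 'O' || p == 'E' || p == 'S'; }

    static boolean allowed(String from, String to) {
        if (insideSpan(from)) {
            // An open span must continue (I) or close (E) in the same category.
            char p = prefix(to);
            return (p == 'I' || p == 'E') && category(from).equals(category(to));
        }
        // After O, E-X or S-X only O or the start of a new span may follow.
        return canStart(to);
    }

    // emissions[t][j] is the score of label j at token t; returns the best legal path.
    static int[] decode(double[][] emissions, String[] labels) {
        int n = emissions.length, k = labels.length;
        if (n == 0) return new int[0];
        double[][] score = new double[n][k];
        int[][] back = new int[n][k];
        for (int j = 0; j < k; j++)
            score[0][j] = canStart(labels[j]) ? emissions[0][j] : NEG_INF;
        for (int t = 1; t < n; t++) {
            for (int j = 0; j < k; j++) {
                double best = NEG_INF;
                int arg = 0;
                for (int i = 0; i < k; i++) {
                    // Forbidden transitions are skipped, i.e. scored as -inf.
                    if (allowed(labels[i], labels[j]) && score[t - 1][i] > best) {
                        best = score[t - 1][i];
                        arg = i;
                    }
                }
                score[t][j] = best + emissions[t][j];
                back[t][j] = arg;
            }
        }
        int last = 0;
        double best = NEG_INF;
        for (int j = 0; j < k; j++) {
            double s = canEnd(labels[j]) ? score[n - 1][j] : NEG_INF;
            if (s > best) { best = s; last = j; }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    // Per-token argmax, for comparison.
    static int[] argmax(double[][] emissions) {
        int[] out = new int[emissions.length];
        for (int t = 0; t < emissions.length; t++)
            for (int j = 1; j < emissions[t].length; j++)
                if (emissions[t][j] > emissions[t][out[t]]) out[t] = j;
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"O", "B-NAME", "I-NAME", "E-NAME", "S-NAME",
                           "B-DATE", "I-DATE", "E-DATE", "S-DATE"};
        double[][] em = new double[3][labels.length];
        em[0][1] = 5.0;                 // token 0 strongly B-NAME
        em[1][6] = 3.0; em[1][2] = 2.5; // token 1 slightly prefers I-DATE over I-NAME
        em[2][0] = 3.0; em[2][3] = 2.5; // token 2 slightly prefers O over E-NAME
        System.out.println("argmax:  " + Arrays.toString(argmax(em)));
        System.out.println("viterbi: " + Arrays.toString(decode(em, labels)));
    }
}
```

With these made-up scores, per-token argmax picks `B-NAME, I-DATE, O`, which is illegal twice over, while the constrained decoder returns `B-NAME, I-NAME, E-NAME`. The triple loop is also where the ~1.6M figure above comes from: 1500 tokens × 33 × 33 label pairs.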
SHA256-pinned downloads
`ModelCache` pins SHA256 hashes for every cached file (model shards, tokenizer, config, calibration). Streaming hash via `MessageDigest`, no whole-file load, which matters for the 2 GB ONNX shards.

The `verifyOrFetch` flow catches corrupted partial downloads and silent upstream changes, and gives reproducibility (same hashes ↔ same model behavior).
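
The streaming check can be sketched in a few lines of Java. This is an illustration under assumptions, not the module's `ModelCache`: the class `StreamingSha256` and the methods `sha256Hex` and `matchesPin` are hypothetical names, and the download and branch logic of `verifyOrFetch` is left out. It shows only the part the text describes: hashing a file in fixed-size chunks with `MessageDigest` and comparing the result to a pinned hex digest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; names are assumptions, not the PR's code.
final class StreamingSha256 {

    // Hashes the file in 64 KiB chunks so memory use stays constant
    // regardless of file size.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder(64);
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True only if the file exists and its digest equals the pinned value.
    static boolean matchesPin(Path file, String pinnedHex)
            throws IOException, NoSuchAlgorithmException {
        return Files.isRegularFile(file) && sha256Hex(file).equalsIgnoreCase(pinnedHex);
    }

    public static void main(String[] args) throws Exception {
        // Synthetic fixture: the SHA-256 of the three bytes "abc" is a published test vector.
        String pin = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
        Path tmp = Files.createTempFile("pin-demo", ".bin");
        try {
            Files.write(tmp, "abc".getBytes(StandardCharsets.UTF_8));
            System.out.println("intact file matches pin:   " + matchesPin(tmp, pin));
            Files.write(tmp, "abd".getBytes(StandardCharsets.UTF_8));
            System.out.println("modified file matches pin: " + matchesPin(tmp, pin));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller in this style would re-download (or fail) whenever `matchesPin` returns false, which covers both truncated downloads and upstream files that changed without notice.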
Compatibility with UMEssen's engine
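- Targets the 4-arg `DeidentifhirHandler` signature (`(Seq[String], T, Seq[Base], Map[String, String]) => T`). Only `PrivacyFilterRedactor` constructs handlers; everything else is engine-independent.
- ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive 1.19.0) because the openai/privacy-filter MoE op uses an `activation_alpha` attribute that ORT 1.19 doesn't recognize.

Tests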
32 unit tests, no model needed (`mvn -pl pii-openai test` runs them in <1 s):

- `BioDecoderTests` (11): span aggregation, whitespace trim, adjacent merge, score filter, zero-width-offset anchoring.
- `ViterbiDecoderTests` (11): every illegal transition class rejected, span rescue when a middle token is uncertain, bias-driven path preference between equally-scoring legal paths.
- `ModelCacheTests` (6): streaming SHA256, all four `verifyOrFetch` branches via `file://` URLs against synthetic fixtures, sanity check that every basename `materialize` references has a pinned hash.
- `PrivacyFilterLazyTests` (4): construction does not load, close is a no-op when never used, trivial inputs short-circuit. Uses an unreachable mirror URL so any future regression that re-eagerly initializes fails immediately with a network error.

7 more integration tests, gated by `-Ddeidentifhir.pii.integration=1`, load the real 2.85 GB model and exercise it on English + German PII. Skipped by default so `mvn test` stays fast.

Performance (FP16, CPU, M-series Mac)

- Cold start: ~600 ms tokenizer + 3-7 s model load (deferred to first use)
- Warm latency: 70-130 ms for ≤200-char inputs, ~180 ms for ~5000 chars
- RSS after load: ~3.5 GB; near-zero before first call

Test plan
- `mvn test` from repo root: 34 (core) + 32 (pii-openai unit) + 7 cancelled (gated integration) = 66 succeeded, 0 failed
- `mvn -pl pii-openai test -Ddeidentifhir.pii.integration=1`: full 39 incl. integration

Known follow-ups
`GatherBlockQuantized`) requiring `onnxruntime-extensions`. Defer until a benchmark justifies the effort.

🤖 Generated with Claude Code