Add pii-openai module: embedded openai/privacy-filter PII redactor #32

Open
luckfamousa wants to merge 1 commit into ume/multi-module from ume/pii-openai

Conversation

@luckfamousa

Contributor

Summary

A new opt-in Maven module that loads the openai/privacy-filter ONNX model in-process via DJL + ONNX Runtime, and exposes a DeidentifhirHandler[StringType] for free-text FHIR fields (*.text, Observation.note.text, DiagnosticReport.conclusion, …). No Python sidecar.

Stacked on #31 (the multi-module restructure). Once that merges, this PR's base should be retargeted to main automatically.

Public API

val pf = PrivacyFilter()                                  // heavy, AutoCloseable
try {
  val redactor = new PrivacyFilterRedactor(pf)            // light
  registry.addHander("redact-pii", redactor.asStringTypeHandler)

  // …compose a Module that uses redact-pii on free-text paths…
} finally pf.close()

Highlights

Lazy model load

PrivacyFilter() construction is cheap — no download, no ONNX session. The 2.85 GB model + tokenizer materialize on the first detect()/redact() call via double-checked locking. close() is a no-op when nothing was ever loaded. Call prewarm() to force eager init behind a k8s readiness probe.

This means a deployment that pulls in deidentifhir-pii-openai but never invokes a redactor pays nothing at runtime — important for configurations where the redactor is one option among several and not always selected.
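The lazy-load behavior described above can be sketched roughly as follows. This is a minimal illustration of double-checked locking, not the module's actual code: `LoadedModel`, `LazyModel`, and the `load` callback are hypothetical stand-ins for the real ONNX session and tokenizer setup.

```scala
// Hypothetical stand-in for the heavy state (ONNX session + tokenizer).
final class LoadedModel extends AutoCloseable {
  @volatile var closed = false
  override def close(): Unit = closed = true
}

// Sketch of lazy init with double-checked locking.
// `load` stands in for download + session creation (an assumption).
final class LazyModel(load: () => LoadedModel) extends AutoCloseable {
  @volatile private var model: LoadedModel = null // null until first real use
  private val lock = new Object

  private def ensureLoaded(): LoadedModel = {
    var m = model                      // first check, without the lock
    if (m == null) lock.synchronized {
      m = model                        // second check, under the lock
      if (m == null) { m = load(); model = m }
    }
    m
  }

  def isLoaded: Boolean = model != null

  // Forces eager init, e.g. before reporting ready to an orchestrator.
  def prewarm(): Unit = { ensureLoaded(); () }

  // Trivial inputs short-circuit without touching the model; inference is elided.
  def detect(text: String): String =
    if (text.trim.isEmpty) text else { ensureLoaded(); text }

  // No-op when nothing was ever loaded.
  override def close(): Unit = lock.synchronized {
    val m = model
    if (m != null) { m.close(); model = null }
  }
}
```

The `@volatile` field is what makes the unlocked first check safe on the JVM; without it, another thread could observe a partially published reference.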

Constrained Viterbi decoder

ViterbiDecoder runs over the 33 BIOES labels. Forbidden transitions (B-X → I-Y cross-category, B-X → O drop-without-close, I-X as first token, B-X as last token) get -inf scores, so the decoder cannot emit them, whereas per-token argmax routinely does. It loads 6 transition biases from the model's viterbi_calibration.json. Viterbi is the default decoder; set PrivacyFilterConfig.useViterbi=false to switch to argmax for diagnostic comparison.

Latency cost is negligible: ~1.6M float ops for ~1500 tokens on 33 labels (sub-ms; dwarfed by model inference).
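For readers unfamiliar with constrained decoding, here is a small self-contained sketch of the idea. It is an illustration under assumptions, not the PR's ViterbiDecoder: the object name, the label construction, and the transition rules are written from the standard BIOES scheme, and the calibration biases are left out entirely.

```scala
object ConstrainedViterbi {
  private val NegInf = Double.NegativeInfinity

  // "O" plus B-/I-/E-/S- per category; 8 categories would give 33 labels.
  def labels(categories: Seq[String]): Vector[String] =
    "O" +: categories.flatMap(c => Seq(s"B-$c", s"I-$c", s"E-$c", s"S-$c")).toVector

  private def parts(label: String): (Char, String) =
    if (label == "O") ('O', "") else (label.charAt(0), label.substring(2))

  // An open span (B/I) must continue or close in the same category;
  // otherwise only O or a new span (B/S) may follow.
  def allowed(from: String, to: String): Boolean = {
    val (fromTag, fromCat) = parts(from)
    val (toTag, toCat) = parts(to)
    if (fromTag == 'B' || fromTag == 'I') (toTag == 'I' || toTag == 'E') && toCat == fromCat
    else toTag == 'O' || toTag == 'B' || toTag == 'S'
  }

  private def canStart(l: String): Boolean = !(l.startsWith("I-") || l.startsWith("E-"))
  private def canEnd(l: String): Boolean = !(l.startsWith("B-") || l.startsWith("I-"))

  // emissions(t)(j) is the (finite) log-score of label j at token t.
  def decode(emissions: IndexedSeq[Array[Double]], labs: Vector[String]): Seq[String] =
    if (emissions.isEmpty) Seq.empty
    else {
      val n = labs.size
      val len = emissions.size
      val score = Array.fill(len, n)(NegInf) // forbidden states stay at -inf
      val back = Array.fill(len, n)(-1)
      for (j <- 0 until n if canStart(labs(j))) score(0)(j) = emissions(0)(j)
      for (t <- 1 until len; j <- 0 until n; i <- 0 until n if allowed(labs(i), labs(j))) {
        val s = score(t - 1)(i) + emissions(t)(j)
        if (s > score(t)(j)) { score(t)(j) = s; back(t)(j) = i }
      }
      // Pick the best legal final label, then follow back-pointers.
      val path = new Array[Int](len)
      path(len - 1) = (0 until n).filter(j => canEnd(labs(j))).maxBy(j => score(len - 1)(j))
      for (t <- (len - 1) to 1 by -1) path(t - 1) = back(t)(path(t))
      path.toSeq.map(labs)
    }
}
```

The inner loop visits every (previous label, next label) pair per token, which is where the tokens × 33 × 33 operation count quoted above comes from. Because the all-O path is always legal, a finite-scoring path exists for any finite emissions.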

SHA256-pinned downloads

ModelCache pins SHA256 hashes for every cached file (model shards, tokenizer, config, calibration). Streaming hash via MessageDigest, no whole-file load — important for the 2 GB ONNX shards. verifyOrFetch flow:

  • cached + correct hash → use
  • cached + wrong hash → delete + redownload once
  • missing → download
  • always verify post-state, fail loudly on persistent mismatch

Catches corrupted partial downloads and silent upstream changes; gives reproducibility (same hashes ↔ same model behavior).
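The verifyOrFetch flow listed above might look roughly like the sketch below. It uses only `java.security.MessageDigest` and `java.nio.file.Files`; the object name `HashPinning`, the method signatures, and the `fetch` callback are illustrative assumptions rather than ModelCache's real API.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object HashPinning {
  // Streams the file through the digest in 64 KiB chunks,
  // so multi-GB shards are never held in memory at once.
  def sha256Hex(path: Path): String = {
    val md = MessageDigest.getInstance("SHA-256")
    val in = Files.newInputStream(path)
    try {
      val buf = new Array[Byte](1 << 16)
      var n = in.read(buf)
      while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
    } finally in.close()
    md.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  // `fetch` is a hypothetical downloader that writes the file to the given path.
  def verifyOrFetch(target: Path, expectedHex: String)(fetch: Path => Unit): Path = {
    val cachedOk = Files.exists(target) && sha256Hex(target) == expectedHex
    if (!cachedOk) {
      Files.deleteIfExists(target) // wrong hash: drop the stale copy
      fetch(target)                // missing or stale: download once
    }
    // Always verify the post-state and fail loudly on a persistent mismatch.
    val actual = sha256Hex(target)
    if (actual != expectedHex)
      throw new IllegalStateException(
        s"SHA-256 mismatch for $target: expected $expectedHex, got $actual")
    target
  }
}
```

Note that this sketch rehashes a freshly downloaded file once more than strictly necessary; that keeps the final check unconditional, which is the property the flow above emphasizes.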

Compatibility with UMEssen's engine

  • Targets the 4-arg DeidentifhirHandler signature ((Seq[String], T, Seq[Base], Map[String, String]) => T). Only PrivacyFilterRedactor constructs handlers; everything else is engine-independent.
  • ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive 1.19.0) because the model's MoE op uses an activation_alpha attribute that ORT 1.19 doesn't recognize.
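To make the 4-arg handler shape concrete, here is a hedged sketch of adapting a plain `String => String` redaction function to it. `Base` and `StringType` below are local stand-ins for the HAPI FHIR classes so the snippet compiles on its own, and `HandlerSketch.asStringTypeHandler` is a hypothetical helper, not necessarily how PrivacyFilterRedactor builds its handler.

```scala
// Local stand-ins for the HAPI FHIR types (assumption, for self-containment).
class Base
final class StringType(val value: String) extends Base

object HandlerSketch {
  // The 4-arg shape quoted above: (path, value, parents, static context) => value
  type DeidentifhirHandler[T] = (Seq[String], T, Seq[Base], Map[String, String]) => T

  // Hypothetical adapter: only the value is used; path, parents and context are ignored.
  def asStringTypeHandler(redact: String => String): DeidentifhirHandler[StringType] =
    (_, value, _, _) =>
      if (value == null || value.value == null) value // nothing to redact
      else new StringType(redact(value.value))
}
```

Keeping the adapter this thin is what lets the detection and decoding code stay independent of the engine's handler signature.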

Tests

32 unit tests, no model needed (mvn -pl pii-openai test runs them in <1 s):

  • BioDecoderTests (11): span aggregation, whitespace trim, adjacent merge, score filter, zero-width-offset anchoring.
  • ViterbiDecoderTests (11): every illegal transition class rejected, span rescue when a middle token is uncertain, bias-driven path preference between equally-scoring legal paths.
  • ModelCacheTests (6): streaming SHA256, all four verifyOrFetch branches via file:// URLs against synthetic fixtures, sanity check that every basename materialize references has a pinned hash.
  • PrivacyFilterLazyTests (4): construction does not load, the factory does not load, close is a no-op when never used, trivial inputs short-circuit. Uses an unreachable mirror URL so any future regression that reintroduces eager initialization fails immediately with a network error.

7 more integration tests gated by -Ddeidentifhir.pii.integration=1 — load the real 2.85 GB model and exercise it on English + German PII. Skipped by default so mvn test stays fast.

Performance (FP16, CPU, M-series Mac)

  • Cold start: ~600 ms tokenizer + 3-7 s model load (deferred to first use)
  • Warm latency: 70-130 ms for ≤ 200-char inputs, ~180 ms for ~5000-char inputs (banded attention → sub-linear scaling)
  • RSS after load: ~3.5 GB; near-zero before first call

Test plan

  • mvn test from repo root: 34 (core) + 32 (pii-openai unit) + 7 cancelled (gated integration) = 66 succeeded, 0 failed
  • mvn -pl pii-openai test -Ddeidentifhir.pii.integration=1: full 39 incl. integration
  • No regression in core (deidentifhir artifact unchanged)

Known follow-ups

  • INT8/INT4 quantized variants would halve/quarter RSS but use Microsoft contrib ops (GatherBlockQuantized) requiring onnxruntime-extensions. Defer until a benchmark justifies the effort.

🤖 Generated with Claude Code

A new sibling Maven module that loads the openai/privacy-filter ONNX
model in-process via DJL + ONNX Runtime — no Python sidecar required.
Provides a `DeidentifhirHandler[StringType]` for free-text FHIR fields
(`*.text`, `Observation.note.text`, `DiagnosticReport.conclusion`, …).

Public API:
  PrivacyFilter()                    // AutoCloseable, lazy model load
  PrivacyFilterRedactor(pf)          // adapts to DeidentifhirHandler
  PrivacyFilterConfig                // cache dir, mirror URL, score, etc.
  Span                               // detected-span result type

Highlights:
- **Lazy load**: PrivacyFilter() construction is cheap (no download, no
  ONNX session). The 2.85 GB model + tokenizer materialize on the first
  detect()/redact() call. close() is a no-op when never used. Call
  prewarm() to force eager init behind a k8s readiness probe.
- **Constrained Viterbi decoder** over the 33 BIOES labels. Forbidden
  transitions get -inf scores — argmax-per-token routinely produces
  illegal sequences (B-X→I-Y, B-X→O, I-X as first token); Viterbi
  can't. Loads 6 transition biases from the model's
  viterbi_calibration.json. Default decoder.
- **SHA256-pinned downloads**: every cached file is verified against
  hard-coded hashes on every use. Catches corrupted partial downloads
  and silent upstream changes; gives reproducibility.

Compatibility with UMEssen's engine:
- Targets the 4-arg DeidentifhirHandler signature
  (path, value, parents, static-context). Only PrivacyFilterRedactor
  cares; the rest of the module is engine-independent.
- ONNX Runtime is pinned to 1.25.0 (overrides DJL 0.30.0's transitive
  1.19.0) because the openai/privacy-filter MoE op uses an
  activation_alpha attribute that ORT 1.19 doesn't recognize.

Tests (32, no model needed; 7 more gated by -Ddeidentifhir.pii.integration=1):
- BioDecoderTests (11) — span aggregation, whitespace trim, adjacent
  merge, score filter, zero-width-offset anchoring.
- ViterbiDecoderTests (11) — every illegal transition class rejected,
  span rescue when middle token is uncertain, bias-driven preference.
- ModelCacheTests (6) — streaming SHA256, all four verifyOrFetch
  branches via file:// URLs against synthetic fixtures, sanity check
  that every basename has a pinned hash.
- PrivacyFilterLazyTests (4) — construction does not load, close is a
  no-op, trivial inputs short-circuit, factory does not load. Uses
  an unreachable mirror URL so any regression that re-eagerly
  initializes fails fast with a network error.

Performance (FP16, CPU, M-series Mac):
- Cold start: ~600ms tokenizer + 3-7s model load (deferred to first use)
- Warm latency: 70-130ms for ≤200-char inputs, ~180ms for ~5000 chars
- RSS after load: ~3.5 GB; near-zero before first call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>