feat: automated compliance monitoring with scheduled re-classification#825
feat: automated compliance monitoring with scheduled re-classification#825madsysharma wants to merge 12 commits into
Conversation
|
Hi @SdSarthak , please review my PR for this issue. Thanks. |
|
Hi @SdSarthak , please review this updated PR. Thank you. |
SdSarthak
left a comment
There was a problem hiding this comment.
This PR has several blocking issues:
-
Destructive migration — The migration drops the
rag_queriestable inupgrade(). This is a destructive operation that would wipe query history in production. This must not be in the migration. -
Drops unique constraint — The migration removes
uq_ai_system_owner_nameadded by a recent merge. This is a regression. If this PR doesn't depend on that constraint being removed, please regenerate the migration from a freshalembic revision --autogenerate. -
Missing model — The code imports
from app.models.compliance_drift_event import ComplianceDriftEventbut this model file does not exist. The server will fail to start with anImportError. -
Breaking API change — Changing
PUT /{system_id}toPATCHbreaks existing clients without a versioning path. -
Missing
apschedulerdependency — If the scheduler uses apscheduler, it must be added torequirements.txt.
Please: (a) regenerate the migration without destructive ops, (b) add the missing ComplianceDriftEvent model, (c) keep the PUT endpoint or add PATCH as a separate route, (d) verify all dependencies are in requirements.txt.
e75c130 to
895f5c4
Compare
|
Hi @SdSarthak , please check this updated PR. Not sure why you're seeing the outdated versions of it, but some of the issues you mentioned were already covered in the previous commits. Thank you. |
d8ef223 to
b023a36
Compare
feat(compliance): scheduled drift monitoring with notifications & webhooks
Closes issue #82
Summary
This PR implements automated compliance monitoring that re-runs the EU AI Act risk classifier on a nightly schedule against every AI system whose owner has opted in. When the result diverges from the stored classification - risk level, status, or the classifier version itself - a
ComplianceDriftEventrow is persisted, an in-app notification is created for the owner, and an optional per system webhook is delivered with an HMAC-SHA256 signature.Rationale
Owners shouldn't have to remember to manually re-classify after a regulation update, a questionnaire-logic change, or a slow drift in how their system is used. The job runs at 02:00 UTC by default, costs nothing when no drift is detected (just a quick comparison), and gives the operators an audit trail.
What's in the PR
New backend modules
backend/app/models/compliance_drift_event.py: one row per detected drift. Stores before/after risk + status, the drift type (risk_change,status_change,classifier_version_change,mixed), the classifier version that produced it, and webhook delivery outcome.backend/app/modules/compliance/monitor.py: the scanning logic. Iterates monitored systems in chunks, re-builds aRiskClassificationRequestfrom each system's storedquestionnaire_responses, calls the existingclassify_risk(), and diffs against current state. Acquires a Postgres advisory lock so multi-replica deployments don't run the scan twice.backend/app/modules/compliance/notifier.py: in-app + webhook dispatch. Webhook bodies are signed with HMAC-SHA256 (X-AegisAI-Signature: sha256=<hex>), use canonical JSON for deterministic signing, and retry on 5xx / network errors with exponential backoff (1–10s, 3 attempts). 4xx responses are terminal.backend/app/core/scheduler.py: APSchedulerAsyncIOSchedulerstarted in the FastAPI lifespan. Cron expression in env varCOMPLIANCE_MONITOR_CRON(default0 2 * * *; empty disables).backend/app/api/v1/compliance_monitoring.py: five endpoints:GET /ai-systems/{id}/monitoring: read settingsPATCH /ai-systems/{id}/monitoring: toggle + webhook URLPOST /ai-systems/{id}/monitoring/rotate-secret: generate HMAC secret (returned exactly once)GET /ai-systems/{id}/drift-events: paginated historyPOST /admin/compliance/scan: manual triggerSchema changes
AISystemgains three columns:monitoring_enabled(defaulttrue),webhook_url,webhook_secret. New tablecompliance_drift_eventswith FK + cascade delete fromai_systems. The existingNotificationmodel (already hadCOMPLIANCE_DRIFTin its enum, just not registered) is now wired up viaUser.notificationsback-population.The repo uses
Base.metadata.create_all()at startup, so all schema changes apply automatically on first boot. There's no Alembic migration in this PR: none of the existing models has one either; I would recommend following up by introducing Alembic across the whole project rather than just bare bones implementations.Tests
backend/tests/test_compliance_monitor.py: 12 tests across four classes:TestMonitorScan: no-drift case, risk-change case, disabled systems skipped, systems with no questionnaire skipped, classifier-version-change case.TestInAppNotifications: notification row created, fields correct, event flaggednotified_in_app=True.TestWebhookDispatch: HMAC signature matches; retry-on-5xx succeeds on third attempt; 4xx is terminal; no webhook when URL unset.TestMonitoringEndpoints: PATCH settings, rotate-secret shows secret exactly once, drift events list paginates, returns 404 for another user's system (no enumeration leak).All scenarios validated standalone in 0.4s; full pytest collection will work once
conftestfixturesdb_session,client,test_user,other_user,auth_headersare present (they already exist in the repo'stests/conftest.py).Docs
docs/compliance/drift-monitoring.md: schedule + cron config, per-system settings, drift type taxonomy, full webhook payload schema, signature verification snippets in Python + Node.js, retry semantics, operational concerns, manual-trigger example.Design decisions
APScheduler over Celery Beat. No Celery worker is deployed today. Adding one for a nightly job would be more infra than warranted. The in-process scheduler is started/stopped in the lifespan handler so uvicorn's
--reloaddoesn't leak schedulers.pg_try_advisory_lockfor multi-replica safety. Each replica's scheduler fires independently. The advisory lock guarantees exactly one replica runs the scan; the others log a skip and exit. On non-Postgres backends (SQLite in tests) the lock helper is a no-op.Classifier indirection (
_resolve_classifier). The classifier lives inapp.api.v1.classification, which transitively imports langchain: a problem for unit tests that want to stub it without installing the full RAG stack. The monitor calls_resolve_classifier()(request)instead of importingclassify_riskat module load; tests override_resolve_classifier.Updated
risk_levelafter a drift detection. Otherwise every scan would re-flag the same change forever. The drift event row retains the old -> new values for the audit trail; the system row moves to the new baseline.Bump
CLASSIFIER_VERSIONto trigger re-review notifications. When the questionnaire logic changes (classify_risk()inclassification.py), bumping the version constant emits aclassifier_version_changedrift event for every monitored system on the next scan, even when the actual result didn't change. This is the audit trail signal owners need to re-review their classification in light of new logic.Webhook secret returned exactly once.
POST /rotate-secretis the only path that returns the secret in the response. PATCH doesn't, GET doesn't. This prevents the common mistake of fetching monitoring settings and accidentally logging the secret in client code.Canonical JSON for webhook bodies. Sorted keys, no whitespace. Required for HMAC to be reproducible: without this, every encoding-library would compute a slightly different signature.
Testing
Manual end-to-end:
Out-of-scope
Webhook re-delivery job. A drift event whose webhook delivery exhausted the in-scan retries is not re-attempted on subsequent scans. The state is recorded on the event row (
webhook_response_code,webhook_error) for now; a periodic re-delivery sweep is a follow-up.Admin role for
/admin/compliance/scan. Currently any authenticated user can trigger a scan. The monitor only touches systems withmonitoring_enabled=True, so the operation is effectively scoped: but a proper admin role / RBAC layer is worthwhile and not in this PR.Frontend toggle + drift events list. A minimal UI was outside the scope of this PR to keep the diff manageable. The backend API is fully usable today via curl; UI follow-up tracked separately.
Alembic introduction. This PR uses
create_all()like every other model in the repo. I would recommend a parallel project-wide effort to introduce Alembic rather than mixing the two patterns here.Compatibility
COMPLIANCE_MONITOR_CRON.APScheduler==3.10.4,tenacity==8.2.3.NotificationandComplianceDriftEventtables are auto-created.monitoring_enabled=True: owners who don't want it can opt out via the PATCH endpoint.