An interactive platform for visualizing Large Language Model inference in real time, with token-level probability analysis and automated content moderation via SafeNudge.
Technologies: FastAPI · PyTorch · Transformers (Hugging Face) · React · D3.js · Vite · Tailwind CSS · Docker · Kubernetes
Safe-Lens streams LLM output token by token and exposes the model's internal probability distribution at each step. Users can inspect alternatives, regenerate from any point in the sequence, and optionally activate SafeNudge — a learned guardrail that detects and redirects unsafe generation in real time.
Key features:
- Real-time token streaming: token-by-token generation with per-token probability distributions
- Interactive editing: click any generated token to view its top-k alternatives; select one to regenerate from that point
- Uncertainty visualization: token background colour encodes model confidence
- SafeNudge guardrail: MLP classifier on hidden states detects unsafe continuations and injects a redirection prompt
- Prefix support: optionally seed the assistant turn with a target prefix before generation begins
- Prometheus metrics:
/metricsendpoint for production monitoring
Two independent Kubernetes deployments behind an HAProxy ingress with SSL termination:
| Component | Image | Notes |
|---|---|---|
| Backend | ghcr.io/dataresponsibly/safe-lens-backend:latest |
Requires 1 GPU |
| Frontend | ghcr.io/dataresponsibly/safe-lens-frontend:latest |
Lightweight, no GPU |
GPU affinity targets: Quadro RTX 8000, NVIDIA A100 (40 GB or 80 GB PCIe), or NVIDIA RTX A6000.
Backend resource allocation:
- Requests: 4 CPU / 16 Gi RAM
- Limits: 8 CPU / 24 Gi RAM / 1 GPU
The startup probe allows up to 20 minutes for model loading before liveness/readiness checks begin.
- Hugging Face token with access to
meta-llama/Llama-3.2-1B-Instruct - kubectl configured against the target cluster
# Create the HuggingFace token secret (once)
kubectl create secret generic hf-token --from-literal=token=YOUR_TOKEN
# Deploy all services (backend, frontend, ingress, monitoring)
kubectl apply -f k8s/
# Check status
kubectl get pods
kubectl get svc# Forward the frontend to http://localhost:8080
kubectl port-forward deployment/frontend 8080:80
# Or forward the backend API directly to http://localhost:8000
kubectl port-forward deployment/backend 8000:8000The ingress in k8s/ingress.yaml controls public visibility via the hpc.nyu.edu/access annotation:
# Public (accessible outside the cluster)
annotations:
hpc.nyu.edu/access: "public"
# Private (remove the annotation or set to "private")
annotations: {}After editing the annotation, apply with replace (not apply) to force the ingress controller to pick up the change:
kubectl replace -f k8s/ingress.yamlProduction URL: https://safe-lens.users.hsrn.nyu.edu
All endpoints accept application/x-www-form-urlencoded (form-encoded) POST bodies. Responses are newline-delimited JSON.
Health check. Returns a list of available API routes.
Stream a token-by-token generation response.
| Parameter | Type | Default | Description |
|---|---|---|---|
init_prompt |
string | required | User prompt (1–10 000 chars) |
k |
int | 30 | Top-k sampling breadth (1–50) |
T |
float | 1.0 | Sampling temperature (>0, ≤10) |
tau |
float | 0.5 | SafeNudge intervention threshold (0–1) |
max_new_tokens |
int | 100 | Max tokens to generate (1–4096) |
sleep_time |
float | 0.0 | Per-token delay in seconds (0–2) |
random_state |
int | null | RNG seed for reproducibility |
safenudge |
bool | false | Enable SafeNudge moderation |
target |
string | null | Assistant turn prefix |
Each streamed object:
{
"idx_counter": 4,
"texts": ["Hello", " Hi", " Hey", ...],
"token_ids": [9906, 13347, 19182, ...],
"probs": [0.42, 0.31, 0.12, ...],
"selected_idx": 0,
"selected_text": "Hello"
}When SafeNudge fires, a sentinel object is emitted before generation resumes:
{ "idx_counter": -1, "texts": [], "token_ids": [], "probs": [], "selected_idx": -1, "selected_text": "[NUDGE]" }Regenerate from a chosen alternative token at a given position.
| Parameter | Type | Description |
|---|---|---|
init_prompt |
string | Original prompt |
content |
string | JSON-serialised token history from the client |
token_pos |
int | Index of the token to branch from |
new_token |
string | Alternative token text to substitute |
k, T, max_new_tokens, sleep_time, random_state |
— | Same as /generate |
Prometheus-compatible metrics endpoint.
SafeNudge is the cognitive guardrail that moderates unsafe LLM outputs without blocking generation entirely.
How it works:
- Generate tokens normally for the first 4 positions (warmup).
- From position 5 onward, extract the last hidden state from the model at each step.
- Run the hidden state through a pre-trained MLP classifier (
api/artifacts/clf_mlp_llama_1b.pkl). - If
P(unsafe) ≥ tau, append a redirection prompt to the full context (including the unsafe tokens generated so far) and continue:
"Sorry, I was going to generate an unsafe response. Instead, let me correct that and make sure the response is very safe, ethical and cannot be used to harm society. Here is an alternate, safe answer to your question:"
- Generation continues from the redirected context. No tokens are discarded.
Implementation notes:
- The classifier uses scikit-learn's
predict_probaon a flattened hidden-state vector. past_key_valuesis reset after a nudge to force a full KV-cache reprefill with the new context.- WildGuard and Qwen3Guard integrations exist in
api/but are disabled in production.
# Format with Black
make code-format
# Lint (black check, mypy, flake8, pylint)
make code-analysis
# Clean build artifacts and caches
make cleanPlaywright-based stress test:
pip install playwright
playwright install chromium
# Full stress test against the deployed instance
python tests/stress_test_playwright.py \
--url "https://safe-lens.users.hsrn.nyu.edu/" \
--users 20 \
--iterations 5 \
--timeout-ms 120000
# Quick sanity check
python tests/stress_test_playwright.py --users 5 --iterations 2safe-lens/
├── api/ # FastAPI backend
│ ├── api.py # Routes: /generate, /regenerate, /metrics
│ ├── safenudge.py # ModelWrapper + SafeNudge guardrail
│ ├── wildguard_safenudge.py # WildGuard integration (disabled)
│ ├── qwenqguard_safenudge.py # Qwen3Guard integration (disabled)
│ ├── _loader.py # Model loading (AutoModelForCausalLM)
│ ├── _output_handler.py # Token streaming and CUDA memory management
│ ├── metrics.py # Prometheus metrics router
│ └── artifacts/
│ └── clf_mlp_llama_1b.pkl # Pre-trained SafeNudge MLP classifier
├── client/ # React frontend (Vite + Tailwind + D3.js)
│ ├── src/
│ │ ├── App.jsx # Application state and layout
│ │ ├── components/
│ │ │ ├── ChatInput.jsx
│ │ │ ├── ChatOutput.jsx # Token rendering with uncertainty colours
│ │ │ ├── Barplot.jsx # D3.js probability bar chart
│ │ │ └── SettingsPanel.jsx
│ │ └── api/stream.js # Streaming fetch helpers
│ └── index.html
├── dev/ # Local development stack
│ ├── docker-compose.yml
│ ├── Dockerfile.backend # CPU-only PyTorch image
│ ├── Dockerfile.frontend # Node/Vite image
│ └── nginx.conf
├── k8s/ # Kubernetes manifests
│ ├── backend.yaml # GPU deployment + ClusterIP service
│ ├── frontend.yaml # Frontend deployment + service
│ ├── ingress.yaml # HAProxy ingress with SSL
│ ├── monitoring-prometheus-*.yaml
│ └── monitoring-grafana.yaml
├── examples/ # Standalone usage scripts
├── tests/
│ └── stress_test_playwright.py
├── Dockerfile.backend # Production GPU image
├── Dockerfile.frontend # Production frontend image
├── Makefile
├── requirements.txt
└── tox.ini
Development — create api/.env:
HF_TOKEN=your_token_here
Production — Kubernetes secret:
kubectl create secret generic hf-token --from-literal=token=YOUR_TOKENPyTorch is installed in production via the CUDA 12.1 index:
--index-url https://download.pytorch.org/whl/cu121