Privacy-preserving pattern extraction from unstructured data at scale.
A production-ready pipeline that summarizes, embeds, and clusters text data to surface actionable insights without exposing PII. Supports conversations, support logs, LLM traces, user feedback, and any unstructured text.
Inspired by Anthropic's CLIO.
- Structured Extraction: Pydantic-based structured outputs with OpenAI and Anthropic APIs
- Multiple Clustering Algorithms: K-means and HDBSCAN with automatic optimization
- Privacy-First: PII stripping and minimum cluster size thresholds
- Hierarchical Reduction: Multi-level cluster aggregation with auto-labeling
- Interactive Visualizations: Bubble grids, hierarchy cards, treemaps, sunbursts, and dashboards
- Checkpoint/Resume: Long-running pipelines can be paused and resumed
- Caching: Embedding and summary caching to reduce API costs
- Multiple Data Sources: JSON, JSONL, CSV, and HuggingFace datasets
Using uv (recommended):
uv pip install -e .

With all optional dependencies:
uv pip install -e ".[all]"

Or with pip:
pip install -e .

Copy .env.example to .env and configure:
# Required for summarization and embeddings
OPENAI_API_KEY=sk-...
# Optional: for Claude-based summarization
ANTHROPIC_API_KEY=sk-ant-...
# Optional: for private HuggingFace datasets
HF_TOKEN=hf_...

import asyncio
from cernis import Pipeline

async def main():
    pipeline = Pipeline()
    results = await pipeline.run("./data/conversations.jsonl")
    # Print cluster hierarchy
    results.print_hierarchy()
    # Export to JSON
    results.export("./output/clusters.json")

asyncio.run(main())

# Run the pipeline
cernis run ./data/conversations.jsonl --output ./output
# With custom settings
cernis run ./data/traces.jsonl \
--clusters 100 \
--model gpt-4o-mini \
--embedder text-embedding-3-small \
--algorithm kmeans
# Use Claude for summarization
cernis run ./data/tickets.jsonl --model claude-sonnet-4-20250514
# Resume from checkpoint
cernis run ./data/large_dataset.jsonl --resume

Input         Summarize     Embed         Cluster       Reduce
  |               |           |              |             |
JSON/CSV      Structured    OpenAI or     K-means or    Hierarchical
JSONL         Extraction    Local         HDBSCAN       Aggregation
HF Datasets   (Pydantic)    Embeddings                  + Labeling

Load data from multiple sources with automatic format detection.
from cernis import Pipeline
pipeline = Pipeline()
# Local files
results = await pipeline.run("./data/tickets.jsonl")
results = await pipeline.run("./data/logs.json")
results = await pipeline.run("./data/feedback.csv")
# HuggingFace datasets
from cernis.ingest import load_hf_dataset
records = load_hf_dataset(
    "lmsys/lmsys-chat-1m",
    content_field="conversation",
    max_records=10000,
)
results = await pipeline.run(records)

Recommended public datasets:
- lmsys/lmsys-chat-1m: 1M real LLM conversations
- allenai/WildChat-1M: 1M ChatGPT conversations
- HuggingFaceH4/ultrachat_200k: 200K dialogues
- tatsu-lab/alpaca: 52K instruction examples
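
For local files, automatic format detection could be as simple as dispatching on the file extension. The sketch below is an illustration only: `load_records` is a hypothetical helper, not part of the cernis API, and the real loader may inspect file contents or handle more formats.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load records from a JSON, JSONL, or CSV file, chosen by file extension.

    Hypothetical helper for illustration; not the cernis implementation.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Accept either a top-level array or a single object.
        return data if isinstance(data, list) else [data]
    if suffix == ".csv":
        # newline="" lets the csv module handle quoted line breaks itself.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported file format: {suffix!r}")
```

Note that CSV values always come back as strings, so numeric fields need explicit conversion before any downstream processing.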
Extract structured information using Pydantic models with OpenAI or Anthropic.
from cernis.summarize import OpenAISummarizer, AnthropicSummarizer
# OpenAI (default)
summarizer = OpenAISummarizer(model="gpt-4o-mini")
# Anthropic
summarizer = AnthropicSummarizer(model="claude-sonnet-4-20250514")

Structured extraction output:
{
  "summary": "User requested password reset...",
  "intent": "Account recovery",
  "topic": "Authentication",
  "outcome": "success",      # success | partial | failure | unknown
  "sentiment": "neutral",    # positive | neutral | negative
  "key_points": ["Password expired", "Reset link sent"]
}

Generate vector embeddings for semantic clustering.
from cernis.embed import OpenAIEmbedder, LocalEmbedder
# OpenAI API
embedder = OpenAIEmbedder(model="text-embedding-3-small")
# Local model (no API required)
embedder = LocalEmbedder(model="all-MiniLM-L6-v2")

Group similar records using clustering algorithms.
from cernis.cluster import KMeansClusterer, HDBSCANClusterer
# K-means (fixed number of clusters)
clusterer = KMeansClusterer(n_clusters=50)
# HDBSCAN (automatic cluster detection)
clusterer = HDBSCANClusterer(min_cluster_size=10)

Aggregate clusters into hierarchical groups with automatic labeling.
from cernis.reduce import HierarchicalReducer
reducer = HierarchicalReducer(
    levels=3,
    min_clusters_per_level=5,
)

from cernis import Pipeline, PipelineConfig
from cernis.core.config import (
    SummarizerConfig,
    EmbedderConfig,
    ClustererConfig,
    PrivacyConfig,
    CacheConfig,
)
config = PipelineConfig(
    summarizer=SummarizerConfig(
        provider="openai",
        model="gpt-4o-mini",
        max_tokens=500,
    ),
    embedder=EmbedderConfig(
        provider="openai",
        model="text-embedding-3-small",
    ),
    clusterer=ClustererConfig(
        algorithm="kmeans",
        n_clusters=50,
    ),
    privacy=PrivacyConfig(
        strip_pii=True,
        min_cluster_size=10,
    ),
    cache=CacheConfig(
        enabled=True,
        cache_dir=".cernis_cache",
    ),
)
pipeline = Pipeline(config=config)

# cernis.yaml
summarizer:
  provider: openai
  model: gpt-4o-mini
  max_tokens: 500
embedder:
  provider: openai
  model: text-embedding-3-small
clusterer:
  algorithm: kmeans
  n_clusters: 50
privacy:
  strip_pii: true
  min_cluster_size: 10
cache:
  enabled: true
  cache_dir: .cernis_cache

Use with CLI:
cernis run ./data/tickets.jsonl --config cernis.yaml

# Bubble grid (CLIO-style percentages)
cernis viz ./output/clusters.json --chart bubble -n 10
# Hierarchical cards
cernis viz ./output/clusters.json --chart hierarchy -n 5
# Full CLIO report (bubble grid + details)
cernis viz ./output/clusters.json --chart clio --title "Support Analysis"
# Dashboard with multiple panels
cernis viz ./output/clusters.json --chart dashboard
# Individual charts
cernis viz ./output/clusters.json --chart bar -n 10
cernis viz ./output/clusters.json --chart treemap
cernis viz ./output/clusters.json --chart sunburst
cernis viz ./output/clusters.json --chart pie -n 8
# Export to HTML file
cernis viz ./output/clusters.json --chart clio -o report.html

from cernis.viz import ClusterVisualizer, plot_clusters
# From pipeline results
viz = ClusterVisualizer(result=results)
# Or from exported JSON
viz = ClusterVisualizer(data_path="./output/clusters.json")
# CLIO-style visualizations
viz.bubble_grid(n=10, title="Top Use Cases")
viz.clio_report(n_bubbles=10, title="Analysis Report")
# Hierarchical visualization
viz.hierarchy_cards(n_groups=5)
viz.to_hierarchy_html("hierarchy.html", n_groups=5)
# Standard charts
viz.bar_chart(n=10)
viz.treemap()
viz.sunburst()
viz.pie_chart(n=8)
viz.dashboard()
# Export all to HTML
viz.to_html("report.html")

from cernis import Pipeline
from cernis.core.config import PipelineConfig, PrivacyConfig
config = PipelineConfig(
    privacy=PrivacyConfig(
        strip_pii=True,        # Remove PII before processing
        min_cluster_size=10,   # Suppress small clusters
    )
)
pipeline = Pipeline(config=config)

PII patterns detected and stripped:
- Email addresses
- Phone numbers (US and international formats)
- Social Security Numbers
- Credit card numbers
- IP addresses
- API keys and tokens
- AWS access keys
- JWTs
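
The two privacy controls above can be sketched with regular expressions and a simple size threshold. This is a minimal illustration, not the cernis implementation: `PII_PATTERNS`, `strip_pii`, and `suppress_small_clusters` are hypothetical names, and the patterns cover only a few of the listed categories in simplified form.

```python
import re
from collections import Counter

# Illustrative patterns only (assumption): a production list would be
# broader and stricter than these simplified expressions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def strip_pii(text):
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def suppress_small_clusters(labels, min_cluster_size=10):
    """Map labels of clusters below the size threshold to None.

    Small clusters are suppressed because a group with very few members
    can make individual records easy to single out.
    """
    counts = Counter(labels)
    return [
        label if counts[label] >= min_cluster_size else None
        for label in labels
    ]


print(strip_pii("Contact jane.doe@example.com from 10.0.0.1"))
print(suppress_small_clusters([0, 0, 0, 1], min_cluster_size=2))
```

Replacing matches with typed placeholders rather than deleting them keeps the sentence structure intact, which tends to help the later summarization step.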
Reduce API costs by caching embeddings and summaries:
from cernis import Pipeline
from cernis.core.config import PipelineConfig, CacheConfig
config = PipelineConfig(
    cache=CacheConfig(
        enabled=True,
        cache_dir=".cernis_cache",
        ttl_seconds=86400 * 7,  # 7 days
    )
)
pipeline = Pipeline(config=config)

Disable caching via CLI:
cernis run ./data/tickets.jsonl --no-cache

For large datasets, the pipeline supports checkpointing:
# Start processing (creates checkpoint on interrupt)
cernis run ./data/large_dataset.jsonl --output ./output
# Resume from checkpoint
cernis run ./data/large_dataset.jsonl --output ./output --resume

The examples/data/ directory contains sample datasets for testing:
| File | Format | Records | Description |
|---|---|---|---|
| support_tickets.jsonl | JSONL | 30 | Support tickets with categories |
| support_tickets.json | JSON | 30 | Same data in JSON array format |
| support_tickets.csv | CSV | 30 | Same data in CSV format |
| llm_traces.jsonl | JSONL | 20 | LLM conversation traces |
| conversations.json | JSON | 10 | Multi-turn conversations |

cernis run <source> [OPTIONS]
  --output, -o      Output directory (default: ./output)
  --config, -c      Path to YAML config file
  --clusters, -n    Number of clusters (default: 50)
  --min-size        Minimum cluster size (default: 10)
  --model, -m       LLM model for summarization (default: gpt-4o-mini)
  --embedder, -e    Embedding model (default: text-embedding-3-small)
  --algorithm, -a   Clustering algorithm: kmeans, hdbscan (default: kmeans)
  --no-cache        Disable caching
  --resume          Resume from checkpoint
  --quiet, -q       Suppress progress output

cernis viz <data> [OPTIONS]
  --output, -o      Output HTML file (opens in browser if not specified)
  --chart, -c       Chart type: bubble, clio, hierarchy, dashboard, bar,
                    treemap, pie, sunburst (default: bubble)
  --title, -t       Chart title
  --num, -n         Number of clusters to display (default: 10)

cernis export <data> [OPTIONS]
  --output, -o      Output file path (default: report.html)
  --format, -f      Export format: html, csv, markdown (default: html)

cernis info <data>
  Show summary statistics for cluster results

cernis version
  Show version information

# Clone repository
git clone https://github.com/cernis/cernis-insights.git
cd cernis-insights
# Install with development dependencies
uv pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=cernis
# Linting
ruff check .
ruff format .

- Python 3.10+
- OpenAI API key (for default summarizer and embedder)
- Optional: Anthropic API key (for Claude summarization)
MIT