cernis-insights

Privacy-preserving pattern extraction from unstructured data at scale.

A production-ready pipeline that summarizes, embeds, and clusters text data to surface actionable insights without exposing PII. Supports conversations, support logs, LLM traces, user feedback, and any unstructured text.

Inspired by Anthropic's CLIO.

Features

  • Structured Extraction: Pydantic-based structured outputs with OpenAI and Anthropic APIs
  • Multiple Clustering Algorithms: K-means and HDBSCAN with automatic optimization
  • Privacy-First: PII stripping and minimum cluster size thresholds
  • Hierarchical Reduction: Multi-level cluster aggregation with auto-labeling
  • Interactive Visualizations: Bubble grids, hierarchy cards, treemaps, sunbursts, and dashboards
  • Checkpoint/Resume: Long-running pipelines can be paused and resumed
  • Caching: Embedding and summary caching to reduce API costs
  • Multiple Data Sources: JSON, JSONL, CSV, and HuggingFace datasets

Installation

Using uv (recommended):

uv pip install -e .

With all optional dependencies:

uv pip install -e ".[all]"

Or with pip:

pip install -e .

Environment Variables

Copy .env.example to .env and configure:

# Required for the default summarizer and embedder (both use OpenAI)
OPENAI_API_KEY=sk-...

# Optional: for Claude-based summarization
ANTHROPIC_API_KEY=sk-ant-...

# Optional: for private HuggingFace datasets
HF_TOKEN=hf_...

Quick Start

Python API

import asyncio
from cernis import Pipeline

async def main():
    pipeline = Pipeline()
    results = await pipeline.run("./data/conversations.jsonl")

    # Print cluster hierarchy
    results.print_hierarchy()

    # Export to JSON
    results.export("./output/clusters.json")

asyncio.run(main())

CLI

# Run the pipeline
cernis run ./data/conversations.jsonl --output ./output

# With custom settings
cernis run ./data/traces.jsonl \
    --clusters 100 \
    --model gpt-4o-mini \
    --embedder text-embedding-3-small \
    --algorithm kmeans

# Use Claude for summarization
cernis run ./data/tickets.jsonl --model claude-sonnet-4-20250514

# Resume from checkpoint
cernis run ./data/large_dataset.jsonl --resume

Pipeline Architecture

Input          Summarize       Embed          Cluster        Reduce
  |                |              |              |              |
JSON/CSV       Structured     OpenAI or      K-means or     Hierarchical
JSONL          Extraction     Local          HDBSCAN        Aggregation
HF Datasets    (Pydantic)     Embeddings                    + Labeling

Stage 1: Ingest

Load data from multiple sources with automatic format detection.

import asyncio

from cernis import Pipeline
from cernis.ingest import load_hf_dataset

async def main():
    pipeline = Pipeline()

    # Local files (the format is detected automatically)
    results = await pipeline.run("./data/tickets.jsonl")
    results = await pipeline.run("./data/logs.json")
    results = await pipeline.run("./data/feedback.csv")

    # HuggingFace datasets
    records = load_hf_dataset(
        "lmsys/lmsys-chat-1m",
        content_field="conversation",
        max_records=10000,
    )
    results = await pipeline.run(records)

asyncio.run(main())

Recommended public datasets:

  • lmsys/lmsys-chat-1m - 1M real LLM conversations
  • allenai/WildChat-1M - 1M ChatGPT conversations
  • HuggingFaceH4/ultrachat_200k - 200K dialogues
  • tatsu-lab/alpaca - 52K instruction examples
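
To make "automatic format detection" concrete, here is a minimal sketch of one way a loader could pick a parser from the file extension. It uses only the standard library; `load_records` is a hypothetical helper for illustration, not part of the cernis API, and the real ingest code may detect formats differently.

```python
import csv
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load records from a .json, .jsonl, or .csv file, chosen by extension."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with p.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        # A single JSON array of objects.
        with p.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError(f"{path}: expected a JSON array of records")
        return data
    if suffix == ".csv":
        # The header row supplies field names; all values are strings.
        with p.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported file extension: {suffix!r}")
```

Note that CSV values arrive as strings, so numeric fields need explicit conversion downstream.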

Stage 2: Summarize

Extract structured information using Pydantic models with OpenAI or Anthropic.

from cernis.summarize import OpenAISummarizer, AnthropicSummarizer

# OpenAI (default)
summarizer = OpenAISummarizer(model="gpt-4o-mini")

# Anthropic
summarizer = AnthropicSummarizer(model="claude-sonnet-4-20250514")

Structured extraction output:

{
    "summary": "User requested password reset...",
    "intent": "Account recovery",
    "topic": "Authentication",
    "outcome": "success",  # success | partial | failure | unknown
    "sentiment": "neutral",  # positive | neutral | negative
    "key_points": ["Password expired", "Reset link sent"]
}
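
The shape above can be expressed as a validated record type. The sketch below uses a standard-library dataclass as a stand-in for Pydantic so it runs without extra dependencies. The class name `ExtractedRecord` and the default values are assumptions for illustration; the model that cernis actually defines may differ.

```python
from dataclasses import dataclass, field

# Allowed values, taken from the comments in the example output above.
OUTCOMES = {"success", "partial", "failure", "unknown"}
SENTIMENTS = {"positive", "neutral", "negative"}


@dataclass
class ExtractedRecord:
    summary: str
    intent: str
    topic: str
    outcome: str = "unknown"      # assumed default
    sentiment: str = "neutral"    # assumed default
    key_points: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the documented enumerations.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"invalid outcome: {self.outcome!r}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"invalid sentiment: {self.sentiment!r}")
```

Constraining `outcome` and `sentiment` to fixed values keeps downstream aggregation simple, because free-form LLM output never reaches the clustering stage unvalidated.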

Stage 3: Embed

Generate vector embeddings for semantic clustering.

from cernis.embed import OpenAIEmbedder, LocalEmbedder

# OpenAI API
embedder = OpenAIEmbedder(model="text-embedding-3-small")

# Local model (no API required)
embedder = LocalEmbedder(model="all-MiniLM-L6-v2")
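
Embeddings matter for clustering because texts with similar meaning map to nearby vectors. As a rough illustration (not cernis code), cosine similarity is a common way to compare two embedding vectors. The function below is a plain-Python sketch; real embeddings have hundreds to thousands of dimensions rather than the short vectors you would use to try it out.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.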

Stage 4: Cluster

Group similar records using clustering algorithms.

from cernis.cluster import KMeansClusterer, HDBSCANClusterer

# K-means (fixed number of clusters)
clusterer = KMeansClusterer(n_clusters=50)

# HDBSCAN (automatic cluster detection)
clusterer = HDBSCANClusterer(min_cluster_size=10)
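
For readers unfamiliar with K-means, the following is a small pure-Python sketch of Lloyd's algorithm on toy points. It is meant only to show the assign-then-update loop; it is not how `KMeansClusterer` is implemented, and the function names here are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels
```

K-means needs the number of clusters up front, which is why `n_clusters` is required above. HDBSCAN instead infers the number of clusters from density and only needs a minimum cluster size.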

Stage 5: Reduce

Aggregate clusters into hierarchical groups with automatic labeling.

from cernis.reduce import HierarchicalReducer

reducer = HierarchicalReducer(
    levels=3,
    min_clusters_per_level=5
)
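
One level of hierarchical aggregation can be pictured as grouping leaf clusters under a parent and summing their sizes. The sketch below assumes each leaf already carries a `parent` label; in the pipeline, labels are generated automatically, so the `parent` field and the `roll_up` helper are assumptions made for illustration.

```python
from collections import defaultdict


def roll_up(leaves: list[dict]) -> list[dict]:
    """Group leaf clusters by their 'parent' label and sum their sizes."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for leaf in leaves:
        groups[leaf["parent"]].append(leaf)
    parents = [
        {
            "label": label,
            "size": sum(child["size"] for child in children),
            "children": sorted(children, key=lambda c: c["size"], reverse=True),
        }
        for label, children in groups.items()
    ]
    # Largest parent groups first.
    return sorted(parents, key=lambda p: p["size"], reverse=True)
```

Applying the same step to the parent groups yields the next level up, which is roughly what a `levels=3` setting suggests.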

Configuration

Python API

from cernis import Pipeline, PipelineConfig
from cernis.core.config import (
    SummarizerConfig,
    EmbedderConfig,
    ClustererConfig,
    PrivacyConfig,
    CacheConfig,
)

config = PipelineConfig(
    summarizer=SummarizerConfig(
        provider="openai",
        model="gpt-4o-mini",
        max_tokens=500,
    ),
    embedder=EmbedderConfig(
        provider="openai",
        model="text-embedding-3-small",
    ),
    clusterer=ClustererConfig(
        algorithm="kmeans",
        n_clusters=50,
    ),
    privacy=PrivacyConfig(
        strip_pii=True,
        min_cluster_size=10,
    ),
    cache=CacheConfig(
        enabled=True,
        cache_dir=".cernis_cache",
    ),
)

pipeline = Pipeline(config=config)

YAML Configuration

# cernis.yaml
summarizer:
  provider: openai
  model: gpt-4o-mini
  max_tokens: 500

embedder:
  provider: openai
  model: text-embedding-3-small

clusterer:
  algorithm: kmeans
  n_clusters: 50

privacy:
  strip_pii: true
  min_cluster_size: 10

cache:
  enabled: true
  cache_dir: .cernis_cache

Use with CLI:

cernis run ./data/tickets.jsonl --config cernis.yaml
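
A partial config file is usually combined with defaults by a recursive merge. The sketch below shows that idea with plain dictionaries. Whether cernis fills omitted keys this way is an assumption, and `merge_config` is a hypothetical helper; the default values are the ones listed in this README.

```python
import copy

DEFAULTS = {
    "summarizer": {"provider": "openai", "model": "gpt-4o-mini", "max_tokens": 500},
    "embedder": {"provider": "openai", "model": "text-embedding-3-small"},
    "clusterer": {"algorithm": "kmeans", "n_clusters": 50},
    "privacy": {"strip_pii": True, "min_cluster_size": 10},
    "cache": {"enabled": True, "cache_dir": ".cernis_cache"},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, merging nested sections."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With this approach, a file that sets only `clusterer.algorithm` leaves `n_clusters` and every other section at its default.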

Visualization

CLI Commands

# Bubble grid (CLIO-style percentages)
cernis viz ./output/clusters.json --chart bubble -n 10

# Hierarchical cards
cernis viz ./output/clusters.json --chart hierarchy -n 5

# Full CLIO report (bubble grid + details)
cernis viz ./output/clusters.json --chart clio --title "Support Analysis"

# Dashboard with multiple panels
cernis viz ./output/clusters.json --chart dashboard

# Individual charts
cernis viz ./output/clusters.json --chart bar -n 10
cernis viz ./output/clusters.json --chart treemap
cernis viz ./output/clusters.json --chart sunburst
cernis viz ./output/clusters.json --chart pie -n 8

# Export to HTML file
cernis viz ./output/clusters.json --chart clio -o report.html

Python API

from cernis.viz import ClusterVisualizer, plot_clusters

# From pipeline results
viz = ClusterVisualizer(result=results)

# Or from exported JSON
viz = ClusterVisualizer(data_path="./output/clusters.json")

# CLIO-style visualizations
viz.bubble_grid(n=10, title="Top Use Cases")
viz.clio_report(n_bubbles=10, title="Analysis Report")

# Hierarchical visualization
viz.hierarchy_cards(n_groups=5)
viz.to_hierarchy_html("hierarchy.html", n_groups=5)

# Standard charts
viz.bar_chart(n=10)
viz.treemap()
viz.sunburst()
viz.pie_chart(n=8)
viz.dashboard()

# Export all to HTML
viz.to_html("report.html")
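
The percentages shown in a bubble grid are each cluster's share of all records. A minimal sketch of that calculation follows. `top_shares` is a hypothetical helper, and computing shares against the total across all clusters (rather than only the displayed ones) is an assumption.

```python
def top_shares(cluster_sizes: dict[str, int], n: int = 10) -> list[tuple[str, float]]:
    """Return the n largest clusters as (label, percent of all records)."""
    total = sum(cluster_sizes.values())
    if total == 0:
        return []
    ranked = sorted(cluster_sizes.items(), key=lambda item: item[1], reverse=True)
    return [(label, round(100 * size / total, 1)) for label, size in ranked[:n]]
```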

Privacy Features

from cernis import Pipeline
from cernis.core.config import PipelineConfig, PrivacyConfig

config = PipelineConfig(
    privacy=PrivacyConfig(
        strip_pii=True,        # Remove PII before processing
        min_cluster_size=10,   # Suppress small clusters
    )
)

pipeline = Pipeline(config=config)

PII patterns detected and stripped:

  • Email addresses
  • Phone numbers (US and international formats)
  • Social Security Numbers
  • Credit card numbers
  • IP addresses
  • API keys and tokens
  • AWS access keys
  • JWTs
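
As an illustration of the two safeguards, the sketch below redacts a few of the listed PII types with regular expressions and drops clusters below a size threshold. The patterns are simplified and are not the ones cernis ships; `strip_pii` and `suppress_small_clusters` are hypothetical function names, and whether small clusters are dropped or merged in the real pipeline is an assumption.

```python
import re

# Simplified patterns for illustration; production rules need to be stricter.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("AWS_KEY", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
]


def strip_pii(text: str) -> str:
    """Replace each match with a placeholder naming the PII type."""
    for name, pattern in PATTERNS:
        text = pattern.sub(f"[{name}]", text)
    return text


def suppress_small_clusters(clusters: dict[str, int], min_cluster_size: int = 10) -> dict[str, int]:
    """Keep only clusters with at least min_cluster_size records."""
    return {label: size for label, size in clusters.items() if size >= min_cluster_size}
```

The size threshold complements redaction: even with PII removed, a cluster built from a handful of records can describe an identifiable individual, so small clusters are withheld from the output.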

Caching

Reduce API costs by caching embeddings and summaries:

from cernis import Pipeline
from cernis.core.config import PipelineConfig, CacheConfig

config = PipelineConfig(
    cache=CacheConfig(
        enabled=True,
        cache_dir=".cernis_cache",
        ttl_seconds=86400 * 7,  # 7 days
    )
)

pipeline = Pipeline(config=config)

Disable caching via CLI:

cernis run ./data/tickets.jsonl --no-cache
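
A cache like this typically keys each entry on a hash of the model name and input text, and treats entries older than the TTL as misses. The following is a standard-library sketch of that pattern. The `DiskCache` class and its file layout are assumptions for illustration and do not describe the cernis cache format.

```python
import hashlib
import json
import time
from pathlib import Path


class DiskCache:
    """Store JSON-serializable values on disk, keyed by model and input text."""

    def __init__(self, cache_dir: str, ttl_seconds: float) -> None:
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, model: str, text: str) -> Path:
        # Include the model in the key so switching models never reuses stale values.
        digest = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
        return self.dir / f"{digest}.json"

    def get(self, model: str, text: str):
        path = self._path(model, text)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["created_at"] > self.ttl_seconds:
            path.unlink()  # expired: remove and report a miss
            return None
        return entry["value"]

    def set(self, model: str, text: str, value) -> None:
        entry = {"created_at": time.time(), "value": value}
        self._path(model, text).write_text(json.dumps(entry), encoding="utf-8")
```

A caller checks `get` before making an API request and calls `set` with the response, so repeated runs over the same data avoid paying for the same embedding or summary twice.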

Checkpoint and Resume

For large datasets, the pipeline supports checkpointing:

# Start processing (creates checkpoint on interrupt)
cernis run ./data/large_dataset.jsonl --output ./output

# Resume from checkpoint
cernis run ./data/large_dataset.jsonl --output ./output --resume
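
The checkpoint-on-interrupt behavior can be sketched as a loop that records how far it got and saves that state whenever it exits, whether normally or through an exception such as Ctrl+C. The function name and checkpoint file format below are assumptions for illustration, not the cernis implementation.

```python
import json
from pathlib import Path


def process_with_checkpoint(records, process, checkpoint_path, resume=False):
    """Apply process() to each record, saving progress so a later run can resume."""
    path = Path(checkpoint_path)
    state = {"next_index": 0, "results": []}
    if resume and path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
    try:
        for i in range(state["next_index"], len(records)):
            state["results"].append(process(records[i]))
            state["next_index"] = i + 1
    finally:
        # Runs on success, on errors, and on KeyboardInterrupt.
        path.write_text(json.dumps(state), encoding="utf-8")
    return state["results"]
```

On resume, records before `next_index` are skipped and their stored results are reused, so only unfinished work is repeated.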

Example Datasets

The examples/data/ directory contains sample datasets for testing:

File                   Format  Records  Description
support_tickets.jsonl  JSONL   30       Support tickets with categories
support_tickets.json   JSON    30       Same data in JSON array format
support_tickets.csv    CSV     30       Same data in CSV format
llm_traces.jsonl       JSONL   20       LLM conversation traces
conversations.json     JSON    10       Multi-turn conversations

CLI Reference

cernis run <source> [OPTIONS]
    --output, -o       Output directory (default: ./output)
    --config, -c       Path to YAML config file
    --clusters, -n     Number of clusters (default: 50)
    --min-size         Minimum cluster size (default: 10)
    --model, -m        LLM model for summarization (default: gpt-4o-mini)
    --embedder, -e     Embedding model (default: text-embedding-3-small)
    --algorithm, -a    Clustering algorithm: kmeans, hdbscan (default: kmeans)
    --no-cache         Disable caching
    --resume           Resume from checkpoint
    --quiet, -q        Suppress progress output

cernis viz <data> [OPTIONS]
    --output, -o       Output HTML file (opens in browser if not specified)
    --chart, -c        Chart type: bubble, clio, hierarchy, dashboard, bar,
                       treemap, pie, sunburst (default: bubble)
    --title, -t        Chart title
    --num, -n          Number of clusters to display (default: 10)

cernis export <data> [OPTIONS]
    --output, -o       Output file path (default: report.html)
    --format, -f       Export format: html, csv, markdown (default: html)

cernis info <data>
    Show summary statistics for cluster results

cernis version
    Show version information

Development

# Clone repository
git clone https://github.com/cernis-intelligence/cernis-insights.git
cd cernis-insights

# Install with development dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=cernis

# Linting
ruff check .
ruff format .

Requirements

  • Python 3.10+
  • OpenAI API key (for default summarizer and embedder)
  • Optional: Anthropic API key (for Claude summarization)

License

MIT
