The Problem in the Indian FMCG & Wellness Industry
India's wellness and FMCG sector serves over 1.4 billion consumers across thousands of product lines — from Ayurvedic health supplements to packaged food to personal care. Every day, these companies receive thousands of customer complaints through phone calls, emails, web forms, and walk-ins. The current state of complaint handling is broken:
What happens today:
A customer calls to report a damaged product. They wait on hold for 8-12 minutes. A call center agent manually types the complaint into a spreadsheet. The agent guesses the priority. The complaint sits in a queue for hours or days before anyone reviews it.
An email complaint about an allergic reaction to a product — something that needs immediate attention — gets the same priority as a bulk pricing inquiry because no one reads every email in real time.
The same complaint patterns repeat across weeks and months, but no one connects the dots because there is no systematic classification or trend analysis.
The real cost of this broken process:
| Problem | Impact |
| --- | --- |
| Manual classification | Agents misfile 15-25% of complaints: wrong category, wrong priority, wrong team |
| No priority triage | Critical complaints (allergic reactions, contamination) treated the same as low-priority inquiries |
| Phone call overhead | 60-70% of call center time is spent on data entry, not resolution |
| No resolution guidance | Junior agents don't know what resolution steps to take, so they escalate everything |
| Zero real-time visibility | Managers learn about complaint spikes days later, after SLA deadlines have passed |
| Language barriers | Indian consumers speak in mixed Hindi-English; existing systems can't handle code-switched speech |
| No audit trail | When a complaint is disputed, there's no timeline of who did what and when |
What We Set Out to Solve
We asked a simple question: What if every customer complaint — whether spoken on a phone call or typed in a form — could be classified, prioritized, and given a specific resolution plan in under 5 seconds, with zero human intervention?
That question drove every architectural decision in SOLV.ai.
What Our Team Contributed
This project was built from scratch in a hackathon timeframe by a team of four second-year engineering students. Every line of code, every model decision, every system integration was designed, implemented, and tested by us:
| Contribution | What Was Done |
| --- | --- |
| 5 independent microservices | Each with its own API, tests, Docker config, and documentation |
| ONNX-accelerated NLP pipeline | Dual-model ensemble (DistilBERT + MiniLM) running at ~12ms per prediction |
| 10-model LLM ablation study | 120 API calls across 4 scenarios and 3 tasks, with auto-generated comparative graphs |
| Real-time voice agent | End-to-end phone call handling with STT, dialog extraction, classification, resolution, and ticket creation |
| Edge-deployable system | Entire stack runs offline on 4GB RAM, with no cloud dependency required |
| Full-stack web application | Next.js 16 with role-based dashboards, SSE real-time updates, Prisma ORM, and Supabase |
Real-time operational visibility instead of next-day reports
What SOLV.ai Does
SOLV.ai is a production-grade, multi-layer AI system that ingests customer complaints via voice, text, or web — classifies them with ONNX-accelerated NLP, generates empathetic resolution plans with LLMs, and persists everything to a role-based dashboard with SLA tracking.
The GenAI layer takes the NLP classifier's output (category, sentiment, priority) and uses the ablation-study-selected LLM to generate empathetic, actionable resolution recommendations.
Architecture
genai/
|
+-- main.py FastAPI server (4 endpoints)
+-- config.py Environment configuration
+-- llm.py Groq/OpenAI-compatible LLM client + LangSmith tracing
+-- models.py Pydantic schemas (ClassifierOutput, ResolutionResponse)
+-- prompts.py System + user prompt templates
+-- guardrails.py 4-layer security (sanitize, inject-detect, validate, parse)
+-- email_html.py Branded HTML email generator (solv.ai theme)
+-- run_server.py Startup script
+-- requirements.txt
|
+-- comparative_analysis/
+-- config.py 10-model configuration
+-- run_ablation.py 120 test cases across 4 scenarios x 3 tasks
+-- evaluate.py 5-metric auto-evaluation
+-- generate_report.py 8 comparative graphs
+-- test_scenarios.py Complaint test data
+-- prompts.py Task-specific prompts
+-- model_clients.py Multi-provider API clients
+-- graphs/ 8 PNG charts (auto-generated)
+-- results/ Raw JSON results
Processing Pipeline
RECEIVE Classifier output (complaint_id, text, category, sentiment, priority)
|
SANITIZE Strip control chars, escape HTML, truncate (2000 chars)
|
DETECT Scan for prompt injection (13 regex patterns)
|
VALIDATE Verify category/priority/sentiment are in valid ranges
|
BUILD Construct system + user prompt with complaint context
|
LLM CALL Groq Llama 3.3 70B (temp=0.25, max_tokens=2048)
|
PARSE Extract JSON (direct parse -> markdown fence -> first {...} block)
|
VALIDATE Cross-check: required fields, team validity, escalation rules
|
RESPOND Structured ResolutionResponse with SLA + metadata
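The SANITIZE, DETECT, and PARSE stages above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `guardrails.py`: the function names, the two sample injection patterns (the real module is described as having 13), and the choice to truncate before escaping are all assumptions made for the example.

```python
import html
import json
import re

MAX_CHARS = 2000  # truncation limit named in the pipeline
FENCE = "`" * 3   # markdown code fence, built indirectly to keep this block readable

# Two illustrative patterns only (assumed); the real list is not shown in this README.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def sanitize(text: str) -> str:
    """Strip control characters, truncate, then escape HTML."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Truncating before escaping avoids cutting an HTML entity in half.
    return html.escape(cleaned[:MAX_CHARS])


def looks_like_injection(text: str) -> bool:
    """Return True if any known prompt-injection pattern appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def extract_json(raw: str):
    """Try a direct parse, then a markdown fence, then the outermost {...} span."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    fenced = FENCED_JSON.search(raw)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # Simplification: take everything from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        return json.loads(raw[start:end + 1])
    raise ValueError("no JSON object found in LLM output")
```

A caller would typically reject the request when `looks_like_injection` returns True and only pass the sanitized text into the prompt builder.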
Key Features
| Feature | Description |
| --- | --- |
| Empathetic Responses | AI-generated customer messages tailored to complaint severity and sentiment |
| Role-Tagged Steps | Each action step assigned to a specific team (Support, QA, Logistics, Sales) |
Full LLM call tracing — prompts, outputs, latency, token usage
API Endpoints
| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Service health with LLM connectivity check |
| POST | /resolve | Full resolution from classifier output + customer metadata |
| POST | /resolve/quick | Shorthand: accepts raw classifier output |
| POST | /reply/email-html | Generate branded HTML email from resolution data |
SLA Management
| Priority | Response SLA | Resolution SLA | Escalation |
| --- | --- | --- | --- |
| High | 1 hour | 4 hours | Mandatory: immediate, to senior management |
| Medium | 4 hours | 24 hours | If unresolved within 12 hours |
| Low | 24 hours | 72 hours | Only if complaint is repeated |
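The SLA rules above translate directly into deadline arithmetic. The sketch below shows one way to encode them; the names `SlaPolicy`, `sla_deadlines`, and `needs_escalation` are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SlaPolicy:
    response: timedelta    # time allowed for the first response
    resolution: timedelta  # time allowed for full resolution


# Values copied from the SLA Management section.
SLA = {
    "High": SlaPolicy(timedelta(hours=1), timedelta(hours=4)),
    "Medium": SlaPolicy(timedelta(hours=4), timedelta(hours=24)),
    "Low": SlaPolicy(timedelta(hours=24), timedelta(hours=72)),
}


def sla_deadlines(priority: str, created_at: datetime) -> tuple[datetime, datetime]:
    """Return (response_due, resolution_due) for a complaint."""
    policy = SLA[priority]
    return created_at + policy.response, created_at + policy.resolution


def needs_escalation(priority: str, age: timedelta, repeated: bool = False) -> bool:
    """Apply the escalation rule for each priority level."""
    if priority == "High":
        return True                       # mandatory, immediate
    if priority == "Medium":
        return age > timedelta(hours=12)  # unresolved for more than 12 hours
    return repeated                       # Low: only if the complaint is repeated
```

For example, a High-priority complaint created at 09:00 would have a response deadline of 10:00 and a resolution deadline of 13:00 the same day.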
2. NLP Text Classifier (text_classifier/)
Port: 8002 | Stack: FastAPI + ONNX Runtime + CUDA + VADER + scikit-learn
The NLP layer is the first stage of the processing pipeline — classifying raw complaint text into categories, computing sentiment, and predicting priority in ~12ms using ONNX-accelerated inference.
Why 50/50 Ensemble? Zero-shot NLI captures semantic reasoning (does the text logically entail the category?), while similarity captures surface-level pattern matching (is the text similar to known examples?). Both signals are complementary — the ensemble achieves 100% accuracy in our ablation tests.
How classification works:
DistilBERT-MNLI — For each category, constructs a hypothesis ("This text is about Packaging") and measures entailment probability via softmax over logits
MiniLM-L6 — Encodes text to 384-dim embedding, computes cosine similarity to reference embeddings, normalizes to probabilities
Ensemble — 50/50 weighted average of both probability distributions
VADER — Lexicon-based compound sentiment score with negation, intensifier, and punctuation rules
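The ensemble step described above can be illustrated without any model weights. The sketch below assumes the per-category entailment probabilities and the embeddings have already been computed; it uses a softmax to turn cosine similarities into probabilities, which is an assumption, since the README does not say how the similarity scores are normalized. The category labels in the usage example are placeholders.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ensemble(entail_probs, text_vec, reference_vecs, labels):
    """50/50 average of the NLI distribution and the similarity distribution."""
    total = sum(entail_probs)
    p_nli = [p / total for p in entail_probs]  # renormalize across categories
    p_sim = softmax([cosine(text_vec, ref) for ref in reference_vecs])
    combined = [0.5 * a + 0.5 * b for a, b in zip(p_nli, p_sim)]
    best = max(range(len(labels)), key=lambda i: combined[i])
    return labels[best], combined


# Toy 3-dimensional vectors stand in for the 384-dimensional MiniLM embeddings.
label, probs = ensemble(
    entail_probs=[0.9, 0.2, 0.1],
    text_vec=[1.0, 0.0, 0.0],
    reference_vecs=[[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    labels=["Packaging", "Product", "Trade"],
)
```

Because both inputs are probability distributions, their equal-weight average is also a distribution, so the combined scores can be compared directly against a confidence threshold.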
When a complaint is submitted via POST /api/complaints:
VALIDATE Zod schema validation (text, source, product_id, customer_name)
|
TRANSCRIBE If source="call" with audio_base64 -> STT service
|
CLASSIFY NLP service returns category, sentiment, priority
|
RESOLVE GenAI service generates resolution steps + customer response
|
PERSIST Insert into Supabase complaints table with all AI outputs
|
TIMELINE Create initial timeline entry
|
BROADCAST SSE push to all connected dashboard clients
|
ALERT If priority="High" -> broadcast high-priority alert
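The BROADCAST and ALERT steps rely on the Server-Sent Events wire format, where each message is an optional `event:` line, one or more `data:` lines, and a blank line. The project's routes are written in TypeScript; the Python sketch below only illustrates the framing, and the event names `complaint.created` and `alert.high_priority` are hypothetical.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one SSE frame: event name, JSON payload, terminating blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def frames_for(complaint: dict) -> list[str]:
    """Build the frames to push after a complaint has been persisted."""
    frames = [format_sse("complaint.created", complaint)]
    if complaint.get("priority") == "High":
        # Extra frame so dashboards can raise a high-priority alert.
        frames.append(format_sse("alert.high_priority", {"id": complaint["id"]}))
    return frames
```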
Frontend Stack
| Technology | Purpose |
| --- | --- |
| Next.js 16 (App Router) | Server-side rendering, API routes, middleware |
| React 19 | Component framework |
| TypeScript | Type safety |
| Tailwind CSS v4 | Utility-first styling |
| Framer Motion | Page transitions and animations |
| GSAP | Scroll-triggered landing page animations |
| Lenis | Smooth scrolling |
| Recharts | Dashboard charts and analytics |
| Lucide React | Icon library |
| NextAuth v5 | JWT authentication with role-based middleware |
| Prisma | Type-safe ORM for PostgreSQL |
| Zod | Runtime schema validation |
API Endpoints
| Method | Path | Description |
| --- | --- | --- |
| POST | /api/auth/login | JWT authentication |
| POST | /api/auth/register | User registration |
| GET/POST | /api/complaints | List (paginated, filtered) / Create (triggers AI pipeline) |
| GET | /api/complaints/search | Full-text search |
| GET | /api/complaints/[id] | Complaint detail |
| GET | /api/analytics/dashboard | KPI aggregation |
| GET | /api/analytics/trends | Time-series analytics |
| GET | /api/export/complaints | CSV export |
| GET/PATCH | /api/admin/employees | Employee management |
| PATCH | /api/admin/sla-config | SLA configuration |
| GET | /api/sse/complaints | SSE complaint event stream |
| GET | /api/sse/notifications | SSE notification stream |
| POST | /api/webhooks/brevo | Brevo email webhook |
Why This Architecture Was the Best Choice
Microservices Over Monolith — For a Hackathon
At first glance, microservices seem like overkill for a hackathon. Here's why it was the right call:
| Factor | Monolith | Our Microservices | Why We Chose This |
| --- | --- | --- | --- |
| Parallel development | One codebase, merge conflicts | 4 team members, 5 independent repos | Each team member worked on a separate service with zero conflicts |
| Language flexibility | Single runtime | Python for ML/AI, TypeScript for web | Used the right tool for each job: Python for ML inference, TypeScript for React |
| Independent deployment | All or nothing | Each service deploys independently | Failed service doesn't bring down the entire system |
| Offline fallback | Hard to isolate cloud dependencies | Each service has local fallback | Voice agent works fully offline (Ollama + Piper) when internet is unavailable |
| Testing | Integration tests for everything | Each service has isolated unit tests | Easier to validate correctness per-service |
| Demo flexibility | Must demo everything or nothing | Can demo any layer independently | Showed NLP accuracy independently, then voice pipeline, then full integration |
Why ONNX Over Raw PyTorch for NLP
| Factor | PyTorch Eager | ONNX Runtime |
| --- | --- | --- |
| Inference latency | ~35ms | ~12ms (3x faster) |
| GPU memory | ~800MB | ~500MB (37% less) |
| Quantisation | Manual, complex | Built-in INT8/FP16 |
| Deployment | Requires full PyTorch | Standalone runtime |
| Docker image size | ~2GB+ | ~500MB |
Why Groq Over Self-Hosted LLM for GenAI
| Factor | Self-hosted (Ollama) | Groq API |
| --- | --- | --- |
| Latency | 2-4s (CPU) | 1.4s |
| Quality (ablation-tested) | Good (phi3.5 1.5B) | 96.9% (Llama 3.3 70B) |
| RAM requirement | ~1.5GB | 0 (API call) |
| Cost | Free | Free (Groq free tier) |
| Offline support | Yes | No |
Our solution: Use Groq as primary, Ollama as automatic offline fallback. Best of both worlds.
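The primary-plus-fallback arrangement can be sketched as a thin wrapper around two callables. This is an illustration under stated assumptions: `resolve_with_fallback` is a hypothetical name, the two LLM clients are passed in as plain functions, and the set of exceptions treated as "offline" is a guess that a real client library would need to adjust.

```python
from typing import Callable

LlmCall = Callable[[str], str]  # takes a prompt, returns the model's text


def resolve_with_fallback(
    prompt: str,
    primary: LlmCall,
    fallback: LlmCall,
    recoverable: tuple = (ConnectionError, TimeoutError),
) -> tuple[str, str]:
    """Call the primary (cloud) model; on a network-style failure use the local one.

    Returns the generated text and which backend produced it.
    """
    try:
        return primary(prompt), "primary"
    except recoverable:
        # Assumed behavior: any connectivity failure triggers the offline path.
        return fallback(prompt), "fallback"
```

Returning the backend name alongside the text lets the caller record which model produced a given resolution, which matters when the two models differ in quality.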
Why Next.js 16 for the Website
| Factor | Why |
| --- | --- |
| API routes | Backend logic (auth, complaint pipeline, analytics) colocated with frontend, so no separate Express server is needed |
| SSR | Dashboard pages server-rendered for fast initial load |
| Middleware | Role-based route protection at the edge layer |
| React 19 | Latest React with improved performance and streaming |
| Prisma integration | Type-safe database queries generated from schema |
LLM Ablation Study — Model Selection
10 models x 4 complaint scenarios x 3 tasks = 120 total API calls, auto-evaluated on 5 weighted metrics.
Models Evaluated
| # | Model | Provider | Parameters |
| --- | --- | --- | --- |
| 1 | Llama 3.3 70B | Groq | 70B |
| 2 | Qwen 2.5 72B | HuggingFace | 72B |
| 3 | MiniMax M2.5 | OpenRouter | n/a |
| 4 | Qwen 3.5 Plus | OpenRouter | n/a |
| 5 | Gemini 2.5 Flash | Google | n/a |
| 6 | MiniMax M2.7 | OpenRouter | n/a |
| 7 | GLM 5 (Zhipu) | OpenRouter | n/a |
| 8 | GLM 5.1 (Zhipu) | OpenRouter | n/a |
| 9 | MiMo V2 Pro (Xiaomi) | OpenRouter | n/a |
| 10 | MiMo V2 Omni (Xiaomi) | OpenRouter | n/a |
Evaluation Metrics
| Metric | Weight | What It Measures |
| --- | --- | --- |
| Classification Accuracy | 30% | Correct complaint category |
| Priority Accuracy | 25% | Correct urgency level (adjacent = 50% credit) |
| Resolution Quality | 25% | Completeness and actionability of resolution steps |
| Format Compliance | 10% | Valid JSON with all required schema fields |
| Response Quality | 10% | Appropriate length, no refusals, coherent output |
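The weighted scoring above, including the 50% credit for an adjacent priority level, can be written out as follows. This is a sketch and not the project's `evaluate.py`; the metric keys and function names are assumptions, and each per-metric score is assumed to already be a value between 0 and 1.

```python
# Weights copied from the Evaluation Metrics table.
WEIGHTS = {
    "classification": 0.30,
    "priority": 0.25,
    "resolution": 0.25,
    "format": 0.10,
    "response": 0.10,
}
PRIORITY_ORDER = ["Low", "Medium", "High"]


def priority_credit(predicted: str, expected: str) -> float:
    """Full credit for an exact match, half credit for an adjacent level, else zero."""
    distance = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    if distance == 0:
        return 1.0
    return 0.5 if distance == 1 else 0.0


def overall_score(metrics: dict) -> float:
    """Weighted sum of the five per-metric scores (each in the range 0..1)."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


example = overall_score({
    "classification": 1.0,
    "priority": priority_credit("Medium", "High"),  # adjacent level, so 0.5
    "resolution": 1.0,
    "format": 1.0,
    "response": 1.0,
})
```

In this example the only deduction is half of the 25% priority weight, so the overall score works out to 0.30 + 0.125 + 0.25 + 0.10 + 0.10 = 0.875.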
Results
| Rank | Model | Overall Score | Avg Latency | Category Acc. | Priority Acc. |
| --- | --- | --- | --- | --- | --- |
| 1 | Llama 3.3 70B (Groq) | 96.9% | 1.4s | 100% | 88% |
| 2 | Qwen 2.5 72B (HuggingFace) | 96.9% | 11.6s | 100% | 88% |
| 3 | MiniMax M2.5 | 96.0% | 13.5s | 100% | 88% |
| 4 | Qwen 3.5 Plus | 92.3% | 51.7s | 92% | 79% |
| 5 | Gemini 2.5 Flash | 89.9% | 7.4s | 91% | 82% |
| 6 | MiniMax M2.7 | 89.3% | 15.1s | 83% | 88% |
| 7-10 | GLM 5/5.1, MiMo V2 Pro/Omni | 32-49% | 13-19s | 33-58% | 38-62% |
Why Llama 3.3 70B on Groq?
Tied for highest accuracy (96.9%) with Qwen 2.5 72B
8x faster than the runner-up (1.4s vs 11.6s)
100% category accuracy — zero misclassification
100% format compliance — always valid JSON
Free tier — no API costs for production use
Consistent latency — no cold starts or spikes
Visual Results
[Figures: Overall Score Comparison; Multi-Metric Radar; Response Latency; Per-Task Breakdown; Quality vs Speed (Efficiency Frontier); Detailed Metric Heatmap; Token Usage; Scenario Accuracy]
How to Reproduce
cd genai
export OPENCODE_API_KEY=your_key_here
# 1. Test API connectivity
python -m comparative_analysis.test_api_keys
# 2. Run the ablation study (120 API calls, ~10-15 min)
python -m comparative_analysis.run_ablation
# 3. Generate 8 graphs and markdown report
python -m comparative_analysis.generate_report
End-to-End Data Flow
Voice Call Path
User speaks -> Twilio Media Stream -> Orchestrator (mu-law -> PCM16)
-> STT Service (Whisper + VAD) -> transcribed text
-> Dialog Agent (LLM extraction) -> structured complaint
-> Classifier Service (ONNX ensemble) -> category + sentiment + priority
-> Resolve Agent (Groq Llama 3.3 70B) -> resolution steps
-> Backend API -> persisted to PostgreSQL
-> TTS (Piper/Edge) -> confirmation spoken to user
-> SSE broadcast -> dashboard updates in real-time
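The first hop of the voice path converts mu-law audio to 16-bit PCM before speech-to-text. The sketch below shows the standard G.711 mu-law expansion in pure Python; it is not the orchestrator's actual code, and the function names are hypothetical. It assumes 8-bit mu-law bytes as input, which is the format telephony streams commonly use.

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit linear sample."""
    byte = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    # 0x84 (132) is the bias added during encoding; remove it after scaling.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Convert a mu-law payload to little-endian PCM16 bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so a 20 ms frame of 160 mu-law bytes at 8 kHz expands to 320 bytes of PCM16.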
Web Form Path
User submits form -> Next.js API Route (/api/complaints)
-> (optional) STT if audio attached
-> NLP Classifier -> category + sentiment + priority
-> GenAI Resolution Engine -> resolution steps + customer response
-> Supabase/PostgreSQL -> persisted with timeline entry
-> SSE broadcast -> all connected dashboards receive update
-> High-priority alert -> pushed to operations dashboard
Dashboard View Path
Browser -> Next.js middleware (role check) -> role-specific dashboard page
-> API call to /api/complaints (filtered by role)
-> Prisma query -> Supabase/PostgreSQL
-> JSON response -> React components render
-> SSE connection -> real-time updates without page refresh
Performance Benchmarks
| Operation | Latency | Notes |
| --- | --- | --- |
| NLP classification (single) | ~12ms | ONNX + CUDA, dual-model ensemble |
| GenAI resolution | ~1.4s | Groq Llama 3.3 70B |
| STT transcription (GPU) | 300-500ms | Faster-Whisper Tiny, INT8+FP16 |
| STT transcription (CPU) | 1-2s | CPU fallback |
| Voice agent (end-to-end, online) | 2-3s/turn | Groq + Edge TTS |
| Voice agent (end-to-end, offline) | 4-6s/turn | Ollama + Piper TTS |
| Dashboard API response | <100ms | Prisma + Supabase |
| SSE event propagation | <50ms | Server-Sent Events |
Resource Requirements
| Deployment | RAM | Disk | GPU |
| --- | --- | --- | --- |
| Full stack (online) | ~2.8GB | ~1.3GB | Optional |
| Full stack (offline) | ~4.3GB | ~2.3GB | Recommended |
| Website only | ~500MB | ~200MB | None |
Per-Service Breakdown
| Component | RAM | Disk | Offline? |
| --- | --- | --- | --- |
| STT (Whisper-tiny) | ~300MB | ~50MB | Yes |
| NLP Classifier (ONNX) | ~200MB | ~200MB | Yes |
| GenAI (API client) | ~100MB | ~10MB | Needs API |
| Orchestrator | ~100MB | ~5MB | Yes |
| Ollama (phi3.5) | ~1.5GB | ~1GB | Yes |
| Piper TTS | ~100MB | ~30MB | Yes |
| Website (Next.js) | ~500MB | ~200MB | Partial |
Cost of Scaling & Real-World Implementation
Deployment Cost Breakdown
Small Scale (Startup / Pilot — 100 complaints/day)
# Cloud mode (with Twilio telephony)
cd voice-agent && cp .env.example .env && docker-compose up -d
# Edge mode (fully offline, no telephony)
docker-compose -f docker-compose.edge.yml up -d
Test the Pipeline
# Test NLP classifier directly
curl -X POST http://localhost:8002/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "My Parle-G biscuit packet was torn and the biscuits were all broken"}'
# Test end-to-end via voice agent
curl -X POST http://localhost:8003/test/pipeline \
  -H "Content-Type: application/json" \
  -d '{"text": "Box was broken during delivery"}'
# Test website complaint API
curl -X POST http://localhost:3000/api/complaints \
  -H "Content-Type: application/json" \
  -d '{ "text": "Product expired before printed date", "source": "email", "product_id": "pw-col-001", "customer_name": "Test User", "customer_email": "test@example.com" }'