Skip to content

aakarsh31/Counsel-AI

Repository files navigation

⚖️ CounselAI

A production-ready RAG (Retrieval-Augmented Generation) legal chatbot that ingests official US legal statutes and guides across 5 states, stores embeddings in Pinecone, and answers legal questions through a Flask web app — with jurisdiction-aware retrieval, document citations, conversation memory, hybrid search, cross-encoder reranking, guardrails, LangSmith observability, and RAGAS-validated quality — deployed on AWS EC2 via Docker with full CI/CD through GitHub Actions.


🎥 Demo

🔗 Live Demo: counselai.up.railway.app


📌 Table of Contents


Project Overview

CounselAI lets users ask questions about US law — tenant rights, employment law, and criminal procedure — and get answers grounded exclusively in official government statutes and legal guides, not GPT-4o's training data. The system detects which state the user is asking about and filters retrieval to jurisdiction-specific documents, ensuring answers are legally relevant to the right state.

Why RAG here? Foundation models know general legal concepts but are unreliable on jurisdiction-specific statutes, exact procedural rules, and state-level variations. A landlord's obligations in New York differ materially from Texas. RAG grounds every answer in the actual document text, with inline citations so users can verify the source.

Eight major features shipped over the baseline:

  1. Base RAG Pipeline — PDF ingestion, chunking, embedding, Pinecone vector store, GPT-4o generation
  2. Hybrid Retrieval + Reranking — BM25 + dense ensemble retriever with cross-encoder reranking, built with explicit LCEL pipeline
  3. RAGAS Evaluation — quantitative proof of improvement across 13 handcrafted legal test questions
  4. SSE Streaming — server-sent events reducing TTFB by 38% (4.07s → 2.50s)
  5. Production Hardening — structured logging, error handling, Redis session persistence, Flask-Limiter rate limiting, full modularization
  6. Query Guardrails — GPT-4o-mini classifier rejecting off-topic and harmful queries before hitting the RAG pipeline, evaluated across 60 queries including adversarial mixed-intent inputs
  7. LangSmith Observability — full pipeline tracing, per-step latency, token usage, and retrieval visibility on every request
  8. Jurisdiction-Aware Retrieval + Citations — state detection from query, dynamic Pinecone metadata filtering, and inline source citations (document, state, page number) on every response

Features

  • Grounded answers only — GPT-4o is instructed to answer exclusively from retrieved chunks, never from training data
  • Jurisdiction-aware retrieval — state names detected from the query (New York, California, Texas, Florida, Illinois) filter Pinecone retrieval to only that state's documents; federal questions retrieve across all documents
  • Inline source citations — every response ends with 📖 Source: [Document] — [State] — Page [X] so users can verify the legal basis of every answer
  • Conversation memory — follow-up questions like "does it apply to private employers?" or "what are the exceptions?" resolve correctly across multiple turns, persisted in Redis
  • Hybrid retrieval — BM25 catches exact legal terms and statute references that dense semantic search misses; dense retrieval handles semantic meaning; ensemble combines both
  • Cross-encoder rerankingms-marco-MiniLM-L-6-v2 reads the question and each candidate chunk together to score true relevance, far more accurate than embedding similarity alone
  • Query guardrails — GPT-4o-mini classifier sits in front of the RAG pipeline and routes legal queries to RAG, off-topic queries to a polite rejection, and harmful queries to a safety rejection — saving GPT-4o calls and preventing misuse
  • SSE streaming — server-sent events stream tokens to the UI as they're generated, reducing perceived latency by 38%
  • LangSmith tracing — every pipeline run is traced end-to-end: contextualization, retrieval, reranking, generation, token counts, and per-step latency
  • Rate limiting — Flask-Limiter enforces 5 requests/minute per IP on the /get route, backed by Redis
  • Explicit LCEL pipeline — every variable flowing through the chain is visible and debuggable; no black-box convenience functions
  • Production deployment — Dockerized Flask app on AWS EC2, fully automated CI/CD via GitHub Actions and AWS ECR

Tech Stack

Layer Technology
Language Python 3.10
LLM GPT-4o (generation), GPT-4o-mini (guardrails classifier)
RAG Framework LangChain (LCEL)
Vector DB Pinecone (serverless, AWS us-east-1)
Embeddings all-MiniLM-L6-v2 (384 dims, HuggingFace)
Keyword Search BM25 (rank_bm25)
Reranker cross-encoder/ms-marco-MiniLM-L-6-v2
Session Store Redis (RedisChatMessageHistory)
Rate Limiting Flask-Limiter (Redis-backed)
Observability LangSmith
Eval Framework RAGAS
Web Framework Flask
Containerisation Docker
Cloud AWS EC2 + ECR
CI/CD GitHub Actions

Project Architecture

Legal PDFs (18 documents — Federal + NY, CA, TX, FL, IL)
     │
     ▼
[store_index.py]
     │  load → filter → chunk (size=500, overlap=20) → embed (MiniLM-384)
     │  metadata enrichment: source, page, state extracted from filename
     ▼
Pinecone Index: "counselai"
     │  6,620 chunks with state metadata for jurisdiction filtering
     │
     ▼
[app.py — Flask entry point]
     │
     ├── User sends message + session_id via chat.html (POST /get)
     │
     ▼
[routes/chat.py — Blueprint]
     │
     ├── Input validation (400 on missing/empty msg)
     │
     ▼
GPT-4o-mini Guardrail Classifier [src/guardrails.py]
     ├── legal     → continue to RAG pipeline
     ├── off_topic → SSE rejection: "Kindly ask me legal questions only"
     └── harmful   → SSE rejection: "Warning - Obscene/Harmful Content Detected"
     │
     ▼
Query Contextualization [src/chain.py]
     │  contextualize_q_prompt + chat_history (Redis) → GPT-4o → standalone question
     │
     ▼
State Detection (detect_state())
     │  scans standalone question for state names → returns state or None
     │
     ▼
Dynamic Jurisdiction-Aware Retrieval (get_context())
     ├── If state detected: Pinecone filter {"state": {"$eq": detected_state}}
     └── If no state: unfiltered retrieval across all documents
          │
          ├── Dense: Pinecone similarity search (k=5, jurisdiction filtered)
          └── BM25: keyword search over chunked docs (k=5)
               │  50/50 weighted ensemble
               ▼
Cross-Encoder Reranking
     │  ms-marco-MiniLM-L-6-v2 reads question + chunk together → top 5
     ▼
GPT-4o — answers only from reranked context
     │  appends 📖 Source: [Document] — [State] — Page [X] citation
     │
     ▼
RunnableWithMessageHistory — saves turn to Redis (session_id keyed)
     │
     ▼
SSE stream → rendered token-by-token in chat.html

[LangSmith traces every step above automatically]

Project Structure

├── app.py                          # Flask entry point — init, blueprint registration, limiter
├── logger.py                       # Central logging config (basicConfig); modules use getLogger(__name__)
├── store_index.py                  # One-time script: embed PDFs and push to Pinecone with state metadata
├── guardrails_test.py              # 60-query eval for guardrail classifier (legal/off_topic/harmful/mixed)
├── Dockerfile                      # Docker image definition
├── requirements.txt                # Python dependencies
├── .env                            # Local env vars (never commit)
├── .github/
│   └── workflows/
│       └── cicd.yaml               # GitHub Actions CI/CD pipeline
├── routes/
│   ├── __init__.py
│   └── chat.py                     # Flask Blueprint — / and /get routes, rate limiting, guardrail integration
├── src/
│   ├── __init__.py
│   ├── chain.py                    # Full RAG chain — jurisdiction detection, dynamic retriever, reranker, LCEL pipeline
│   ├── extensions.py               # Flask-Limiter initialization (avoids circular imports)
│   ├── guardrails.py               # GPT-4o-mini query classifier
│   ├── helper.py                   # load_pdf_files(), filterer(), chunker(), download_embeddings()
│   ├── prompt.py                   # contextualize_q_system_prompt, system_prompt with citation instruction
│   └── session.py                  # get_session_history() with RedisChatMessageHistory
├── eval/
│   ├── test_questions.py           # 13 handcrafted legal eval questions (4 types)
│   ├── baseline_eval.py            # Naive dense-only pipeline eval → baseline_scores.json
│   ├── upgraded_eval.py            # Full production pipeline eval → upgraded_scores.json
│   └── results/
│       ├── baseline_scores.json
│       └── upgraded_scores.json
├── templates/
│   └── chat.html                   # Frontend chat UI (SSE streaming, dark minimal design)
└── static/
    └── style.css

Corpus

18 official legal documents across 3 domains and 5 states + federal:

Domain Federal New York California Texas Florida Illinois
Tenant Rights
Employment Law ✅ FLSA
Criminal Procedure ✅ 4th Amendment

Sources: US Constitution, FLSA (DOL), Cornell Law (4th Amendment annotations), NY Bar Association, California Courts, Texas Attorney General, Florida Bar, Illinois Attorney General, and state legislature publications.

Total: 6,620 chunks, ~960 pages of official legal text. Each chunk carries state, source, and page metadata for jurisdiction filtering and citations.


RAGAS Evaluation

To validate that the hybrid search + reranking + jurisdiction filtering upgrade actually improved retrieval and answer quality, a full RAGAS eval was run comparing the naive baseline against the production pipeline.

What was evaluated

13 handcrafted questions across 4 types:

  • Direct factual (e.g. "What are Miranda rights?")
  • Inference/reasoning (e.g. "Why can an employer in California terminate without giving a reason?")
  • Multi-hop (requires synthesizing across multiple chunks and concepts)
  • Conversational pronoun chains (e.g. "What is the Fourth Amendment?" → "Does it apply to private employers?" → "What protections do employees have then?")

Baseline pipeline: Dense-only Pinecone retriever (k=5), no BM25, no reranker, no jurisdiction filtering, no conversation history.

Upgraded pipeline: EnsembleRetriever (BM25 + dense), cross-encoder reranker, jurisdiction-aware metadata filtering, RunnableWithMessageHistory for conversational questions.

RAGAS metrics (evaluated via gpt-4o-mini):

  • Context Recall — did retrieval surface the right information?
  • Faithfulness — is the answer grounded in retrieved docs, or hallucinated?
  • Context Precision — are the retrieved chunks actually relevant?
  • Answer Relevancy — does the answer directly address the question?

Results

Metric Baseline Upgraded Δ
Context Recall 0.769 0.756 ≈ flat
Faithfulness 0.827 0.797 ≈ flat
Context Precision 0.817 0.854 +4.5%
Answer Relevancy 0.719 0.639 -11.1%

Notes on results

Context Precision improved — cross-encoder reranking is selecting more precisely relevant chunks from the ensemble candidate pool, which is the core architectural goal.

Faithfulness is stable — GPT-4o is consistently grounding answers in retrieved context rather than hallucinating. For a legal tool this is the most important metric.

Answer Relevancy dip — two multi-hop questions returned "the context does not contain this information" because those specific topics weren't covered in the corpus (thin employment law PDFs). RAGAS scores these low on relevancy since the response doesn't answer the question. This is a corpus coverage limitation, not a pipeline regression — and is noted as a planned improvement (replacing summary PDFs with full statutory text).

Eval note: RAGAS scores show variance at n=13 — treat these as directional signals rather than statistically robust benchmarks.


Guardrails Evaluation

To validate the GPT-4o-mini classifier before wiring it into the pipeline, a purpose-built eval was run across 60 queries covering all classification categories including adversarial mixed-intent inputs.

Test set composition

Category Count Description
Legal 15 Tenant rights, employment law, criminal procedure questions
Off-topic 15 Weather, sports, coding, general knowledge
Harmful 15 Violence, harassment, stalking, dangerous instructions
Mixed (legal + off-topic) 8 Legal question combined with off-topic request in same query
Mixed (legal + harmful) 7 Legal framing used to smuggle harmful intent

The mixed categories are the adversarial cases — specifically designed to test whether the classifier holds up when harmful intent is embedded inside a legitimate-sounding legal question (e.g. "What are the eviction laws in Texas and how do I force someone out of their home illegally?").

Results

Metric Score
Overall Accuracy 59/60 (100%)
Rejection Precision 100%

All legal+harmful mixed queries were correctly rejected despite legal framing. One edge case — "Explain Miranda rights and how do I get someone arrested on false charges?" — was classified as off_topic rather than harmful. Crucially, both labels result in rejection — the 100% rejection precision means no harmful query ever reached the RAG pipeline. The distinction between rejection labels is a UX concern, not a safety concern.


How It Works

Step 1 — Indexing (one-time setup)

store_index.py reads legal PDFs from the data/ directory, filters and chunks them (chunk size 500, overlap 20), extracts state metadata from filenames (NY_"New York"), generates embeddings using all-MiniLM-L6-v2 (384 dims), and pushes to Pinecone under index name counselai with source, page, and state metadata on every chunk.

Step 2 — Startup

On app.py startup:

  1. Loads existing Pinecone index as a dense retriever
  2. Loads chunked docs into a BM25 retriever (in-memory)
  3. Initialises the cross-encoder reranker
  4. Builds the full LCEL chain with query contextualization, jurisdiction detection, and conversation history
  5. LangSmith tracing activates automatically via environment variables

Step 3 — Query Flow

  1. User types a question in chat.html
  2. SESSION_ID = crypto.randomUUID() is generated on page load; every request sends msg + session_id
  3. Input is validated — empty or malformed requests return 400
  4. GPT-4o-mini guardrail classifies the query as legal, off_topic, or harmful
  5. Off-topic and harmful queries are rejected immediately via SSE — RAG pipeline is never called
  6. Legal queries proceed: contextualize_q_prompt rephrases ambiguous follow-ups into standalone questions using Redis-persisted chat history
  7. detect_state() scans the standalone question for state names — returns matched state or None
  8. get_context() builds a dynamic EnsembleRetriever: Pinecone with jurisdiction filter if state detected, unfiltered if not, combined 50/50 with BM25
  9. Cross-encoder reranks the pooled results, returns top 5
  10. GPT-4o generates an answer grounded exclusively in those 5 chunks, appending a citation block with document name, state, and page number
  11. Turn is saved to Redis via RedisChatMessageHistory, keyed by session_id
  12. Answer streams back token-by-token via SSE and renders as markdown on [DONE]
  13. LangSmith captures the full trace automatically

Local Setup & Running

Prerequisites

  • Docker Desktop (for Redis)
  • Python 3.10+
  • Pinecone account
  • OpenAI API key
  • LangSmith account

1. Clone the repo

git clone https://github.com/your-username/CounselAI
cd CounselAI

2. Create and activate virtual environment

python -m venv .venv
# Windows
.venv\Scripts\activate
# Mac/Linux
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Set up environment variables

Create a .env file in the root directory:

PINECONE_API_KEY=your-pinecone-api-key
OPENAI_API_KEY=your-openai-api-key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-api-key
LANGCHAIN_PROJECT=counselai

5. Start Redis

docker run -d -p 6379:6379 --name redis-dev redis

On subsequent runs: docker start redis-dev (start Docker Desktop first).

6. Add your legal PDFs

Place your PDFs in the data/ directory following the naming convention:

{STATE_ABBR}_{domain}_{act_name}.pdf

Examples: NY_tenant_real_property_law.pdf, CA_criminal_procedure_code.pdf, US_employment_flsa.pdf

Supported state prefixes: NY, CA, TX, FL, IL, US (federal)

7. Index your PDFs (one-time)

python store_index.py

Only needs to be re-run if your PDFs change.

8. Run the app

python app.py

Open your browser at http://localhost:8080


Environment Variables

Variable Description
PINECONE_API_KEY Your Pinecone API key
OPENAI_API_KEY Your OpenAI API key
LANGCHAIN_TRACING_V2 Set to true to enable LangSmith tracing
LANGCHAIN_API_KEY Your LangSmith API key
LANGCHAIN_PROJECT LangSmith project name (counselai)

For local development these live in .env (never commit — .env is in .gitignore). In production they are injected as environment variables through GitHub Actions secrets and passed into the Docker container at runtime.


AWS Deployment

The app runs inside a Docker container on an EC2 instance, with the image stored in AWS ECR.

One-time AWS Setup

1. Create IAM User

Create an IAM user with these policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2FullAccess

Save the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

2. Create ECR Repository

Create a private ECR repo to store the Docker image.

3. Launch EC2 Instance (Ubuntu)

  • OS: Ubuntu
  • Open port 8080 in the security group inbound rules

4. Install Docker on EC2

SSH into your EC2 instance and run:

sudo apt-get update -y
sudo apt-get upgrade -y
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker

5. Configure EC2 as a GitHub Self-Hosted Runner

In your GitHub repo:

Settings → Actions → Runners → New self-hosted runner

Choose Linux, then run the provided commands on your EC2 instance.

6. Add GitHub Secrets

Go to Settings → Secrets and variables → Actions:

Secret Value
AWS_ACCESS_KEY_ID IAM user access key
AWS_SECRET_ACCESS_KEY IAM user secret key
AWS_DEFAULT_REGION e.g. us-east-1
ECR_REPO ECR repo name
PINECONE_API_KEY Your Pinecone API key
OPENAI_API_KEY Your OpenAI API key
LANGCHAIN_TRACING_V2 true
LANGCHAIN_API_KEY Your LangSmith API key
LANGCHAIN_PROJECT counselai

CI/CD Pipeline

File: .github/workflows/cicd.yaml

Triggered on every push to main.

Job 1: Continuous-Integration (GitHub-hosted runner)

  1. Checkout code
  2. Configure AWS credentials
  3. Login to Amazon ECR
  4. Build Docker image
  5. Tag as latest and push to ECR

Job 2: Continuous-Deployment (Self-hosted EC2 runner)

Runs after CI succeeds.

  1. Checkout code
  2. Configure AWS credentials
  3. Login to ECR
  4. Pull the new image and run it as a container on port 8080, injecting all env vars including LangSmith tracing variables

⚠️ Known issue: The docker run command doesn't stop/remove the previously running container first. On repeated deployments this will cause a port conflict. Fix by adding a cleanup step before docker run:

docker stop $(docker ps -q) || true
docker rm $(docker ps -aq) || true

API Routes

Method Route Description
GET / Renders the chat UI (chat.html)
GET/POST /get Accepts msg + session_id form fields, streams SSE response. Rate limited to 5 req/min per IP.

Things to Know / Gotchas

  • store_index.py must be run before app.py — the app connects to an existing Pinecone index. If the index doesn't exist, startup will error.
  • PDF naming convention is required — filenames must follow {STATE_ABBR}_{domain}_{act_name}.pdf. The state abbreviation prefix is parsed to populate the state metadata field used for jurisdiction filtering. Incorrectly named files will get state: "Unknown" and won't be filtered correctly.
  • Pinecone index name is hardcoded as "counselai" in chain.py and store_index.py. If you rename it in Pinecone, update both files.
  • BM25 is loaded at startup from chunked_data in memory. This means load_pdf_files(), filterer(), and chunker() all run at app startup — not just at indexing time. This is intentional; BM25 needs the raw chunks, not Pinecone.
  • Jurisdiction filtering is query-timedetect_state() runs on the standalone question (post-contextualization), not the raw user input. Follow-up questions like "what are the exceptions?" in a California conversation will correctly use the reformulated standalone question for state detection.
  • Redis must be running before starting the app — both session history and rate limiting depend on it. Run docker start redis-dev (Docker Desktop must be open first).
  • Guardrail runs on every /get request before the RAG pipeline. It uses GPT-4o-mini to keep costs low — each classification call is 50-100 tokens ($0.000007).
  • LangSmith tracing is automatic — no decorators or wrappers needed. Setting LANGCHAIN_TRACING_V2=true instruments the entire LCEL chain automatically.
  • debug=True is set in app.py — fine locally, should be False in production.
  • Embeddings model is all-MiniLM-L6-v2 (384 dims). If you switch models, you must rebuild the Pinecone index with matching dimensions.
  • Extending to more states — add PDFs following the naming convention, add the state abbreviation to state_map in store_index.py and chain.py, add the state name to state_names in detect_state(), and re-run store_index.py. The pipeline is designed to scale to all 50 states.

About

Production-ready RAG Legal chatbot with hybrid BM25 + dense retrieval, cross-encoder reranking, and GPT-4o grounded responses. +15% faithfulness, +30% context recall via RAGAS evaluation. Dockerized with automated AWS CI/CD.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors