A sophisticated Agentic RAG (Retrieval-Augmented Generation) system tailored for Indian legal research, capable of understanding complex legal queries, synthesizing case law, and providing actionable legal advice.

### Features

- Intelligent Legal Research: Automated scraping of judgments from eCourts portal
- Multi-Modal Data Processing: Handles both online judgments and offline legal PDFs
- Advanced RAG Pipeline: Combines semantic search (ChromaDB) with knowledge graphs (Neo4j)
- Agentic Reasoning: Uses LangChain/LangGraph for complex query orchestration
- Legal Entity Recognition: Extracts judges, acts, precedents, and case relationships
- CAPTCHA Handling: Integrated EasyOCR for automated form solving
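
To illustrate the Legal Entity Recognition feature, here is a minimal rule-based sketch. It is not the project's extractor (the project lists spaCy for NLP); the regular expressions, the `LegalEntities` dataclass, and `extract_entities` are hypothetical names chosen for illustration, and the patterns cover only a few citation styles.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; real judgments use many more formats.
CASE_RE = re.compile(r"\b([A-Z]+(?:\.[A-Z]+)*\.?)\s+(\d+)/(\d{4})\b")
JUDGE_RE = re.compile(r"\bJustice\s+([A-Z][a-zA-Z]+)")
SECTION_RE = re.compile(r"\bSection\s+(\d+[A-Z]?)\s+(IPC|CrPC)\b")


@dataclass
class LegalEntities:
    case_numbers: list[str] = field(default_factory=list)
    judges: list[str] = field(default_factory=list)
    sections: list[tuple[str, str]] = field(default_factory=list)


def _unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_entities(text: str) -> LegalEntities:
    """Collect case numbers, judge surnames, and (section, act) pairs."""
    cases = [m.group(0) for m in CASE_RE.finditer(text)]
    judges = [m.group(1) for m in JUDGE_RE.finditer(text)]
    sections = [(m.group(1), m.group(2)) for m in SECTION_RE.finditer(text)]
    return LegalEntities(_unique(cases), _unique(judges), _unique(sections))


sample = (
    "In CRL.A. 123/2023, Justice Singh upheld the conviction "
    "under Section 302 IPC."
)
entities = extract_entities(sample)
print(entities)
```

A statistical NER model would generalize better than fixed patterns; the regex version is only meant to show the shape of the extracted data.
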

### Tech Stack

- Automation & Scraping: Python, Selenium, undetected-chromedriver, EasyOCR
- Document Processing: PyMuPDF, pdfplumber, LangChain text splitters, spaCy NLP
- Databases: ChromaDB (Vector store), Neo4j (Knowledge Graph)
- AI & Orchestration: LangChain/LangGraph, Groq/OpenAI/Anthropic/Gemini APIs
- UI: Streamlit/Gradio chatbot interface, Flask web app with Web3 integration

### Installation

- Clone the repository: `git clone <repository-url>`, then `cd law-aware-ai`
- Install dependencies: `uv sync`
- Set up environment variables: `cp .env.example .env`, then edit `.env` with your API keys and database credentials
- Install additional NLP models (optional): `python -m spacy download en_core_web_sm`

### Usage

```bash
# Scrape judgments from eCourts
python main.py scrape --case-number "CRL.A. 123/2023" --court "Delhi High Court" --max-pages 5
# Process PDF documents
python main.py process --pdf-file data/pdfs/sample.pdf
python main.py process --pdf-dir data/pdfs/
# Store processed documents in vector database
python main.py store --processed-dir data/processed/
python main.py store --json-file data/processed/sample.json
# Populate and query knowledge graph
python main.py graph --populate --processed-dir data/processed/
python main.py graph --query "cases by judge Justice Singh"
# Run intelligent legal research agent
python main.py agent --query "Explain Section 302 IPC interpretation in murder cases"
python main.py agent --query "Compare death penalty approaches in Indian courts" --verbose
# Start web interface with Mars chatbot
python main.py web
python main.py web --host 127.0.0.1 --port 8000 --debug
```

### Demo Scripts

```bash
# Phase 1-2 Demo: Data processing and vector storage
python demo_phase2.py
# Phase 3 Demo: Knowledge graph construction
python demo_phase3.py
# Phase 4 Demo: Agentic RAG pipeline
python demo_phase4.py
```

### Implementation Status

- Project structure with modular src/ layout
- uv-based dependency management
- Pydantic settings with environment variables
- CLI interface with subcommands
- eCourts web scraping with CAPTCHA handling
- PDF text extraction and entity recognition
- Legal-aware text chunking for semantic search
- ChromaDB vector storage with metadata
- Neo4j graph database integration
- Legal relationship modeling (cases, judges, acts, precedents)
- Graph population pipeline from processed documents
- Natural language graph querying
- Intelligent query routing (vector/graph/hybrid/complex)
- LangGraph orchestration for multi-step reasoning
- Legal entity extraction and context enrichment
- Comprehensive answer synthesis from multiple sources
- Confidence scoring and reasoning transparency
- Streamlit/Gradio chatbot interface
- Conversation memory and context management
- Production deployment with monitoring
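
The "legal-aware text chunking" item above can be pictured as follows. This is a sketch under one stated assumption: that paragraphs of a judgment start with a number and a period at the beginning of a line. `split_paragraphs`, `chunk_paragraphs`, and `max_chars` are illustrative names, not the project's API.

```python
import re

# Assumption: numbered paragraphs such as "12. The appellant ..." start a line.
PARA_START = re.compile(r"^(?=\d+\.\s)", re.MULTILINE)


def split_paragraphs(text: str) -> list[str]:
    """Split a judgment into its numbered paragraphs."""
    return [part.strip() for part in PARA_START.split(text) if part.strip()]


def chunk_paragraphs(paragraphs: list[str], max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph is never cut in half; one that is longer than max_chars
    becomes a chunk of its own.
    """
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


judgment = "1. Alpha\n2. Beta\n3. Gamma"
for chunk in chunk_paragraphs(split_paragraphs(judgment), max_chars=20):
    print(repr(chunk))
```

Keeping paragraph boundaries intact means each embedded chunk carries a complete unit of reasoning, which tends to matter more for legal text than hitting an exact chunk size.
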

```bash
python main.py scrape --case-number "CRL.A. 123/2023" --court "Delhi High Court"
python main.py process --pdf-dir data/judgments/
```

### Development
```bash
# Install dev dependencies
uv sync --dev
# Run tests
pytest
# Format code
black src/
```

### Project Structure

```
law-aware-ai/
├── config/
│   ├── settings.py              # Configuration management
│   └── __init__.py
├── data/
│   ├── judgments/               # Raw judgment PDFs
│   ├── processed/               # Processed text data
│   ├── vector_db/               # ChromaDB storage
│   └── graph_db/                # Neo4j data
├── src/
│   ├── data_processing/
│   │   ├── ecourts_scraper.py   # eCourts portal scraper
│   │   └── pdf_processor.py     # PDF text extraction
│   ├── vector_store/            # ChromaDB integration
│   ├── graph_store/             # Neo4j knowledge graph
│   ├── agent/                   # LangChain orchestration
│   └── ui/                      # Streamlit/Gradio interfaces
├── main.py                      # CLI entry point
├── pyproject.toml               # Project configuration
├── .env.example                 # Environment template
└── README.md
```

### How It Works

- Data Acquisition: Scrape eCourts portal and process offline PDFs
- Text Processing: Extract and clean legal text, identify sections
- Entity Extraction: Use NLP to identify legal entities and relationships
- Vector Storage: Chunk and embed text in ChromaDB for semantic search
- Knowledge Graph: Store relationships in Neo4j for structural queries
- Agent Orchestration: Route queries to appropriate data sources
- Response Synthesis: Generate comprehensive legal answers
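
As a rough sketch of the Knowledge Graph step, extracted entities could be turned into parameterized Cypher statements like this. The node labels (`Case`, `Judge`, `Section`), the relationship types (`DECIDED`, `CITES`), and the `build_graph_statements` helper are assumptions made for illustration; the project's actual graph schema is not documented here.

```python
def build_graph_statements(
    case_number: str, judges: list[str], sections: list[str]
) -> list[tuple[str, dict]]:
    """Return (cypher, params) pairs that upsert one case and its links."""
    statements: list[tuple[str, dict]] = [
        ("MERGE (c:Case {number: $number})", {"number": case_number})
    ]
    for name in judges:
        statements.append((
            "MATCH (c:Case {number: $number}) "
            "MERGE (j:Judge {name: $name}) "
            "MERGE (j)-[:DECIDED]->(c)",
            {"number": case_number, "name": name},
        ))
    for ref in sections:
        statements.append((
            "MATCH (c:Case {number: $number}) "
            "MERGE (s:Section {ref: $ref}) "
            "MERGE (c)-[:CITES]->(s)",
            {"number": case_number, "ref": ref},
        ))
    return statements


for query, params in build_graph_statements(
    "CRL.A. 123/2023", judges=["Singh"], sections=["302 IPC"]
):
    print(query, params)
```

Each `(query, params)` pair could then be passed to a Neo4j session, for example with the official Python driver's `session.run(query, params)`. Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running it on the same document does not duplicate nodes.
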
### Query Types

- Factual Queries: "What is the judgment in case X?" → ChromaDB
- Relational Queries: "How does case X relate to case Y?" → Neo4j
- Complex Queries: Multi-source reasoning with LLM synthesis
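
The routing described above can be sketched with a simple keyword heuristic. This is a stand-in for illustration only: the project describes LangGraph-based orchestration, and the cue lists, the `Route` enum, and the `route_query` function below are hypothetical.

```python
from enum import Enum


class Route(Enum):
    VECTOR = "vector"    # factual lookups, served from ChromaDB
    GRAPH = "graph"      # relational questions, served from Neo4j
    COMPLEX = "complex"  # multi-source reasoning with LLM synthesis


# Hypothetical cue phrases; a real router would likely use an LLM classifier.
COMPLEX_CUES = ("compare", "explain", "analyse", "analyze", "approaches")
RELATIONAL_CUES = ("relate", "cited by", "cites", "overrule", "cases by")


def route_query(query: str) -> Route:
    """Pick a data source for a query based on cue phrases."""
    q = query.lower()
    if any(cue in q for cue in COMPLEX_CUES):
        return Route.COMPLEX
    if any(cue in q for cue in RELATIONAL_CUES):
        return Route.GRAPH
    return Route.VECTOR


for question in (
    "What is the judgment in case X?",
    "How does case X relate to case Y?",
    "Compare death penalty approaches in Indian courts",
):
    print(f"{question} -> {route_query(question).value}")
```

Complex cues are checked first so that a question mixing both kinds of cue falls through to multi-source reasoning instead of a single store. Factual lookup is the default because it is the cheapest path.
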

### Roadmap

- Phase 1: Data acquisition and processing ✅
- Phase 2: Vector storage and semantic search
- Phase 3: Knowledge graph construction
- Phase 4: Agentic RAG implementation
- Phase 5: UI development and deployment

### Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request

### License

This project is licensed under the MIT License. See the LICENSE file for details.

### Disclaimer

This system is designed for legal research assistance and should not be considered a substitute for professional legal advice. Always consult qualified legal professionals for specific legal matters.