A modular memory system for AI agents that combines short-term, episodic, and semantic memory to enable coherent long-term conversations — even with small models and limited context windows (e.g., Ollama, LocalAI, or LLaMA.cpp).
This project provides a three-layered memory design inspired by human cognition:
| Memory Type | Storage | Purpose |
|---|---|---|
| Short-Term Memory | Redis | Keeps recent messages for active conversation context |
| Episodic Memory | PostgreSQL | Stores summarized conversation episodes once a context window limit is reached |
| Semantic (Knowledge) Memory | Vector Database (e.g. FAISS, Qdrant, Chroma) | Retains facts, summaries, and documents for retrieval and reasoning |
Each layer plays a specific role in maintaining context, learning from prior interactions, and providing user-specific knowledge recall.
The Short-Term Memory stores the most recent chat messages in a fast cache (Redis). It allows the model to maintain continuity during a session.
Schema:
session_id # unique UUID for the chat session
role # "human" or "ai"
message # chat content
token_count # running token total for this session
Trigger: When the token count reaches approximately 80% of the model’s context window, the conversation is summarized and offloaded to long-term memory.
The Episodic Memory stores high-level summaries of conversations in a relational database (PostgreSQL). Each record represents a complete episode of interaction — what the user asked, how the AI responded, and a combined summary for context restoration.
Schema:
session_id
human_summary
ai_summary
combined_summary
timestamp
After saving a summary, the model receives the combined summary as the new context, allowing it to continue the conversation seamlessly.
The Semantic Memory is a vector-based knowledge store that contains embeddings of:
- past conversation summaries
- user-uploaded documents
- extracted facts and relations
This allows semantic search and knowledge grounding across sessions. It is user-dependent, meaning each user can recall and query their own history and knowledge base.
Example:
“Remind me how we configured Redis last week.” The system retrieves the relevant episodic summary via vector similarity.
┌────────────────────────┐
│ User Interaction │
└────────────┬───────────┘
│
▼
┌────────────────────┐
│ Short-Term Memory │ ← Redis (recent chat)
└────────┬───────────┘
│ summarize when ~80% context used
▼
┌────────────────────┐
│ Episodic Memory │ ← PostgreSQL (summaries)
└────────┬───────────┘
│ embed summaries
▼
┌────────────────────┐
│ Knowledge Memory │ ← Vector DB (facts, embeddings)
└────────────────────┘
- Enable context continuity beyond small model limits
- Support memory persistence between sessions
- Provide semantic recall across conversations
- Build foundation for fine-tuning datasets (from episodic memory)
| Component | Technology |
|---|---|
| Cache | Redis |
| Database | PostgreSQL / SQLModel |
| Vector Store | FAISS / Milvus |
| Language Models | Ollama / LocalAI / LLaMA.cpp |
| API Layer | Flask / FastAPI |
- User starts a chat → new session_id (UUID)
- Messages are stored in Short-Term Memory (Redis)
- On each new user/assistant message: check token usage
- If ~80% context is reached:
- Generate human_summary, ai_summary, combined_summary
- Persist to Episodic Memory (SQL)
- Write combined_summary back into Redis as a context snapshot (e.g., a system/memory message) and optionally prune old turns
- Continue the same turn: build the prompt from the snapshot + recent tail and answer the user’s message based on that
- Optionally (async): embed the combined_summary and index it in Knowledge Memory (VectorDB) for later recall
- Add procedural memory for tool use and agent behavior patterns
- Implement memory pruning and relevance scoring
- Integrate LangGraph for agent orchestration
- Enable user-personalized knowledge graphs
MIT License © 2025 [Christopher Abanilla @ Evolvingstack]