
Hermes

Context memory layer for AI coding agents. Extracts architectural decisions from conversations, stores them permanently, and injects them into every prompt so the agent never forgets what was decided, no matter how long the conversation gets.

Evaluated on LoCoEval (48-56 turn repository-oriented coding conversations, 40K token context limit).

Architecture


Why?

Every AI app builder (Bolt, Lovable, v0) treats conversation history as a FIFO buffer. Context fills up, old messages get silently dropped, and the agent starts contradicting itself. User picked Postgres at message 3, defined a schema at message 8, chose JWT auth at message 12. By message 40, all of that is gone.

Hermes fixes this by doing two things before information can be lost:

  • Extraction: After every response, pull out architectural decisions (schema choices, auth methods, routing patterns, dependencies) and store them permanently.
  • Compaction: At task boundaries, summarize old messages instead of dropping them. Topic awareness survives even when raw messages don't.

The agent always knows what was decided (extraction) and what was discussed (compaction). Conversation length becomes irrelevant.


How It Works

Hermes is a TypeScript HTTP server running on Bun, with three endpoints: /init, /query, and /state.

Every round:

  1. Check if conversation topic shifted or context is near limit. If yes, compact.
  2. Assemble prompt: extracted decisions at the top, conversation summary near the query, recent messages in between. (Lost-in-the-Middle placement.)
  3. Call Gemini. Get answer.
  4. Extract decisions from the answer asynchronously. Dedup against existing store (CREATE / MERGE / SKIP).
  5. Return answer.
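The round loop above can be sketched in TypeScript. This is an illustrative sketch, not Hermes' actual code: the types, the token heuristic, and the function names (`handleRound`, `summarize`) are assumptions, and the LLM call is stubbed out as a plain callback.

```typescript
// Hypothetical sketch of the per-round pipeline. Names and types are
// illustrative; the real server lives in src/index.ts.
type Message = { role: "user" | "assistant"; content: string };
type Decision = { summary: string; detail: string };

interface Store {
  decisions: Decision[]; // permanently extracted decisions
  summary: string;       // compounding compaction summary
  recent: Message[];     // raw messages kept verbatim
}

const TOKEN_LIMIT = 35_000;

function estimateTokens(msgs: Message[]): number {
  // crude heuristic: roughly 4 characters per token
  return Math.ceil(msgs.map(m => m.content).join(" ").length / 4);
}

function summarize(msgs: Message[]): string {
  return `(${msgs.length} earlier messages summarized)`;
}

function handleRound(
  store: Store,
  query: string,
  callLLM: (prompt: string) => string,
): string {
  // 1. Compact when context nears the limit
  if (estimateTokens(store.recent) > TOKEN_LIMIT) {
    store.summary = (store.summary + " " + summarize(store.recent)).trim();
    store.recent = [];
  }
  // 2. Assemble prompt: decisions at the top, recent messages in the
  //    middle, summary right before the query (LitM placement)
  const prompt = [
    "DECISIONS:\n" + store.decisions.map(d => "- " + d.summary).join("\n"),
    ...store.recent.map(m => `${m.role}: ${m.content}`),
    "SUMMARY so far: " + store.summary,
    "user: " + query,
  ].join("\n\n");
  // 3. Call the model
  const answer = callLLM(prompt);
  // 4. Async decision extraction would be kicked off here (omitted)
  store.recent.push(
    { role: "user", content: query },
    { role: "assistant", content: answer },
  );
  // 5. Return the answer
  return answer;
}
```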

Memory model: two tiers per decision, L0 (a one-sentence summary) and L1 (structured detail). All decisions across four categories (structure, behavior, relationships, decisions) are always loaded, staying under 4K tokens for 30+ decisions. No classifier, no embeddings, no routing logic.
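One way to picture the two-tier model, as a hedged sketch (the type names and `renderDecisions` helper are assumptions, not the actual Hermes types):

```typescript
// Hypothetical shape of the two-tier memory model. L0 is the one-sentence
// summary that always appears in the prompt; L1 holds structured detail.
type Category = "structure" | "behavior" | "relationships" | "decisions";

interface Decision {
  category: Category;
  l0: string;                 // one-sentence summary, always loaded
  l1: Record<string, string>; // structured detail, e.g. { table: "users" }
}

// Every decision in all four categories is rendered every round; no
// classifier or embedding lookup decides what to include.
function renderDecisions(store: Decision[]): string {
  const byCategory = new Map<Category, Decision[]>();
  for (const d of store) {
    const bucket = byCategory.get(d.category) ?? [];
    bucket.push(d);
    byCategory.set(d.category, bucket);
  }
  return [...byCategory.entries()]
    .map(([cat, ds]) => `## ${cat}\n` + ds.map(d => `- ${d.l0}`).join("\n"))
    .join("\n");
}
```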

Compaction triggers: token pressure (above a 35K-token threshold), or a topic shift gated on pressure (above 25K tokens). Old messages get summarized, and summaries compound across compactions.
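The two triggers combine into a small predicate. The thresholds come from the text above; the function and parameter names are assumptions for illustration:

```typescript
// Sketch of the compaction trigger logic described above.
const HARD_PRESSURE = 35_000; // token pressure: always compact above this
const SOFT_PRESSURE = 25_000; // topic shifts only compact above this

function shouldCompact(contextTokens: number, topicShifted: boolean): boolean {
  if (contextTokens > HARD_PRESSURE) return true;                 // pure token pressure
  if (topicShifted && contextTokens > SOFT_PRESSURE) return true; // gated topic shift
  return false;
}
```

The gate keeps cheap early-conversation topic changes from triggering needless compactions while the context is still small.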


What the LLM Sees

This is the LitM (Lost-in-the-Middle) placement that makes the whole thing work. The summary sits as a synthetic assistant message right before the query, in the high-attention recency zone. This single positioning change took TA from 0.541 to 0.899.

LitM Context
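In message-array terms, the placement looks roughly like the sketch below. The `assembleLitM` helper and the message shape are assumptions, not the documented API; the point is the ordering, with the summary injected as a synthetic assistant message in the recency zone:

```typescript
// Illustrative LitM assembly: decisions at the top, recent messages in the
// middle, the compaction summary immediately before the query.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function assembleLitM(
  decisionsBlock: string,
  recent: ChatMessage[],
  summary: string,
  query: string,
): ChatMessage[] {
  return [
    { role: "system", content: decisionsBlock }, // top: extracted decisions
    ...recent,                                   // middle: recent raw messages
    {                                            // recency zone: synthetic summary
      role: "assistant",
      content: `Summary of earlier discussion: ${summary}`,
    },
    { role: "user", content: query },            // the query itself
  ];
}
```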


Benchmark

Evaluated on LoCoEval multi-hop coding conversations (48-56 turns, 40K token context limit) across two repositories (Kinto, Falcon). Metrics: Information Extraction (IE) measures fact recall accuracy; Topic Awareness (TA) measures whether the agent remembers what was discussed.

Function Completion (FC) was excluded because it tests the underlying model's code generation capability, which is orthogonal to context management. Both agents scored identically on FC with the same functions failing, confirming it measures model skill, not context strategy.

| Metric | TruncateAgent | HermesAgent | Delta |
|---|---|---|---|
| Topic Awareness F1 | 0.381 | 0.736 | +93% |
| Information Extraction F1 | 0.763 | 0.652 | -14.5% |

Per-repo breakdown:

| Repo | Metric | TruncateAgent | HermesAgent | Delta |
|---|---|---|---|---|
| Kinto | TA | 0.362 | 1.000 | +176% |
| Kinto | IE | 0.728 | 0.647 | -11% |
| Falcon | TA | 0.400 | 0.471 | +18% |
| Falcon | IE | 0.797 | 0.657 | -18% |

Both agents use identical code retrieval (SimilarFunctionParser), the same Gemini Flash backbone, the same mock user, and the same judge. The only variable is the context management strategy.

Why TA wins: Extracted decisions + compaction summaries preserve topic awareness that truncation silently drops. Kinto's perfect 1.000 means Hermes remembered every topic from a 48-turn conversation.

Why IE drops: Hermes has higher recall (finds more ground truth facts) but lower precision (generates more false positives). The decisions block gives the model architectural context that encourages over-elaboration. This is a generation-side issue, not a context management failure, and a clear target for iteration.


Where Hermes Fits

| Tool | What it solves | Hermes relationship |
|---|---|---|
| OMEGA, Mem0, Zep | Cross-session memory (between conversations) | Complementary; Hermes is within-session. |
| Letta (MemGPT) | Full agent runtime with memory | Hermes is middleware, not a runtime. No rewrite needed. |
| Deep Agents, FlashCompact | Context compression | Hermes adds decision extraction on top of compression. |
| Factory.ai | Structured coding agent compression | Validates our approach. Same insight, different implementation. |

Stack

| Component | Choice |
|---|---|
| Server | TypeScript, Bun |
| LLM | Gemini 3 Flash (backbone, extraction, compaction); Gemini 3.1 Flash Lite (dedup) |
| Memory | In-memory Map, 4 categories |
| Benchmark | LoCoEval (Python, unmodified except Gemini adapter patches) |
| Bridge | 25-line Python wrapper forwarding HTTP to the TS server |


Quickstart

```sh
# Install dependencies
bun install

# Set your Gemini API key
export GEMINI_API_KEY="your-key-here"

# Start Hermes server
bun run src/index.ts

# Inspect memory store
curl localhost:3000/state | jq
```

Status

Research prototype. The core pipeline works and benchmarks well on TA, but IE precision needs iteration. See LEARNINGS.md for the full development story.

Contributions welcome — especially around extraction prompt tuning, decision compaction, and alternative dedup strategies.


Research

Built on: Lost-in-the-Middle (Stanford/Berkeley) | OpenViking L0/L1/L2 (ByteDance, 2026) | Deep Agents Compaction (LangChain, 2026) | ACE Pattern (Zhang et al., 2026) | Factory.ai Structured Compression (2025)
