I recently went down a rabbit hole digging into the Hermes agent; my goal was to get a deeper understanding of LLM context engineering and memory management in practice.
Going through each line of code manually to reconstruct a mental model is tedious, so I did what any reasonable engineer would do: I asked Claude and ChatGPT to map out the complete flow for me, from raw user input all the way to the streamed response on screen.
While reading through that flow, I discovered something worth paying attention to: Hermes uses Honcho as an external memory provider. Honcho isn't just a data store; it brings dialectic reasoning over past conversations, deeper user modeling, semantic search, derived conclusions, and per-peer/profile isolation to the table. Disable it, and you're running a meaningfully different system. Architecturally, earlier versions of Hermes were even more tightly coupled to Honcho than they are today.
How Hermes manages memory
At its core, Hermes takes a two-layer approach to persistence:
- Raw layer: all messages are stored in state.db (SQLite) and flat JSON files
- Curated layer: distilled memory lives in MEMORY.md and USER.md, capped at 2,200 characters
The curated layer is what makes it interesting. It isn't append-only: as conversations grow, older memories get evicted and replaced with newer, more relevant ones. The result is a rolling, importance-weighted summary of what matters.
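To make the eviction idea concrete, here is a hypothetical sketch of a rolling, capped memory buffer: entries carry an importance score, and low-importance entries are dropped first when the character budget (Hermes defaults to ~2,200 characters for MEMORY.md) is exceeded. The function name and scoring scheme are illustrative, not Hermes's actual implementation.

```python
def curate(entries: list[tuple[float, str]], char_limit: int = 2200) -> str:
    """Keep the most important entries that fit within char_limit."""
    kept: list[tuple[float, str]] = []
    used = 0
    # Greedily keep entries from most to least important.
    for importance, text in sorted(entries, key=lambda e: -e[0]):
        if used + len(text) + 1 <= char_limit:
            kept.append((importance, text))
            used += len(text) + 1  # +1 for the newline separator
    # Preserve the original insertion order among survivors for readability.
    order = {text: i for i, (_, text) in enumerate(entries)}
    kept.sort(key=lambda e: order[e[1]])
    return "\n".join(text for _, text in kept)
```

The key property is that the output is bounded no matter how many entries accumulate; what survives is decided by importance, not recency alone.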
The context bloat problem, and how Hermes solves it cheaply
Here's a question I had early on: Hermes uses SQLite FTS5 for lexical search over stored messages. What happens when a query returns too many matches? If you naively dump all of them into the final LLM call, you blow up the context window, and your costs along with it.
Hermes handles this with a two-model pattern. A cheap auxiliary model acts as a filter, reviewing the candidate results and deciding what actually deserves to make it into the main context. Only the filtered, relevant subset goes into the final, expensive model call. It's not free (the auxiliary call does cost tokens), but it's a smart trade-off that keeps the main call lean.
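The shape of that pattern can be sketched in a few lines. `cheap_judge` below stands in for the auxiliary LLM call, and the function name and character budget are illustrative, not Hermes's actual API:

```python
from typing import Callable

def filter_candidates(
    query: str,
    candidates: list[str],
    cheap_judge: Callable[[str, str], bool],
    max_chars: int = 4000,
) -> list[str]:
    """Return only judge-approved snippets, capped at a character budget."""
    kept: list[str] = []
    used = 0
    for snippet in candidates:
        if not cheap_judge(query, snippet):
            continue  # the auxiliary model says it's irrelevant
        if used + len(snippet) > max_chars:
            break  # keep the main call lean even if more snippets matched
        kept.append(snippet)
        used += len(snippet)
    return kept
```

The expensive main model only ever sees what survives both the relevance judgment and the hard character cap.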
The Core Truth
- Hermes is a tool-using agent with small built-in persistent memory, full transcript storage, and optional pluggable semantic memory backends.
- Hermes is not a local-first lifelong semantic memory system by default.
- Hermes does NOT load your full history into every prompt.
- It loads a small curated memory snapshot always, and retrieves past session context ON DEMAND - only when the main model decides to call a tool for it.
- Without an external memory provider (Honcho etc.):
- Past session recall = keyword search (SQLite FTS5)
- If you used different words back then, it may not find it
- The LLM must decide to search, nothing is auto-injected from past sessions
COMPLETE MENTAL MODEL
1. YOU → Send a Message
You type a message in the CLI, or send one via Telegram / Discord / Slack / WhatsApp / Signal / Matrix / webhook / API.
2. Entry Point (cli.py or gateway/)
- Restores the current session if the --continue/-c flag was passed (looks up the most recent session from SQLite by source platform)
- Creates a new AIAgent instance (run_agent.py → AIAgent.init()); in gateway mode, one agent is created per incoming message
- Passes your message into agent.run_conversation(user_message)
One AIAgent is created per session. In gateway mode a fresh AIAgent is created for each incoming platform message, but it reloads the saved conversation history so continuity is preserved.
3. AIAgent Initialization (run_agent.py → init)
Runs once per agent lifetime. Loads everything that is always present.
3a. Client Setup
- Resolves LLM provider credentials (OpenRouter, Anthropic native, OpenAI, Kimi, MiniMax, GitHub Copilot, etc.)
- Selects API mode: chat_completions | anthropic_messages | codex_responses
- Enables Anthropic prompt caching if provider is Claude (caches stable system prompt prefix → ~75% input cost reduction on repeat turns)
3b. Tool Loading (model_tools.py → get_tool_definitions)
- Discovers all available tools filtered by enabled_toolsets / disabled_toolsets
- Loads: terminal, file ops, web search, browser, memory, session_search, skill tools, todo, delegate (subagent), clarify, and more (~40+ tools)
3c. Memory Loading (tools/memory_tool.py → MemoryStore)
Loaded from disk at startup. Always injected into every prompt.
~/.hermes/MEMORY.md - agent's compact working notes
written by the agent itself, curated over time
has a character limit (default ~2200 chars)
~/.hermes/USER.md - your compact profile and preferences
also agent-curated, also character-limited (~1375 chars)
These are NOT your full history. They are a small bounded cache of what the background memory reviewer decided was worth keeping. If the agent never wrote something here, it is not here.
3d. Session Database (hermes_state.py → SessionDB)
- Opens ~/.hermes/state.db (SQLite, WAL mode for concurrent access)
- Creates a new row in the sessions table for this session
- FTS5 virtual table (messages_fts) is already live and auto-indexed via INSERT/UPDATE/DELETE triggers on the messages table
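The trigger-synced FTS5 pattern can be reproduced in a self-contained example: writes to `messages` automatically keep the `messages_fts` index current, so anything persisted is immediately searchable. The schema is simplified from the real one, and this requires an SQLite build with the FTS5 extension (the default in most CPython distributions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, content TEXT);
CREATE VIRTUAL TABLE messages_fts USING fts5(
    content, content='messages', content_rowid='id'
);
-- Triggers keep the index in lockstep with the base table.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER messages_au AFTER UPDATE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
""")
db.execute("INSERT INTO messages(role, content) VALUES (?, ?)",
           ("user", "I solved the XYZ issue by reinstalling"))
hits = db.execute(
    "SELECT rowid FROM messages_fts WHERE messages_fts MATCH 'XYZ'"
).fetchall()
```

Because the index is maintained by triggers, no application code ever has to remember to update it; the INSERT alone makes the message findable.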
3e. External Memory Provider (optional, not default)
Only if configured in ~/.hermes/config.yaml. Hermes supports a pluggable MemoryProvider interface. Currently shipped providers:
- Honcho (cloud-backed dialectic user modeling, semantic recall)
- Hindsight (self-hosted alternative with local PostgreSQL)
- Custom (drop a plugin into ~/.hermes/plugins/)
If an external provider is active:
- _activate_honcho() registers extra tools into the tool list: honcho_context, honcho_profile, honcho_search, honcho_conclude
- An atexit hook is registered so _honcho_sync() fires even on crash
3f. Context Compressor (agent/context_compressor.py)
- Initialized but not yet active
- Watches token usage across turns
- When conversation approaches the model's context limit (default threshold: 50%), it auto-summarizes old messages and truncates history
- After compression, a new child session is created in SQLite with parent_session_id pointing to the previous one (lineage chain)
3g. Iteration Budget (IterationBudget)
- A shared counter capping total LLM calls across the main agent AND any subagents it spawns (default: 90 iterations)
- Prevents infinite tool loops
- At 70% consumed → injects a "start wrapping up" nudge into tool results
- At 90% consumed → injects an urgent "respond now" nudge
- Budget pressure is injected into tool result JSON, not as extra messages (avoids breaking message structure or invalidating prompt cache)
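A hedged sketch of that shared budget: one counter consumed by the parent agent and any subagents, with soft nudges attached to tool-result payloads (not emitted as extra messages) at the 70% and 90% marks. Class and method names are illustrative; the thresholds mirror the defaults described above.

```python
class IterationBudget:
    def __init__(self, limit: int = 90):
        self.limit = limit
        self.used = 0

    def consume(self) -> None:
        self.used += 1  # called once per LLM call, by parent or subagent

    def annotate(self, tool_result: dict) -> dict:
        """Attach budget pressure to a tool result without adding messages."""
        frac = self.used / self.limit
        if frac >= 0.9:
            tool_result["budget_note"] = "URGENT: respond to the user now."
        elif frac >= 0.7:
            tool_result["budget_note"] = "Start wrapping up this task."
        return tool_result
```

Injecting the nudge into the tool-result JSON keeps the message list's structure untouched, which is what preserves the prompt-cache prefix.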
4. run_conversation() - Before the First API Call
4a. Session History Loaded
- If resuming a session: loads prior messages from SQLite via SessionDB.get_messages_as_conversation()
- These messages go directly into the API-facing message list
4b. External Provider Prefetch (if active) - ASYNC
Fires immediately in a background thread, BEFORE the LLM call:
_honcho_prefetch() → calls the provider's context API
→ fetches recent cross-session summaries + the peer representation
→ stores the result in a Future
This runs in parallel with prompt assembly so its latency is hidden. Result is injected into the user message turn just before the API call (NOT into the system prompt - keeps the cached prefix stable).
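The latency-hiding trick is a standard Future pattern, sketched below with a placeholder provider call. `fetch_provider_context` and `run_turn` are stand-ins, not Hermes's actual function names:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_provider_context(user_msg: str) -> str:
    # Placeholder for the provider's context API call (real network latency).
    return f"[prior-session context relevant to: {user_msg}]"

executor = ThreadPoolExecutor(max_workers=1)

def run_turn(user_msg: str) -> str:
    future = executor.submit(fetch_provider_context, user_msg)  # fires first
    prompt = f"USER: {user_msg}"          # prompt assembly overlaps the fetch
    context = future.result()             # block only when actually needed
    # Injected into the user turn, not the system prompt, so the cached
    # system-prompt prefix stays byte-identical across turns.
    return f"{prompt}\n[system note] {context}"
```

The fetch and the prompt assembly run concurrently, so on most turns the `future.result()` call returns without waiting.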
4c. System Prompt Assembly (_build_system_prompt)
Built once per session, then cached. Rebuilt only after context compression.
Assembled in order:
┌─────────────────────────────────────────────────────┐
│ Agent identity / personality (DEFAULT_AGENT_IDENTITY│
│ or contents of ~/.hermes/SOUL.md if present) │
│ │
│ Platform hints (if platform = telegram/discord/etc) │
│ │
│ Memory guidance (instruction: how to use memory) │
│ Session search guidance (instruction: when to search│
│ Skills guidance (instruction: how to use skills) │
│ │
│ Skills content (loaded from ~/.hermes/skills/) │
│ → procedural memory: reusable task playbooks │
│ │
│ Context files (if not skip_context_files): │
│ SOUL.md, AGENTS.md, .cursorrules (if present) │
│ │
│ MEMORY.md content (always, if memory_enabled) │
│ USER.md content (always, if user_profile_enabled) │
│ │
│ Ephemeral system prompt (if set : NOT saved to │
│ trajectories - used for one-off injections) │
└─────────────────────────────────────────────────────┘
4d. User Message Preparation
- Your message is appended to the in-memory message list
- Simultaneously written to SQLite (SessionDB.append_message()) → FTS5 trigger fires → message is indexed immediately
- If external provider prefetch completed:
  → its context is appended to your message content as a system note
  → "The following Honcho memory was retrieved from prior sessions..."
  → tagged clearly so the LLM knows this is continuity context, not your words
4e. Prompt Caching Applied (if Claude)
- agent/prompt_caching.py applies cache_control breakpoints to the message list
- Strategy: system_and_3 (up to 4 cache breakpoints)
- Stable prefix = system prompt → cached at ~1.25x write cost, read at ~0.1x
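For the system-prompt breakpoint specifically, the marker looks roughly like this: an Anthropic-style system block carrying a `cache_control` entry, so repeat turns read the stable prefix from cache instead of re-paying full input price. The full `system_and_3` strategy additionally marks up to three stable message prefixes; this sketch shows only the system block, and the helper name is illustrative.

```python
def apply_system_cache(system_prompt: str) -> list[dict]:
    """Return an Anthropic-style system block with a cache breakpoint."""
    return [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }]
```

Everything before the breakpoint must be byte-identical across turns for the cache to hit, which is exactly why provider context is injected into the user turn rather than here.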
5. Main LLM API Call
The assembled prompt is sent:
- System prompt (stable, cached)
- Prefill messages (if any - few-shot examples)
- Full conversation history for this session
- Your message (with optional provider context appended)
- Tool definitions (all loaded tools)
- Reasoning config (effort level: low/medium/high)
- Max tokens limit
Streaming is on by default. Response comes back token by token.
6. Main Model Decides: Answer or Use Tools
Path A - Model answers directly
→ Jump to Section 9 (Post-Turn)
Path B - Model calls one or more tools
→ Continue to Section 7
7. Tool Dispatch Loop
Hermes runs a loop: call tools → feed results back → model decides again. Capped by the iteration budget (default 90 LLM calls total).
7a. Parallelism Decision (_should_parallelize_tool_batch)
When the model emits a batch of tool calls:
- Single tool → always sequential
- Multiple tools:
  - Any tool in the NEVER_PARALLEL set (e.g. "clarify") → sequential
  - Any unknown/stateful tool → sequential
  - All tools in the PARALLEL_SAFE set (read_file, web_search, etc.) AND no overlapping file paths → run concurrently via ThreadPoolExecutor (max 8 worker threads)
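That decision tree reduces to a small predicate. The set contents and helper name below are illustrative, not copied from Hermes:

```python
NEVER_PARALLEL = {"clarify"}
PARALLEL_SAFE = {"read_file", "web_search", "web_extract", "search_files"}

def should_parallelize(calls: list[dict]) -> bool:
    if len(calls) < 2:
        return False  # a single tool always runs sequentially
    names = [c["name"] for c in calls]
    if any(n in NEVER_PARALLEL or n not in PARALLEL_SAFE for n in names):
        return False  # unknown or stateful tools force sequential execution
    paths = [c.get("path") for c in calls if c.get("path")]
    return len(paths) == len(set(paths))  # no overlapping file paths
```

Note the conservative default: anything not explicitly whitelisted falls back to sequential execution, which is the safe choice for stateful tools like the terminal.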
7b. Tool Routing (in order of precedence)
Route 1: Memory Manager Tools (if external provider active)
if self._memory_manager and self._memory_manager.has_tool(function_name):
→ forwarded to the provider's handle_tool_call()
Examples: fact_store, fact_feedback (Honcho/Hindsight specific)
Route 2: Honcho Built-in Tools (if Honcho active)
if function_name in HONCHO_TOOL_NAMES:
honcho_context → fetch cross-session user context now (on-demand)
honcho_profile → get the peer's representation document
honcho_search → semantic search across all past Honcho messages
honcho_conclude → explicitly save a conclusion about the user
Route 3: Todo Tool
if function_name == "todo":
→ in-memory TodoStore (one per session, not persisted to SQLite)
Route 4: Session Search Tool (tools/session_search_tool.py)
if function_name == "session_search":
See Section 8 for full breakdown.
Route 5: Delegate Tool (subagent spawning)
if function_name == "delegate_task":
→ spawns a child AIAgent with its own conversation loop
→ child inherits the parent's IterationBudget (shared counter)
→ child can use all tools, has its own context
→ parent waits for child to complete and returns its final answer
Route 6: All Other Tools (model_tools.py → handle_function_call)
Generic dispatcher handles:
terminal - run shell commands in configured backend
(local, Docker, SSH, Daytona, Singularity, Modal)
read_file - read file contents
write_file - write / overwrite a file
patch - apply diff-style patches
search_files - ripgrep across directory
web_search - web search via configured provider
web_extract - fetch and parse a URL
browser_* - browser automation tools
skill_view - read a skill document
skills_list - list available skills
vision_analyze - image analysis via multimodal model
clarify - ask you a question (blocks until you answer)
memory_write - write directly to MEMORY.md or USER.md
(model explicitly deciding to save something)
... and more
Tool result is appended to message history + written to SQLite. Model receives result and may call more tools or produce a final answer.
8. session_search Tool - Full Breakdown
This is how Hermes recalls past conversations WITHOUT an external provider.
8a. When It Fires
The LLM decides to call it. It is NOT automatic. The system prompt instructs: "When the user references something from a past conversation or you suspect relevant prior context exists, use session_search to recall it before asking them to repeat themselves."
So the model reads your question, reasons that past context might exist, and calls session_search with its own chosen query string.
8b. The Search (hermes_state.py → SessionDB.search_messages)
Input: query string chosen by the main LLM
e.g. "XYZ issue solution fix"
Step 1 - FTS5 query
SELECT m.id, m.session_id, m.role,
snippet(messages_fts, ...) AS snippet,
m.content, m.timestamp, s.source, s.model
FROM messages_fts
JOIN messages m ON m.id = messages_fts.rowid
JOIN sessions s ON s.id = m.session_id
WHERE messages_fts MATCH 'XYZ issue solution fix'
ORDER BY rank ← FTS5 BM25 relevance score
LIMIT 20
Query is sanitized first:
- Balanced quoted phrases preserved
- FTS5 special characters stripped
- Hyphenated terms auto-quoted ("chat-send" → preserved as phrase)
- Dangling boolean operators removed
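An illustrative sanitizer in the spirit of the steps above: preserve balanced quoted phrases, quote hyphenated terms so FTS5 treats them as phrases, strip FTS5 operator characters, and drop bare boolean operators. This is a simplification, not Hermes's exact code.

```python
import re

def sanitize_fts_query(raw: str) -> str:
    terms: list[str] = []
    # Pull out balanced quoted phrases first, before tokenizing the rest.
    for phrase in re.findall(r'"[^"]*"', raw):
        terms.append(phrase)
        raw = raw.replace(phrase, " ")
    for tok in raw.split():
        if tok.upper() in {"AND", "OR", "NOT"}:
            continue  # drop bare boolean operators
        if "-" in tok:
            terms.append(f'"{tok}"')  # quote hyphenated terms as phrases
            continue
        clean = re.sub(r"[^\w]", "", tok)  # strip FTS5 special characters
        if clean:
            terms.append(clean)
    return " ".join(terms)
```

Without a step like this, user-supplied text containing `*`, `-`, or an unmatched quote would raise an FTS5 syntax error instead of returning results.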
Step 2 - Context enrichment
For each matching message: load 1 message before + 1 message after
from the same session (surrounding context window)
Step 3 - Grouping and truncation (session_search_tool.py)
- Group all matched messages by session_id
- Take top N unique sessions (default: 3)
- For each kept session: load its full conversation from SQLite
- Truncate to ~100K characters centered on the matched messages
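The "centered on the match" truncation is worth pinning down, since naive head-truncation could cut off the very message that matched. A small sketch, with an illustrative helper name and the ~100K default from above:

```python
def center_truncate(transcript: str, match_offset: int,
                    budget: int = 100_000) -> str:
    """Keep a character window of the transcript centered on the match."""
    if len(transcript) <= budget:
        return transcript
    half = budget // 2
    # Clamp the window so it never runs past either end of the transcript.
    start = max(0, min(match_offset - half, len(transcript) - budget))
    return transcript[start:start + budget]
```

The clamping matters: a match near the start or end of a long session still gets a full-sized window rather than a half-empty one.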
Step 4 - LLM Summarization (agent/auxiliary_client.py)
The truncated session chunks are sent to a FAST cheap auxiliary model
(separate from the main model - configured as summary_model).
The auxiliary model writes a focused summary per session:
"In session from [date], the user asked about X.
They solved it by doing Y. Key details: Z."
Step 5 - Result returned to main model
Main model receives the per-session summaries as the tool output.
It then uses these summaries to answer your question.
8c. Critical Limitation
FTS5 is KEYWORD search, not semantic search. It matches word stems, not meaning.
You ask: "how did I solve XYZ"
Searched: words "solve", "XYZ"
WILL find: "I solved XYZ by..." ✓
WILL find: "solving the XYZ problem..." ✓ (stemming)
WON'T find: "I fixed XYZ by..." ✗ (fix ≠ solve)
WON'T find: "XYZ kept crashing..." ✗ (crash ≠ solve)
An external provider (Honcho etc.) stores vector embeddings and
finds semantically similar content regardless of word choice.
Without one, recall quality depends on word overlap.
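The limitation is easy to demonstrate directly. The example below uses FTS5 with the porter tokenizer, consistent with the stemming behavior shown above: "solve" matches "solved" and "solving" (shared stem), but never "fixed", because keyword search has no notion of meaning. Requires an SQLite build with FTS5.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE msgs USING fts5(content, tokenize='porter')")
db.executemany("INSERT INTO msgs VALUES (?)", [
    ("I solved XYZ by reinstalling",),
    ("solving the XYZ problem took hours",),
    ("I fixed XYZ by patching the driver",),
])
# Stemming maps solve/solved/solving to the same token, so both match...
hits = [row[0] for row in db.execute(
    "SELECT content FROM msgs WHERE msgs MATCH 'solve'"
)]
# ...but "fixed" stems to a different token and is invisible to this query.
```

A vector-embedding store would surface all three rows for "how did I solve XYZ", because the sentences are close in embedding space regardless of word choice.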
9. Post-Turn - After the Final Answer
After the model produces its final non-tool-call response:
9a. Response Persisted
- Assistant message appended to SQLite (SessionDB.append_message())
- FTS5 trigger fires → answer text is immediately searchable
- Session token counts updated (input + output + cache read/write tokens)
- Estimated cost calculated and stored
9b. External Provider Sync (if active)
_honcho_sync() fires asynchronously:
- Saves user turn as an observation to the provider
- Saves assistant turn as an observation
- Provider's Deriver pipeline processes these asynchronously
(updates peer representation, session summaries - happens in background)
9c. Memory Nudge Check
self._turns_since_memory is incremented.
If turns_since_memory >= nudge_interval (default: 10):
A background sub-AIAgent is spawned.
It reads the current session's conversation history.
It decides what (if anything) is worth persisting.
It may call memory_write tool to update MEMORY.md / USER.md.
The flush is gated by flush_min_turns (default: 6 turns minimum
before any write is attempted).
NOTE: This sub-agent creates its own _memory_manager pointing at
the same SQLite file. Known bug (#5129) - can cause duplicate
memory extractions. Fix: share the parent's instance.
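The gating above reduces to a pure predicate, sketched here with illustrative names and the defaults from the text (nudge every 10 turns, no writes before 6 total turns):

```python
def should_flush_memory(turns_since_memory: int, total_turns: int,
                        nudge_interval: int = 10,
                        flush_min_turns: int = 6) -> bool:
    """Decide whether to spawn the background memory-review sub-agent."""
    if total_turns < flush_min_turns:
        return False  # too early in the session to distill anything useful
    return turns_since_memory >= nudge_interval
```

Two independent gates keep the reviewer cheap: it can't fire on every turn, and it can't fire at all until the session has enough material to summarize.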
9d. Skill Nudge Check
self._iters_since_skill is incremented.
If >= skill_nudge_interval (default: 15):
Model is reminded it can create a new skill document from this session
if the task was complex enough to be worth reusing.
9e. Context Compression Check
ContextCompressor checks estimated token count of full message history.
If > threshold (default: 50% of model's context length):
- Old messages are summarized and truncated
- A new child session is created in SQLite
(parent_session_id = current session ID)
- System prompt is rebuilt (_cached_system_prompt = None)
- Conversation continues in the new child session transparently
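The trigger-and-summarize step can be sketched as follows. `summarize` is collapsed into a placeholder string here (the real system uses the auxiliary model), the 4-chars-per-token estimate is a common rough heuristic, and the 50% default mirrors the text above:

```python
def maybe_compress(messages: list[str], context_limit: int,
                   threshold: float = 0.5) -> tuple[list[str], bool]:
    """Collapse old messages into a summary once the token estimate is high."""
    est_tokens = sum(len(m) for m in messages) // 4  # rough 4 chars/token
    if est_tokens <= context_limit * threshold:
        return messages, False
    keep = messages[-2:]                  # keep the most recent turns verbatim
    summary = f"[summary of {len(messages) - 2} earlier messages]"
    return [summary] + keep, True         # caller opens a child session row
```

When the second return value is True, the caller would also create the child session in SQLite (with parent_session_id set) and invalidate the cached system prompt, as described above.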
10. Where Data Actually Lives (No Honcho)
~/.hermes/
├── state.db SQLite database (WAL mode)
│ ├── sessions metadata: id, source, model, title,
│ │ timestamps, token counts, cost estimates
│ ├── messages full message history for every session
│ │ role, content, tool_calls, tool_results
│ └── messages_fts FTS5 virtual table - auto-synced via triggers
│ this is what session_search queries
│
├── sessions/ raw JSONL transcripts (gateway mode)
│ ├── session_<id>.json full transcript including tool call details
│ └── sessions.json maps platform session keys → session IDs
│
├── MEMORY.md compact curated agent notes (~2200 char limit)
├── USER.md compact curated user profile (~1375 char limit)
├── SOUL.md agent personality / identity (if customized)
│
├── skills/ procedural memory (skill documents)
│ └── *.md each skill = a reusable task playbook
│ agent creates these automatically after
│ complex multi-tool tasks
│
├── logs/
│ ├── agent.log structured agent logs
│ └── errors.log WARNING+ entries, rotating (2MB × 2 backups)
│
└── config.yaml all configuration
11. Recall Quality Tiers (No Honcho)
TIER 1 - Instant, always works
Something is in MEMORY.md or USER.md.
It is always in the system prompt. The LLM sees it before you finish typing.
Fails if: the background reviewer never wrote it there.
TIER 2 - Works if words match
Model calls session_search.
FTS5 finds messages containing similar keywords.
Auxiliary LLM summarizes them.
Main LLM uses the summaries to answer.
Fails if: you used different words in the original session,
or the model doesn't decide to search.
TIER 3 - Silent failure
Words don't match AND nothing in MEMORY.md.
The data exists in state.db but FTS5 can't surface it.
The model either says it doesn't know or hallucinates.
TIER 4 (with Honcho) - Semantic recall
Honcho stores vector embeddings of every turn.
"how did I solve XYZ" finds "I fixed the XYZ crash" because
the sentences are close in embedding space regardless of word choice.
Prefetched async before every turn - always injected into user message.
What Hermes IS
- A tool-using agent with a bounded always-loaded memory snapshot
- A large searchable archive of past sessions (keyword-based by default)
- A system that retrieves past context ON DEMAND via tool calls
- A closed learning loop: it creates and improves skills from experience
- Optionally upgradeable with a semantic external memory provider
What Hermes IS NOT
- Not your full session history loaded into every prompt
- Not a passively running background process watching your computer
- Not semantically aware of past sessions by default (keyword only)
- Not automatically aware of unrelated terminal / editor / Cursor activity
- Not guaranteed to recall old issues unless:
  → written into MEMORY.md/USER.md after that session, OR
  → findable by FTS5 keyword match in past transcripts, OR
  → an external provider (Honcho etc.) adds semantic recall