I recently went down a rabbit hole digging into the Hermes agent; my goal was to get a deeper understanding of LLM context engineering and memory management in practice.
Going through each line of code manually to reconstruct a mental model is tedious, so I did what any reasonable engineer would do: I asked Claude and ChatGPT to map out the complete flow for me, from raw user input all the way to the streamed response on screen.
While reading through that flow, I discovered something worth paying attention to: Hermes uses Honcho as an external memory provider. Honcho isn't just a data store; it brings dialectic reasoning over past conversations, deeper user modeling, semantic search, derived conclusions, and per-peer/profile isolation to the table. Disable it, and you're running a meaningfully different system. Architecturally, earlier versions of Hermes were even more tightly coupled to Honcho than they are today.
How Hermes manages memory
At its core, Hermes takes a two-layer approach to persistence:
- Raw layer: all messages are stored in state.db (SQLite) and flat JSON files
- Curated layer: distilled memory lives in MEMORY.md and USER.md, capped at 2,200 characters
The curated layer is what makes it interesting. It isn't append-only: as conversations grow, older memories get evicted and replaced with newer, more relevant ones. The result is a rolling, importance-weighted summary of what matters.
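To make the eviction idea concrete, here is a hypothetical sketch of a rolling, capped memory buffer: entries carry an importance score, and low-importance entries are dropped first when the character budget (Hermes defaults to ~2,200 characters for MEMORY.md) is exceeded. The function name and scoring scheme are illustrative, not Hermes's actual implementation.

```python
def curate(entries: list[tuple[float, str]], char_limit: int = 2200) -> str:
    """Keep the most important entries that fit within char_limit."""
    kept: list[tuple[float, str]] = []
    used = 0
    # Greedily keep entries from most to least important.
    for importance, text in sorted(entries, key=lambda e: -e[0]):
        if used + len(text) + 1 <= char_limit:
            kept.append((importance, text))
            used += len(text) + 1  # +1 for the newline separator
    # Preserve the original insertion order among survivors for readability.
    order = {text: i for i, (_, text) in enumerate(entries)}
    kept.sort(key=lambda e: order[e[1]])
    return "\n".join(text for _, text in kept)
```

The key property is that the output is bounded no matter how many entries accumulate; what survives is decided by importance, not recency alone.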
The context bloat problem, and how Hermes solves it cheaply
Here's a question I had early on: Hermes uses SQLite FTS5 for lexical search over stored messages. What happens when a query returns too many matches? If you naively dump all of them into the final LLM call, you blow up the context window, and your costs along with it.
Hermes handles this with a two-model pattern. A cheap auxiliary model acts as a filter, reviewing the candidate results and deciding what actually deserves to make it into the main context. Only the filtered, relevant subset goes into the final, expensive model call. It's not free (the auxiliary call does cost tokens), but it's a smart trade-off that keeps the main call lean.
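The shape of that pattern can be sketched in a few lines. `cheap_judge` below stands in for the auxiliary LLM call, and the function name and character budget are illustrative, not Hermes's actual API:

```python
from typing import Callable

def filter_candidates(
    query: str,
    candidates: list[str],
    cheap_judge: Callable[[str, str], bool],
    max_chars: int = 4000,
) -> list[str]:
    """Return only judge-approved snippets, capped at a character budget."""
    kept: list[str] = []
    used = 0
    for snippet in candidates:
        if not cheap_judge(query, snippet):
            continue  # the auxiliary model says it's irrelevant
        if used + len(snippet) > max_chars:
            break  # keep the main call lean even if more snippets matched
        kept.append(snippet)
        used += len(snippet)
    return kept
```

The expensive main model only ever sees what survives both the relevance judgment and the hard character cap.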
The Core Truth
- Hermes is a tool-using agent with small built-in persistent memory, full transcript storage, and optional pluggable semantic memory backends.
- Hermes is not a local-first lifelong semantic memory system by default.
- Hermes does NOT load your full history into every prompt.
- It loads a small curated memory snapshot always, and retrieves past session context ON DEMAND - only when the main model decides to call a tool for it.
- Without an external memory provider (Honcho etc.):
- Past session recall = keyword search (SQLite FTS5)
- If you used different words back then, it may not find it
- The LLM must decide to search, nothing is auto-injected from past sessions
COMPLETE MENTAL MODEL
1. YOU → Send a Message
You type a message in the CLI, or send one via Telegram / Discord / Slack / WhatsApp / Signal / Matrix / webhook / API.
2. Entry Point (cli.py or gateway/)
- Restores the current session if the --continue/-c flag was passed (looks up the most recent session from SQLite by source platform)
- Creates a new AIAgent instance (run_agent.py → AIAgent.init()); in gateway mode, one agent is created per incoming message
- Passes your message into agent.run_conversation(user_message)
One AIAgent is created per session. In gateway mode a fresh AIAgent is created for each incoming platform message, but it reloads the saved conversation history so continuity is preserved.
3. AIAgent Initialization (run_agent.py → init)
Runs once per agent lifetime. Loads everything that is always present.
3a. Client Setup
- Resolves LLM provider credentials (OpenRouter, Anthropic native, OpenAI, Kimi, MiniMax, GitHub Copilot, etc.)
- Selects API mode: chat_completions | anthropic_messages | codex_responses
- Enables Anthropic prompt caching if provider is Claude (caches stable system prompt prefix → ~75% input cost reduction on repeat turns)
3b. Tool Loading (model_tools.py → get_tool_definitions)
- Discovers all available tools filtered by enabled_toolsets / disabled_toolsets
- Loads: terminal, file ops, web search, browser, memory, session_search, skill tools, todo, delegate (subagent), clarify, and more (~40+ tools)
3c. Memory Loading (tools/memory_tool.py → MemoryStore)
Loaded from disk at startup. Always injected into every prompt.
~/.hermes/MEMORY.md - agent's compact working notes
written by the agent itself, curated over time
has a character limit (default ~2200 chars)
~/.hermes/USER.md - your compact profile and preferences
also agent-curated, also character-limited (~1375 chars)
These are NOT your full history. They are a small bounded cache of what the background memory reviewer decided was worth keeping. If the agent never wrote something here, it is not here.
3d. Session Database (hermes_state.py → SessionDB)
- Opens ~/.hermes/state.db (SQLite, WAL mode for concurrent access)
- Creates a new row in the sessions table for this session
- FTS5 virtual table (messages_fts) is already live and auto-indexed via INSERT/UPDATE/DELETE triggers on the messages table
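The trigger-synced FTS5 pattern can be reproduced in a self-contained example: writes to `messages` automatically keep the `messages_fts` index current, so anything persisted is immediately searchable. The schema is simplified from the real one, and this requires an SQLite build with the FTS5 extension (the default in most CPython distributions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, content TEXT);
CREATE VIRTUAL TABLE messages_fts USING fts5(
    content, content='messages', content_rowid='id'
);
-- Triggers keep the index in lockstep with the base table.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER messages_au AFTER UPDATE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
""")
db.execute("INSERT INTO messages(role, content) VALUES (?, ?)",
           ("user", "I solved the XYZ issue by reinstalling"))
hits = db.execute(
    "SELECT rowid FROM messages_fts WHERE messages_fts MATCH 'XYZ'"
).fetchall()
```

Because the index is maintained by triggers, no application code ever has to remember to update it; the INSERT alone makes the message findable.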
3e. External Memory Provider (optional, not default)
Only if configured in ~/.hermes/config.yaml. Hermes supports a pluggable MemoryProvider interface. Currently shipped providers:
- Honcho (cloud-backed dialectic user modeling, semantic recall)
- Hindsight (self-hosted alternative with local PostgreSQL)
- Custom (drop a plugin into ~/.hermes/plugins/)
If an external provider is active:
- _activate_honcho() registers extra tools into the tool list: honcho_context, honcho_profile, honcho_search, honcho_conclude
- An atexit hook is registered so _honcho_sync() fires even on crash
3f. Context Compressor (agent/context_compressor.py)
- Initialized but not yet active
- Watches token usage across turns
- When conversation approaches the model's context limit (default threshold: 50%), it auto-summarizes old messages and truncates history
- After compression, a new child session is created in SQLite with parent_session_id pointing to the previous one (lineage chain)
3g. Iteration Budget (IterationBudget)
- A shared counter capping total LLM calls across the main agent AND any subagents it spawns (default: 90 iterations)
- Prevents infinite tool loops
- At 70% consumed → injects a "start wrapping up" nudge into tool results
- At 90% consumed → injects an urgent "respond now" nudge
- Budget pressure is injected into tool result JSON, not as extra messages (avoids breaking message structure or invalidating prompt cache)
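A hedged sketch of that shared budget: one counter consumed by the parent agent and any subagents, with soft nudges attached to tool-result payloads (not emitted as extra messages) at the 70% and 90% marks. Class and method names are illustrative; the thresholds mirror the defaults described above.

```python
class IterationBudget:
    def __init__(self, limit: int = 90):
        self.limit = limit
        self.used = 0

    def consume(self) -> None:
        self.used += 1  # called once per LLM call, by parent or subagent

    def annotate(self, tool_result: dict) -> dict:
        """Attach budget pressure to a tool result without adding messages."""
        frac = self.used / self.limit
        if frac >= 0.9:
            tool_result["budget_note"] = "URGENT: respond to the user now."
        elif frac >= 0.7:
            tool_result["budget_note"] = "Start wrapping up this task."
        return tool_result
```

Injecting the nudge into the tool-result JSON keeps the message list's structure untouched, which is what preserves the prompt-cache prefix.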
4. run_conversation() - Before the First API Call
4a. Session History Loaded
- If resuming a session: loads prior messages from SQLite via SessionDB.get_messages_as_conversation()
- These messages go directly into the API-facing message list
4b. External Provider Prefetch (if active) - ASYNC
Fires immediately in a background thread, BEFORE the LLM call:
_honcho_prefetch() → calls the provider's context API
→ fetches recent cross-session summaries + the peer representation
→ stores the result in a Future
This runs in parallel with prompt assembly so its latency is hidden. Result is injected into the user message turn just before the API call (NOT into the system prompt - keeps the cached prefix stable).
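The latency-hiding trick is a standard Future pattern, sketched below with a placeholder provider call. `fetch_provider_context` and `run_turn` are stand-ins, not Hermes's actual function names:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_provider_context(user_msg: str) -> str:
    # Placeholder for the provider's context API call (real network latency).
    return f"[prior-session context relevant to: {user_msg}]"

executor = ThreadPoolExecutor(max_workers=1)

def run_turn(user_msg: str) -> str:
    future = executor.submit(fetch_provider_context, user_msg)  # fires first
    prompt = f"USER: {user_msg}"          # prompt assembly overlaps the fetch
    context = future.result()             # block only when actually needed
    # Injected into the user turn, not the system prompt, so the cached
    # system-prompt prefix stays byte-identical across turns.
    return f"{prompt}\n[system note] {context}"
```

The fetch and the prompt assembly run concurrently, so on most turns the `future.result()` call returns without waiting.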
4c. System Prompt Assembly (_build_system_prompt)
Built once per session, then cached. Rebuilt only after context compression.
Assembled in order:
┌─────────────────────────────────────────────────────┐
│ Agent identity / personality (DEFAULT_AGENT_IDENTITY│
│ or contents of ~/.hermes/SOUL.md if present) │
│ │
│ Platform hints (if platform = telegram/discord/etc) │
│ │
│ Memory guidance (instruction: how to use memory) │
│ Session search guidance (instruction: when to search│
│ Skills guidance (instruction: how to use skills) │
│ │
│ Skills content (loaded from ~/.hermes/skills/) │
│ → procedural memory: reusable task playbooks │
│ │
│ Context files (if not skip_context_files): │
│ SOUL.md, AGENTS.md, .cursorrules (if present) │
│ │
│ MEMORY.md content (always, if memory_enabled) │
│ USER.md content (always, if user_profile_enabled) │
│ │
│ Ephemeral system prompt (if set : NOT saved to │
│ trajectories - used for one-off injections) │
└─────────────────────────────────────────────────────┘
4d. User Message Preparation
- Your message is appended to the in-memory message list
- Simultaneously written to SQLite (SessionDB.append_message()) → FTS5 trigger fires → message is indexed immediately
- If external provider prefetch completed:
  → its context is appended to your message content as a system note
  → "The following Honcho memory was retrieved from prior sessions..."
  → tagged clearly so the LLM knows this is continuity context, not your words
4e. Prompt Caching Applied (if Claude)
- agent/prompt_caching.py applies cache_control breakpoints to the message list
- Strategy: system_and_3 (up to 4 cache breakpoints)
- Stable prefix = system prompt → cached at ~1.25x write cost, read at ~0.1x
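For the system-prompt breakpoint specifically, the marker looks roughly like this: an Anthropic-style system block carrying a `cache_control` entry, so repeat turns read the stable prefix from cache instead of re-paying full input price. The full `system_and_3` strategy additionally marks up to three stable message prefixes; this sketch shows only the system block, and the helper name is illustrative.

```python
def apply_system_cache(system_prompt: str) -> list[dict]:
    """Return an Anthropic-style system block with a cache breakpoint."""
    return [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }]
```

Everything before the breakpoint must be byte-identical across turns for the cache to hit, which is exactly why provider context is injected into the user turn rather than here.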
5. Main LLM API Call
The assembled prompt is sent:
- System prompt (stable, cached)
- Prefill messages (if any - few-shot examples)
- Full conversation history for this session
- Your message (with optional provider context appended)
- Tool definitions (all loaded tools)
- Reasoning config (effort level: low/medium/high)
- Max tokens limit
Streaming is on by default. Response comes back token by token.
6. Main Model Decides: Answer or Use Tools
Path A - Model answers directly
→ Jump to Section 9 (Post-Turn)
Path B - Model calls one or more tools
→ Continue to Section 7
7. Tool Dispatch Loop
Hermes runs a loop: call tools → feed results back → model decides again. Capped by the iteration budget (default 90 LLM calls total).
7a. Parallelism Decision (_should_parallelize_tool_batch)
When the model emits a batch of tool calls:
- Single tool → always sequential
- Multiple tools:
  - Any tool in the NEVER_PARALLEL set (e.g. "clarify") → sequential
  - Any unknown/stateful tool → sequential
  - All tools in the PARALLEL_SAFE set (read_file, web_search, etc.) AND no overlapping file paths → run concurrently via ThreadPoolExecutor (max 8 worker threads)
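That decision tree reduces to a small predicate. The set contents and helper name below are illustrative, not copied from Hermes:

```python
NEVER_PARALLEL = {"clarify"}
PARALLEL_SAFE = {"read_file", "web_search", "web_extract", "search_files"}

def should_parallelize(calls: list[dict]) -> bool:
    if len(calls) < 2:
        return False  # a single tool always runs sequentially
    names = [c["name"] for c in calls]
    if any(n in NEVER_PARALLEL or n not in PARALLEL_SAFE for n in names):
        return False  # unknown or stateful tools force sequential execution
    paths = [c.get("path") for c in calls if c.get("path")]
    return len(paths) == len(set(paths))  # no overlapping file paths
```

Note the conservative default: anything not explicitly whitelisted falls back to sequential execution, which is the safe choice for stateful tools like the terminal.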
7b. Tool Routing (in order of precedence)
Route 1: Memory Manager Tools (if external provider active)
if self._memory_manager and self._memory_manager.has_tool(function_name):
→ forwarded to the provider's handle_tool_call()
Examples: fact_store, fact_feedback (Honcho/Hindsight specific)
Route 2: Honcho Built-in Tools (if Honcho active)
if function_name in HONCHO_TOOL_NAMES:
honcho_context → fetch cross-session user context now (on-demand)
honcho_profile → get the peer's representation document
honcho_search → semantic search across all past Honcho messages
honcho_conclude → explicitly save a conclusion about the user
Route 3: Todo Tool
if function_name == "todo":
→ in-memory TodoStore (one per session, not persisted to SQLite)
Route 4: Session Search Tool (tools/session_search_tool.py)
if function_name == "session_search":
See Section 8 for full breakdown.
Route 5: Delegate Tool (subagent spawning)
if function_name == "delegate_task":
→ spawns a child AIAgent with its own conversation loop
→ child inherits the parent's IterationBudget (shared counter)
→ child can use all tools, has its own context
→ parent waits for child to complete and returns its final answer
Route 6: All Other Tools (model_tools.py → handle_function_call)
Generic dispatcher handles:
terminal - run shell commands in configured backend
(local, Docker, SSH, Daytona, Singularity, Modal)
read_file - read file contents
write_file - write / overwrite a file
patch - apply diff-style patches
search_files - ripgrep across directory
web_search - web search via configured provider
web_extract - fetch and parse a URL
browser_* - browser automation tools
skill_view - read a skill document
skills_list - list available skills
vision_analyze - image analysis via multimodal model
clarify - ask you a question (blocks until you answer)
memory_write - write directly to MEMORY.md or USER.md
(model explicitly deciding to save something)
... and more
Tool result is appended to message history + written to SQLite. Model receives result and may call more tools or produce a final answer.
8. session_search Tool - Full Breakdown
This is how Hermes recalls past conversations WITHOUT an external provider.
8a. When It Fires
The LLM decides to call it. It is NOT automatic. The system prompt instructs: "When the user references something from a past conversation or you suspect relevant prior context exists, use session_search to recall it before asking them to repeat themselves."
So the model reads your question, reasons that past context might exist, and calls session_search with its own chosen query string.
8b. The Search (hermes_state.py → SessionDB.search_messages)
Input: query string chosen by the main LLM
e.g. "XYZ issue solution fix"
Step 1 - FTS5 query
SELECT m.id, m.session_id, m.role,
snippet(messages_fts, ...) AS snippet,
m.content, m.timestamp, s.source, s.model
FROM messages_fts
JOIN messages m ON m.id = messages_fts.rowid
JOIN sessions s ON s.id = m.session_id
WHERE messages_fts MATCH 'XYZ issue solution fix'
ORDER BY rank ← FTS5 BM25 relevance score
LIMIT 20
Query is sanitized first:
- Balanced quoted phrases preserved
- FTS5 special characters stripped
- Hyphenated terms auto-quoted ("chat-send" → preserved as phrase)
- Dangling boolean operators removed
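An illustrative sanitizer in the spirit of the steps above: preserve balanced quoted phrases, quote hyphenated terms so FTS5 treats them as phrases, strip FTS5 operator characters, and drop bare boolean operators. This is a simplification, not Hermes's exact code.

```python
import re

def sanitize_fts_query(raw: str) -> str:
    terms: list[str] = []
    # Pull out balanced quoted phrases first, before tokenizing the rest.
    for phrase in re.findall(r'"[^"]*"', raw):
        terms.append(phrase)
        raw = raw.replace(phrase, " ")
    for tok in raw.split():
        if tok.upper() in {"AND", "OR", "NOT"}:
            continue  # drop bare boolean operators
        if "-" in tok:
            terms.append(f'"{tok}"')  # quote hyphenated terms as phrases
            continue
        clean = re.sub(r"[^\w]", "", tok)  # strip FTS5 special characters
        if clean:
            terms.append(clean)
    return " ".join(terms)
```

Without a step like this, user-supplied text containing `*`, `-`, or an unmatched quote would raise an FTS5 syntax error instead of returning results.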
Step 2 - Context enrichment
For each matching message: load 1 message before + 1 message after
from the same session (surrounding context window)
Step 3 - Grouping and truncation (session_search_tool.py)
- Group all matched messages by session_id
- Take top N unique sessions (default: 3)
- For each kept session: load its full conversation from SQLite
- Truncate to ~100K characters centered on the matched messages
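The "centered on the match" truncation is worth pinning down, since naive head-truncation could cut off the very message that matched. A small sketch, with an illustrative helper name and the ~100K default from above:

```python
def center_truncate(transcript: str, match_offset: int,
                    budget: int = 100_000) -> str:
    """Keep a character window of the transcript centered on the match."""
    if len(transcript) <= budget:
        return transcript
    half = budget // 2
    # Clamp the window so it never runs past either end of the transcript.
    start = max(0, min(match_offset - half, len(transcript) - budget))
    return transcript[start:start + budget]
```

The clamping matters: a match near the start or end of a long session still gets a full-sized window rather than a half-empty one.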
Step 4 - LLM Summarization (agent/auxiliary_client.py)
The truncated session chunks are sent to a FAST cheap auxiliary model
(separate from the main model - configured as summary_model).
The auxiliary model writes a focused summary per session:
"In session from [date], the user asked about X.
They solved it by doing Y. Key details: Z."
Step 5 - Result returned to main model
Main model receives the per-session summaries as the tool output.
It then uses these summaries to answer your question.
8c. Critical Limitation
FTS5 is KEYWORD search, not semantic search. It matches word stems, not meaning.
You ask: "how did I solve XYZ"
Searched: words "solve", "XYZ"
WILL find: "I solved XYZ by..." ✓
WILL find: "solving the XYZ problem..." ✓ (stemming)
WON'T find: "I fixed XYZ by..." ✗ (fix ≠ solve)
WON'T find: "XYZ kept crashing..." ✗ (crash ≠ solve)
An external provider (Honcho etc.) stores vector embeddings and
finds semantically similar content regardless of word choice.
Without one, recall quality depends on word overlap.
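The limitation is easy to demonstrate directly. The example below uses FTS5 with the porter tokenizer, consistent with the stemming behavior shown above: "solve" matches "solved" and "solving" (shared stem), but never "fixed", because keyword search has no notion of meaning. Requires an SQLite build with FTS5.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE msgs USING fts5(content, tokenize='porter')")
db.executemany("INSERT INTO msgs VALUES (?)", [
    ("I solved XYZ by reinstalling",),
    ("solving the XYZ problem took hours",),
    ("I fixed XYZ by patching the driver",),
])
# Stemming maps solve/solved/solving to the same token, so both match...
hits = [row[0] for row in db.execute(
    "SELECT content FROM msgs WHERE msgs MATCH 'solve'"
)]
# ...but "fixed" stems to a different token and is invisible to this query.
```

A vector-embedding store would surface all three rows for "how did I solve XYZ", because the sentences are close in embedding space regardless of word choice.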
9. Post-Turn - After the Final Answer
After the model produces its final non-tool-call response:
9a. Response Persisted
- Assistant message appended to SQLite (SessionDB.append_message())
- FTS5 trigger fires → answer text is immediately searchable
- Session token counts updated (input + output + cache read/write tokens)
- Estimated cost calculated and stored
9b. External Provider Sync (if active)
_honcho_sync() fires asynchronously:
- Saves user turn as an observation to the provider
- Saves assistant turn as an observation
- Provider's Deriver pipeline processes these asynchronously
(updates peer representation, session summaries - happens in background)
9c. Memory Nudge Check
self._turns_since_memory is incremented.
If turns_since_memory >= nudge_interval (default: 10):
A background sub-AIAgent is spawned.
It reads the current session's conversation history.
It decides what (if anything) is worth persisting.
It may call memory_write tool to update MEMORY.md / USER.md.
The flush is gated by flush_min_turns (default: 6 turns minimum
before any write is attempted).
NOTE: This sub-agent creates its own _memory_manager pointing at
the same SQLite file. Known bug (#5129) - can cause duplicate
memory extractions. Fix: share the parent's instance.
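The gating above reduces to a pure predicate, sketched here with illustrative names and the defaults from the text (nudge every 10 turns, no writes before 6 total turns):

```python
def should_flush_memory(turns_since_memory: int, total_turns: int,
                        nudge_interval: int = 10,
                        flush_min_turns: int = 6) -> bool:
    """Decide whether to spawn the background memory-review sub-agent."""
    if total_turns < flush_min_turns:
        return False  # too early in the session to distill anything useful
    return turns_since_memory >= nudge_interval
```

Two independent gates keep the reviewer cheap: it can't fire on every turn, and it can't fire at all until the session has enough material to summarize.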
9d. Skill Nudge Check
self._iters_since_skill is incremented.
If >= skill_nudge_interval (default: 15):
Model is reminded it can create a new skill document from this session
if the task was complex enough to be worth reusing.
9e. Context Compression Check
ContextCompressor checks estimated token count of full message history.
If > threshold (default: 50% of model's context length):
- Old messages are summarized and truncated
- A new child session is created in SQLite
(parent_session_id = current session ID)
- System prompt is rebuilt (_cached_system_prompt = None)
- Conversation continues in the new child session transparently
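The trigger-and-summarize step can be sketched as follows. `summarize` is collapsed into a placeholder string here (the real system uses the auxiliary model), the 4-chars-per-token estimate is a common rough heuristic, and the 50% default mirrors the text above:

```python
def maybe_compress(messages: list[str], context_limit: int,
                   threshold: float = 0.5) -> tuple[list[str], bool]:
    """Collapse old messages into a summary once the token estimate is high."""
    est_tokens = sum(len(m) for m in messages) // 4  # rough 4 chars/token
    if est_tokens <= context_limit * threshold:
        return messages, False
    keep = messages[-2:]                  # keep the most recent turns verbatim
    summary = f"[summary of {len(messages) - 2} earlier messages]"
    return [summary] + keep, True         # caller opens a child session row
```

When the second return value is True, the caller would also create the child session in SQLite (with parent_session_id set) and invalidate the cached system prompt, as described above.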
10. Where Data Actually Lives (No Honcho)
~/.hermes/
├── state.db SQLite database (WAL mode)
│ ├── sessions metadata: id, source, model, title,
│ │ timestamps, token counts, cost estimates
│ ├── messages full message history for every session
│ │ role, content, tool_calls, tool_results
│ └── messages_fts FTS5 virtual table - auto-synced via triggers
│ this is what session_search queries
│
├── sessions/ raw JSONL transcripts (gateway mode)
│ ├── session_<id>.json full transcript including tool call details
│ └── sessions.json maps platform session keys → session IDs
│
├── MEMORY.md compact curated agent notes (~2200 char limit)
├── USER.md compact curated user profile (~1375 char limit)
├── SOUL.md agent personality / identity (if customized)
│
├── skills/ procedural memory (skill documents)
│ └── *.md each skill = a reusable task playbook
│ agent creates these automatically after
│ complex multi-tool tasks
│
├── logs/
│ ├── agent.log structured agent logs
│ └── errors.log WARNING+ entries, rotating (2MB × 2 backups)
│
└── config.yaml all configuration
11. Recall Quality Tiers (No Honcho)
TIER 1 - Instant, always works
Something is in MEMORY.md or USER.md.
It is always in the system prompt. The LLM sees it before you finish typing.
Fails if: the background reviewer never wrote it there.
TIER 2 - Works if words match
Model calls session_search.
FTS5 finds messages containing similar keywords.
Auxiliary LLM summarizes them.
Main LLM uses the summaries to answer.
Fails if: you used different words in the original session,
or the model doesn't decide to search.
TIER 3 - Silent failure
Words don't match AND nothing in MEMORY.md.
The data exists in state.db but FTS5 can't surface it.
The model either says it doesn't know or hallucinates.
TIER 4 (with Honcho) - Semantic recall
Honcho stores vector embeddings of every turn.
"how did I solve XYZ" finds "I fixed the XYZ crash" because
the sentences are close in embedding space regardless of word choice.
Prefetched async before every turn - always injected into user message.
What Hermes IS
- A tool-using agent with a bounded always-loaded memory snapshot
- A large searchable archive of past sessions (keyword-based by default)
- A system that retrieves past context ON DEMAND via tool calls
- A closed learning loop: it creates and improves skills from experience
- Optionally upgradeable with a semantic external memory provider
What Hermes IS NOT
- Not your full session history loaded into every prompt
- Not a passively running background process watching your computer
- Not semantically aware of past sessions by default (keyword only)
- Not automatically aware of unrelated terminal / editor / Cursor activity
- Not guaranteed to recall old issues unless:
  → written into MEMORY.md/USER.md after that session, OR
  → findable by FTS5 keyword match in past transcripts, OR
  → an external provider (Honcho etc.) adds semantic recall