Building a Multi-Agent Fleet Router in Pure Python: 8 Profiles, 26 Tests, Zero Dependencies

I run 8 AI agent profiles on my Mac. Each has its own toolset, model, and skill set — Sonya handles general work, Cody writes code, Fred does marketing, Tilly runs Telegram ops. When a prompt arrives, something has to decide which agent handles it. The obvious answer is "ask an LLM to classify it." That's also the expensive, slow, fragile answer.

This is the routing layer that replaced it — and four bugs that almost shipped.

The Architecture

The system has three layers:

inbound prompt
    │
    ▼
┌─────────────────┐
│   route.py      │  ← pure Python, stdlib only
│   (classifier)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  profile spin   │  ← hermes --profile cody
│  (dispatch)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  isolated agent │  ← own SOUL.md, config, skills, memory
│  execution      │
└─────────────────┘

Each profile lives under ~/.hermes/profiles/<name>/ with SOUL.md, config.yaml, skills/, cron/, and memories/. They're real agent instances — isolated context, own tool schemas, own memory. The router is a single file that runs before any LLM call. $0 per dispatch.
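The profile layout above can be scaffolded with a few lines of stdlib code. This is a minimal sketch, not the actual scaffold_fleet.py: the function name `scaffold_profile`, the empty placeholder files, and the use of a temp directory are all assumptions made for illustration. Only the file and directory names come from the layout described above.

```python
import tempfile
from pathlib import Path

# Names taken from the layout described in the article.
PROFILE_FILES = ("SOUL.md", "config.yaml")
PROFILE_DIRS = ("skills", "cron", "memories")


def scaffold_profile(base: Path, name: str) -> Path:
    """Create base/<name>/ with the expected files and directories.

    Hypothetical helper. Safe to run repeatedly: existing files are
    never overwritten, so a second run changes nothing.
    """
    profile = base / name
    for dirname in PROFILE_DIRS:
        (profile / dirname).mkdir(parents=True, exist_ok=True)
    for filename in PROFILE_FILES:
        path = profile / filename
        if not path.exists():
            path.write_text("")  # placeholder; real content is per-profile
    return profile


if __name__ == "__main__":
    # A temp directory keeps the sketch away from a real ~/.hermes tree.
    with tempfile.TemporaryDirectory() as tmp:
        root = scaffold_profile(Path(tmp), "cody")
        scaffold_profile(Path(tmp), "cody")  # second run is a no-op
        print(sorted(p.name for p in root.iterdir()))
```

Checking `path.exists()` before writing is what makes the scaffolder idempotent: a hand-edited SOUL.md survives a re-run.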

The Router

import re
from collections import defaultdict

# Keyword weights per agent. Higher = stronger signal.
KEYWORDS = {
    "cody": {
        "code": 5, "debug": 5, "repo": 4, "bug": 4, "build": 3,
        "auth": 3, "schema": 3, "traceback": 3, "flow": 2,
        "deploy": 2, "refactor": 4, "merge": 2,
    },
    "fred": {
        "launch": 4, "product": 4, "marketing": 5, "sales": 5,
        "offer": 4, "hormozi": 5, "funnel": 4, "ad": 3,
        "copy": 3, "conversion": 4, "review": 3, "reputation": 4,
    },
    "tilly": {
        "telegram": 5, "bot": 4, "lead": 3, "handler": 3,
        "inbound": 3, "message": 2, "gateway": 4,
    },
    # ... 5 more agents
}

CODE_SIGNALS = {
    "stack trace", "traceback", "syntax error", "null pointer",
    "segfault", "compile", "type error", "import error",
}

def route(prompt: str, threshold: float = 0.10) -> str:
    """Classify a prompt to the best-fit agent profile."""
    prompt_lower = prompt.lower()

    # Phase 1: keyword scoring
    scores = defaultdict(float)
    max_possible = defaultdict(float)

    for agent, kws in KEYWORDS.items():
        for kw, weight in kws.items():
            max_possible[agent] += weight
            if re.search(r'\b' + re.escape(kw) + r'\b', prompt_lower):
                scores[agent] += weight

    # Normalize to 0-1
    normalized = {
        agent: scores[agent] / max_possible[agent]
        for agent in KEYWORDS
        if max_possible[agent] > 0 and scores[agent] > 0
    }

    if not normalized:
        # Phase 2: code-signal fallback
        if any(sig in prompt_lower for sig in CODE_SIGNALS):
            return "cody"
        return "sonya"  # general fallback

    # Phase 3: threshold + code-signal boost
    top_agent = max(normalized, key=normalized.get)
    top_score = normalized[top_agent]

    if top_score < threshold:
        # Low confidence — check for code signals before defaulting
        if any(sig in prompt_lower for sig in CODE_SIGNALS):
            return "cody"
        return "sonya"

    return top_agent

The whole thing is 60 lines. No embeddings, no API calls, no latency beyond regex matching. It runs in microseconds.
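One detail in the router worth isolating is the word-boundary match. The sketch below compares it with a plain substring test. The helper names `keyword_hit` and `substring_hit` are made up for this illustration, and the pattern is the same `\b` + `re.escape` construction used in route().

```python
import re


def keyword_hit(keyword: str, text: str) -> bool:
    """Whole-word match, built the same way as the router's pattern."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text.lower()) is not None


def substring_hit(keyword: str, text: str) -> bool:
    """Naive alternative: fires on any occurrence, even inside a word."""
    return keyword in text.lower()


if __name__ == "__main__":
    print(keyword_hit("ad", "write an ad for the launch"))  # True
    print(keyword_hit("ad", "load the dataset"))            # False
    print(substring_hit("ad", "load the dataset"))          # True
    print(keyword_hit("debug", "debugging the repo"))       # False
```

The last line shows the trade-off: whole-word matching keeps short keywords like "ad" from firing inside "load", but it also misses inflected forms such as "debugging" unless they are listed as keywords of their own.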

Bug #1: The Relative Path Trap

While building a resource-monitoring tool called molt — which wraps any command N times and measures whether it's silently accumulating file growth or memory bloat — the first run reported +618 bytes per iteration on a command that should produce zero growth. The .molt/ directory was polluting its own snapshot.

The fix looked obvious: skip .molt/ during the os.walk:

if Path(root) == MOLT_DIR or MOLT_DIR in Path(root).parents:
    continue

Still +618 bytes. The continue inside an os.walk loop skips file processing for the current directory but doesn't prune the walk itself: os.walk still descends into that directory's children on later iterations. And worse, the check never fired in the first place. MOLT_DIR was defined as Path('.molt') (relative), while root takes the form of the path handed to os.walk, which here was absolute. A relative Path never equals an absolute one and never appears in its parents.

Two fixes: mutate dirs[:] in-place to prune the walk, and compare by directory name instead of full path:

# Prune .molt from the walk
dirs[:] = [d for d in dirs if d != '.molt']

After this, fs_bytes=0 for stable commands. The diagnostic was real.

Lesson: os.walk pruning requires mutating the dirs list in-place, not continue-ing past the directory. And path comparison only works when both sides are in the same form (both absolute or both relative). These are stdlib gotchas that bite silently: no exception, just wrong numbers.
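The pruning pattern can be checked end to end in a throwaway directory. This is a sketch under assumptions: `tree_size` is a hypothetical stand-in for molt's snapshot step, which the article does not show, and the file names are invented.

```python
import os
import tempfile
from pathlib import Path


def tree_size(top, skip_dir_name: str = ".molt") -> int:
    """Sum file sizes under top, never descending into skip_dir_name.

    Hypothetical stand-in for the snapshot step. Relies on os.walk's
    default topdown=True, where editing dirs in place prunes the walk.
    """
    total = 0
    for root, dirs, files in os.walk(top):
        # Slice assignment mutates the list os.walk holds a reference to.
        # Rebinding (dirs = [...]) would have no effect on the walk.
        dirs[:] = [d for d in dirs if d != skip_dir_name]
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        top = Path(tmp)
        (top / "data.txt").write_text("hello")                  # 5 bytes
        (top / ".molt").mkdir()
        (top / ".molt" / "snapshot.json").write_text("x" * 100)  # pruned
        print(tree_size(top))  # counts data.txt only
```

Comparing by name prunes every directory called .molt at any depth, which sidesteps the absolute-versus-relative mismatch entirely.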

Bug #2: Thinking Models Eat the Reply

The idea command — a one-shot abstract idea generator piped through Ollama Cloud — returned empty on its first test. The model was qwen3.5:397b-cloud, a thinking model. The output came back with content: "" and thinking: "<1600 words of chain-of-thought>".

The API returns two fields: .content (the answer) and .thinking (the reasoning). In this run, with thinking mode on, the model put everything in .thinking and left .content empty. My jq extraction pulled .content first.

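Two options: parse .thinking when .content is empty, or disable thinking. Disabling is cleaner: with thinking off, the model skips the chain-of-thought and delivers the actual artifact in .content. In Ollama's REST API the switch is think: false, a top-level request field next to model rather than a member of options. The fix was one API parameter.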

Lesson: A thinking model can spend its whole reply on reasoning and leave the output field empty. If your client reads only .content, you get silence. Always handle both fields, or explicitly disable thinking for structured-output tasks.
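Handling both fields can be sketched in a few lines. The assumptions here: the reply has already been parsed into a dict with "content" and "thinking" keys, matching the two fields described above, and `extract_reply` is a hypothetical helper name. Where those keys sit in the full API response is not covered, so adapt the lookup to the response shape you actually receive.

```python
from typing import Mapping, Tuple


def extract_reply(message: Mapping[str, object]) -> Tuple[str, str]:
    """Return (text, source_field) from a parsed reply.

    Hypothetical helper. Prefers "content"; falls back to "thinking"
    when content is missing, None, or whitespace, and reports which
    field was used so the caller can decide whether to trust it.
    """
    content = str(message.get("content") or "").strip()
    if content:
        return content, "content"
    thinking = str(message.get("thinking") or "").strip()
    if thinking:
        return thinking, "thinking"
    raise ValueError("reply has neither content nor thinking text")


if __name__ == "__main__":
    reply = {"content": "", "thinking": "chain-of-thought text"}
    text, source = extract_reply(reply)
    print(source)  # thinking
```

Returning the source field alongside the text makes the fallback visible: a caller that needs a clean artifact can reject anything that came from "thinking" instead of silently printing raw reasoning.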

Bug #3: The Routing Threshold Was Wrong for Code

The router's first test run was one case short of 16/16 on classification: "fix the auth flow" routed to Sonya instead of Cody. The word "auth" matched Cody with weight 3, but Cody's max possible score across all keywords was 94. So 3/94 = 0.032, well below the 0.10 confidence threshold. The router fell through to Sonya (general fallback).

The threshold was calibrated for agents with 5-10 keywords. Cody had 12, which diluted every individual match. Two fixes:

1. Add a code-signal fallback that fires before the router defaults to Sonya: if the prompt contains explicit programming signals ("traceback", "syntax error", "stack trace") and no specialist clears the threshold, route to Cody.
2. For multi-keyword agents, let strong individual signals (weight ≥ 4) bypass the normalized threshold.
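The strong-signal bypass can be sketched as a small confidence check. This is an illustration under assumptions, not the code in route.py: the function name `is_confident` and its parameters are invented, and only the 0.10 threshold, the weight ≥ 4 cutoff, and the 3/94 example come from the text above.

```python
from typing import Sequence


def is_confident(
    matched_weights: Sequence[float],
    max_possible: float,
    threshold: float = 0.10,
    strong_weight: float = 4,
) -> bool:
    """Decide whether one agent's keyword matches are enough to route on.

    Hypothetical helper. A single strong keyword passes outright;
    otherwise the normalized score must reach the threshold.
    """
    if not matched_weights or max_possible <= 0:
        return False
    if max(matched_weights) >= strong_weight:
        return True  # strong signal bypasses normalization
    return sum(matched_weights) / max_possible >= threshold


if __name__ == "__main__":
    print(is_confident([3], 94))           # 3/94 is about 0.032 -> False
    print(is_confident([5], 94))           # one weight-5 keyword -> True
    print(is_confident([3, 3, 2, 2], 94))  # 10/94 is about 0.106 -> True
```

Note that a lone weight-3 match still fails this check, so a bypass at weight ≥ 4 does not by itself rescue a prompt whose only hit is a weight-3 keyword.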

After the fix, 26/26 routing tests pass, including edge cases like "client wants the new auth flow", which routes to Alistair, not Cody: "client" is Alistair's signal, and the code-signal fallback only fires when no specialist matches at all.

Lesson: Normalized scoring punishes agents with broad keyword coverage. A single match for a specialist with 15 keywords will rarely clear a threshold tuned for a specialist with 5. Either use raw scores with per-agent thresholds, or add domain-specific fallbacks that bypass normalization entirely.
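The first alternative, raw scores with per-agent thresholds, might look like the sketch below. Everything here is hypothetical: the threshold values in `MIN_RAW_SCORE` are invented for illustration, and `pick_agent` is not a function from route.py. Only the agent names and the Sonya fallback come from the article.

```python
from typing import Dict, Mapping

# Hypothetical per-agent minimum raw scores; real values need tuning.
MIN_RAW_SCORE: Dict[str, float] = {"cody": 3, "fred": 4, "tilly": 4}


def pick_agent(
    raw_scores: Mapping[str, float],
    min_raw: Mapping[str, float] = MIN_RAW_SCORE,
    default: str = "sonya",
) -> str:
    """Pick the highest raw score among agents that meet their own bar.

    Because no score is divided by the size of the keyword table,
    adding keywords to an agent never dilutes an existing match.
    """
    eligible = {
        agent: score
        for agent, score in raw_scores.items()
        if agent in min_raw and score >= min_raw[agent]
    }
    if not eligible:
        return default
    return max(eligible, key=eligible.get)


if __name__ == "__main__":
    print(pick_agent({"cody": 3}))             # cody
    print(pick_agent({"fred": 3}))             # sonya (below fred's bar)
    print(pick_agent({"cody": 3, "fred": 9}))  # fred
```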

Bug #4: Symlinks and Cron Don't Mix

The nightly profile updater needed a cron job. I created a symlink pointing from the cron script path to the actual Python script. Cron rejected it silently — the job registered but never executed.

Hermes cron validates the script path and refuses symlinks. (Stock system cron generally does not: it hands the command line to a shell, which follows symlinks as usual.) The fix: write a real wrapper shell script that calls python3 build_profile.py, not a symlink.

#!/bin/bash
# Real file, not symlink — cron validates the path
# Abort if the directory is missing instead of running in the wrong place
cd ~/.hermes/operator-profile || exit 1
python3 build_profile.py >> ~/.hermes/operator-profile/update.log 2>&1

Lesson: Hermes cron's path validation rejects symlinks. This is presumably a security measure (it prevents privilege escalation through symlink swapping), but it fails silently. Use real files for Hermes cron entries, and check whether your own scheduler applies a similar rule before relying on symlinks.
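When a scheduler rejects symlinks silently, a pre-flight check can surface the problem before the job is registered. A minimal sketch, assuming nothing about Hermes internals: `require_real_file` is a hypothetical helper you would call from your own registration script.

```python
import os
from pathlib import Path


def require_real_file(path: str) -> Path:
    """Return the path if it is a regular, non-symlink file; else raise.

    Hypothetical pre-flight check. is_symlink() runs first because
    is_file() follows symlinks and would report True for a valid link.
    """
    p = Path(path).expanduser()
    if p.is_symlink():
        target = os.path.realpath(p)
        raise ValueError(f"{p} is a symlink to {target}; use a real file")
    if not p.is_file():
        raise FileNotFoundError(f"{p} does not exist or is not a regular file")
    return p


if __name__ == "__main__":
    try:
        require_real_file("~/.hermes/operator-profile/update_profile.sh")
        print("ok: real file")
    except (ValueError, FileNotFoundError) as err:
        print(f"refusing to register: {err}")
```

Failing loudly at registration time turns a job that never runs into an error message you see immediately.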

What Shipped

~/.hermes/operator-profile/
├── PROFILE.md          # operator's voice, scope, financial targets
├── FLEET.md            # 8 agents, routing rules, project bindings
├── PATTERNS.md         # auto-extracted from 868 sessions (3.2s runtime)
├── TOPICS.md           # topic distribution: 64% agent-systems, 28% marketing, 18% code
├── build_profile.py    # stdlib-only nightly regenerator, $0/run
├── route.py            # 60-line classifier, 26/26 tests pass
├── scaffold_fleet.py   # idempotent profile scaffolder
└── update_profile.sh   # real file (not symlink), cron-registered

8 profile directories under ~/.hermes/profiles/, each with SOUL.md, config, skills, cron, memories. 7 cron drop jobs wired to read PROFILE.md before generating — every Telegram push is filtered through the operator's actual voice.

The nightly profile builder runs at 03:30, analyzes session history from state.db, regenerates PATTERNS.md and TOPICS.md. Zero LLM in the update loop. 3.2 seconds, $0.

The Bigger Pattern

The routing problem — "which agent handles this?" — is usually solved with an LLM call. That adds latency, cost, and a failure mode. A keyword-weighted classifier with domain-specific fallbacks handles 95% of cases correctly, runs in microseconds, and has zero runtime dependencies.

The 5% edge cases are where the interesting design lives. The answer isn't a smarter classifier — it's a fallback chain: try keywords, try code signals, try default. Each layer catches what the previous one missed.
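The fallback chain generalizes to a list of layers, each of which either names an agent or passes. The sketch below is illustrative only: `by_keywords` and `by_code_signals` are toy stand-ins with a handful of hard-coded terms, not the weighted scoring from route.py, and `route_with_chain` is a hypothetical name.

```python
from typing import Callable, Optional, Sequence

# A layer returns an agent name, or None to pass to the next layer.
Layer = Callable[[str], Optional[str]]


def by_keywords(prompt: str) -> Optional[str]:
    """Toy stand-in for keyword scoring (two hard-coded terms)."""
    words = prompt.lower().split()
    if "telegram" in words:
        return "tilly"
    if "marketing" in words:
        return "fred"
    return None


def by_code_signals(prompt: str) -> Optional[str]:
    """Toy stand-in for the code-signal fallback."""
    signals = ("traceback", "syntax error", "stack trace")
    return "cody" if any(s in prompt.lower() for s in signals) else None


def route_with_chain(
    prompt: str, layers: Sequence[Layer], default: str = "sonya"
) -> str:
    """Return the first non-None answer, or the default agent."""
    for layer in layers:
        agent = layer(prompt)
        if agent is not None:
            return agent
    return default


if __name__ == "__main__":
    chain = [by_keywords, by_code_signals]
    print(route_with_chain("fix the telegram webhook", chain))    # tilly
    print(route_with_chain("got a syntax error on line 3", chain))  # cody
    print(route_with_chain("what should I eat for lunch", chain))   # sonya
```

Keeping each layer as a plain function means a new fallback is one more entry in the list, and each layer can be tested in isolation.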

Don't use an LLM for routing when a regex will do. Save the LLM calls for the work the agents actually do.