Multi-Agent Autonomous Development: Runbook

The runbook

This is the operational companion to Multi-Agent Autonomous Development — the strategy doc that explains why we built this stack. This doc explains how to use it: one-time setup, the per-task loop, the soak period, parallel tasks across cmux workspaces, honest gotchas, and what’s still deferred.

The shape, as of aileron-tools#4: one cmux pane kicks off aileron-agent task <slug>, you converse with the brainstormer, the chain detaches, and you get a notification at a terminal state (done-merged, done-held-for-review, or done-failed). The supervisor chains four role-tuned containers — brainstormer → planner → coder → reviewer — and an auto-merge gate evaluated outside the container trust boundary via gh api. The coder phase delegates to the existing handoff primitive; that’s where v0’s containerization wedge lives.

Phase 0 closed (aileron-internal#34); Phases 2 and 3 of the strategy doc’s rollout plan are now shipped behind the aileron-agent supervisor. The next jump — fully unattended operation — is gated on your own trust in the chain (the soak ramp), not on more code.

One-time setup

Four small configuration steps. Once these are in place you don’t touch them again. ~15 minutes (most of which is the bats install + first image build, both backgroundable).

Step 1

Install both binaries

From a clone of aileron-tools:

cd ~/git/ALRubinger/aileron-tools
./install.sh

install.sh symlinks ~/.local/bin/handoff and ~/.local/bin/aileron-agent to their sources in the repo. ~/.local/bin is the standard XDG user-binary location and is usually already on PATH; install.sh tells you if not and shows the exact line to add to your shell rc. In a fresh shell, run aileron-agent --help and handoff --help to verify. The supervisor (aileron-agent) is what you’ll use day-to-day; handoff stays around as the work-execution primitive the coder phase delegates to.

Step 2

Three GitHub credentials + two API keys

In ~/.zshrc (or sourced from a secrets file / Keychain). The chain’s minimum-privilege contract requires three distinct GitHub tokens:

# Container creds (the coder container uses these for git push + gh pr create):
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-proj-...
export GH_TOKEN=...                    # contents:write + pull_requests:write

# Reviewer container's token (separate scope, comment-only):
export REVIEWER_GH_TOKEN=...           # pull_requests:write only

Use fine-grained PATs, not classic. Scope each token to exactly what its role needs (the comments above list the scopes). The host’s default GH_TOKEN is the coder container’s only GitHub credential; the reviewer container never sees it. The merge token (Step 3) is host-side only and never enters any container. The supervisor validates each token as it’s needed; a missing token causes a hold-for-review rather than a crash.
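A pre-flight check can catch an unset variable before you kick off a chain. This is a minimal sketch, assuming you only want to confirm the four container-side variables are non-empty. The function name check_chain_env is hypothetical and not part of aileron-agent, and the check says nothing about whether a token's scopes are correct.

```shell
# Hypothetical pre-flight helper; not part of aileron-agent.
# Prints the name of each unset or empty variable and returns 1 if any is missing.
check_chain_env() {
  missing=0
  for name in ANTHROPIC_API_KEY OPENAI_API_KEY GH_TOKEN REVIEWER_GH_TOKEN; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_chain_env in a fresh shell after editing ~/.zshrc; no output and a zero exit status mean all four variables are present.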

Step 3

Host-side merge credential + allowlist

This step replaces the old ~/.zshrc cmux launcher snippet, which was removed when the auto-launch-into-Claude pattern lost its rationale (the brainstormer now lives inside a container the supervisor invokes; cmux panes can land in plain shell). In ~/.zshrc:

# Host-side merge token — fine-grained PAT scoped to the repos below.
# NEVER use a classic PAT here. macOS Keychain preferred over plaintext rc.
export AILERON_AGENT_MERGE_TOKEN=...

# Default-deny merge allowlist. Space-separated owner/repo entries.
# Supervisor refuses to gh pr merge if PR's target repo isn't in this list.
export AILERON_AGENT_MERGE_REPO_ALLOWLIST="ALRubinger/aileron-tools ALRubinger/aileron-connector-google ALRubinger/aileron-connector-slack ALRubinger/aileron-connector-bluebubbles"

The merge token is the most consequential credential in the stack — it’s the only thing that can land code on a default branch. Keep its scope narrow and rotate it on a normal cadence. Without these two vars, every chain run holds-for-review at the merge gate by design (default-deny).
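The default-deny behaviour can be illustrated with a few lines of shell. This is a sketch of the idea only, assuming exact owner/repo string matching; repo_allowed is a hypothetical name and the supervisor's real check may be implemented differently. It relies on word splitting of the unquoted variable, so it is written for sh or bash (zsh does not split unquoted variables by default).

```shell
# Hypothetical sketch of a default-deny allowlist check.
# Returns 0 only when $1 exactly matches one space-separated entry.
repo_allowed() {
  target=$1
  # An unset or empty allowlist yields zero iterations, so nothing is allowed.
  for entry in ${AILERON_AGENT_MERGE_REPO_ALLOWLIST:-}; do
    if [ "$entry" = "$target" ]; then
      return 0
    fi
  done
  return 1
}
```

Because the comparison is exact, a prefix such as ALRubinger/aileron does not match ALRubinger/aileron-tools.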

Step 4

Sound hook on Stop event

In ~/.claude/settings.json:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "afplay /System/Library/Sounds/Blow.aiff" }
        ]
      }
    ]
  }
}

Still useful — the supervisor itself uses cmux notify for terminal-state notifications (or stderr fallback when outside cmux), but the Stop sound is the secondary signal for “interactive Claude in front of you just finished its turn.” Pick any sound you don’t mind hearing many times a day.

Verify the install (optional, ~3 min):

cd ~/git/ALRubinger/aileron-tools
bats test/handoff.bats test/aileron-agent.bats   # 160+ unit tests, no container needed
bats test/image.bats                              # 16 tests, builds the image (5-10 min first time)

Then aileron-agent --help and aileron-agent status (run from any worktree; the latter exits 64 with “no chain in flight” — expected when nothing is running).

The basic loop — kickoff, walk away, get a notification

The supervisor takes care of everything between the brainstorm dialogue and the terminal state. In steady state (after you’ve opted into auto-merge via the soak ramp), there are two typed commands per task: kickoff and check-notification.

One pane to launch, then headless

┌──────────────────────────────────────────────────────────────────┐
│  cmux workspace: "fix-google-readme-typo"                        │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Step 1 — foreground:  aileron-agent task fix-typo --idea  │  │
│  │            you converse with the brainstormer container    │  │
│  │            (TTY-attached; Claude + /ce-brainstorm)         │  │
│  │                                                            │  │
│  │  Step 2 — detach notice — chain running headless           │  │
│  │            planner → coder → reviewer → auto-merge gate    │  │
│  │            all run in the background                       │  │
│  │                                                            │  │
│  │  Step 3 — desktop notification at terminal state           │  │
│  │            done-merged  /  done-held-for-review  /         │  │
│  │            done-failed                                     │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

Walkthrough — fixing a typo in a connector repo’s README as a concrete example.

Open a new cmux workspace named after the task (e.g., connector-google-readme-typo). The pane lands in a plain shell. Set up the worktree, then kick off the chain:

cd ~/git/ALRubinger/aileron-connector-google
git worktree add .claude/worktrees/readme-typo -b fix/readme-typo
cd .claude/worktrees/readme-typo

aileron-agent task fix-readme-typo --idea "Fix the README typo: 'implments' should be 'implements'. Trivial docs-only change."

The supervisor validates credentials, initializes its state file at .claude/agent-state/state.json, and takes over the pane with the brainstormer container. You converse with Claude inside (it has the aileron-pipeline wrapper + the CE plugin) until it writes docs/brainstorms/<slug>-requirements.md. For a trivial task that’s a question or two and you’re done.

The supervisor sees the requirements doc land, prints the detach notice:

task fix-readme-typo: brainstorm complete — chain running headless
log: .claude/agent-state/logs/fix-readme-typo-continue-<timestamp>.log

…and the foreground process exits. Your pane is yours again. You can close it, switch workspaces, or open aileron-agent status periodically to see where the chain is:

aileron-agent status
# slug=fix-readme-typo phase=code pr=42 last_exit=null time=14:23:01 log=...
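If you want to script against that status line, a small parser is enough. This is a minimal sketch, assuming the output keeps the space-separated key=value shape shown above and that values contain no spaces. status_field is a hypothetical helper, not an aileron-agent subcommand.

```shell
# Hypothetical helper: print the value of one key from a
# "key=value key=value ..." line read on stdin.
status_field() {
  key=$1
  # Put each pair on its own line, then print the value for the requested key.
  tr ' ' '\n' | sed -n "s/^${key}=//p"
}
```

For example, piping the status line into status_field phase prints the current phase, which is handy in a polling loop or a shell prompt.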

In the background, the supervisor runs the planner (Opus, headless, --disallowedTools AskUserQuestion as the structural backstop), then the coder (which delegates to handoff — same container shape you knew before), then the reviewer (Sonnet, read-only worktree mount, posts a single non-approving PR comment with severity + count only). Then the auto-merge gate evaluates the deterministic signals — tests-green via gh api check-suites, CodeRabbit-clean (when present), path policy (every changed file must match the hardcoded docs/*, *.md, README*, etc. allowlist), and the soak threshold env var. All gates pass → squash-merge with --admin --delete-branch using the host-side merge token. Any gate fails → held-for-review with a clear reason.
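The path-policy gate is the easiest of these signals to picture in code. The sketch below is an illustration under stated assumptions, not the supervisor's implementation: it includes only the three patterns named above (the real hardcoded list has more), the function names are hypothetical, and it uses shell case matching, where * also matches across / so *.md accepts Markdown files at any depth.

```shell
# Hypothetical sketch of an "every changed file must match" path policy.
# Only the three patterns named in the runbook are included here.
path_is_safe() {
  case "$1" in
    docs/*|*.md|README*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads changed paths on stdin, one per line.
# Returns 0 only if every path matches; the first miss fails the gate.
all_paths_safe() {
  while IFS= read -r changed; do
    path_is_safe "$changed" || return 1
  done
  return 0
}
```

A single file outside the allowlist fails the whole change set, which matches the "any gate fails → held-for-review" behaviour described above.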

You get a notification:

[notify] task fix-readme-typo: auto-merged — PR #42, commit abc1234

Done. The pane is still yours; close the workspace or move on.

For non-trivial work, the same shape holds: the brainstorm dialogue takes longer, the planner produces a meatier plan, the coder runs longer, but you don’t change anything you type. You kick off, walk away, get the notification.

During the soak period

Auto-merge is OFF by default. Until you set AILERON_AGENT_SOAK_THRESHOLD=0, every chain run that would otherwise auto-merge instead holds-for-review at the soak gate — even when tests are green, CodeRabbit is clean, and the path policy is satisfied. The notification reads:

[notify] task fix-readme-typo: held-for-review (soak ramp not enabled
         (set AILERON_AGENT_SOAK_THRESHOLD=0 to opt in); reviewer:
         severity=low, findings=1).

You then merge manually with gh pr merge <N> --squash --admin --delete-branch after spot-checking the PR.

Why this exists. The strategy doc’s central claim — that we can walk away during the headless phases — is a hypothesis, not a settled premise. Trust in the chain comes from observing it behave correctly over real runs, not from reading the code. The soak ramp is the place that hypothesis becomes testable: you opt in after seeing N successful runs of your choosing.

When to opt in. Recommendation: after 5–10 successful chain runs where you would have merged the result yourself anyway. Add to ~/.zshrc:

export AILERON_AGENT_SOAK_THRESHOLD=0

Restart your shell or source ~/.zshrc. The next chain run that passes all the other gates auto-merges.

You can opt out at any time by unsetting the variable: there’s no persistent counter and no host state to clear. The threshold is binary by design (KTD7 in the plan); future iterations may interpret a non-zero number as a per-host run counter, but v0 doesn’t maintain one.
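The binary nature of the threshold can be sketched as a single comparison. This is an illustration only: soak_gate is a hypothetical name, and treating every value other than 0 (including non-zero numbers) as "not opted in" is an assumption about v0 behaviour rather than something confirmed here.

```shell
# Hypothetical sketch of the soak gate decision.
# Assumption: only the exact value "0" opts in; anything else holds for review.
soak_gate() {
  if [ "${AILERON_AGENT_SOAK_THRESHOLD:-}" = "0" ]; then
    echo "auto-merge"
  else
    echo "held-for-review"
  fi
}
```

Because the decision reads the environment on each call, unsetting the variable takes effect on the next run with nothing else to clean up.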

Parallel tasks — multiple workspaces

When handoff was the whole user-facing surface, the bottleneck in the basic loop was waiting on it in the foreground. With the supervisor detaching after the brainstorm, that wait moves to the background: you can kick off the next task’s brainstormer while a previous task’s coder is still grinding inside a container. Parallelism comes from running multiple cmux workspaces, not multiple panes per workspace.

Multiple workspaces, glance across

┌─────────────────────────────┬─────────────────────────────┐
│ cmux workspace A            │ cmux workspace B            │
│ fix-google-readme-typo      │ slack-add-channel-action    │
│ status: chain detached ⏳   │ status: in brainstorm       │
├─────────────────────────────┼─────────────────────────────┤
│ cmux workspace C            │ cmux workspace D            │
│ bluebubbles-fix-race        │ hub-docs-refresh            │
│ status: held-for-review 🔔  │ status: chain detached ⏳   │
└─────────────────────────────┴─────────────────────────────┘

Open a new workspace per task. cmux’s sidebar surfaces the branch, PR status, and last cmux notify per workspace — you glance once across the four and see which want attention. The sound hook (Step 4) pings when any pane’s Claude finishes its turn — and the supervisor itself pings via cmux notify at each terminal state.

Practical pattern. Kick off task A’s brainstormer in workspace A; converse, see the detach notice, switch to workspace B; kick off task B’s brainstormer there; etc. Each chain’s background phases compete for Docker / API quota but don’t compete for your attention until they post a terminal-state notification. The strategy doc’s target is 4–5 parallel workspaces. Past 6 you’re paying for tokens faster than you can absorb the output — that’s the Gas Town problem the strategy doc warns about.

Honest gotchas

Things to know going in. Most are documented in the strategy doc’s Autonomy Invariants, the aileron-tools README, and the plan that landed the supervisor, but worth surfacing in one place.

Workflow

Foreground during brainstorm, background after

The brainstormer is the only phase you converse with — the supervisor TTY-attaches it to your pane. Once it writes the requirements doc, the supervisor detaches via nohup + disown and the foreground process exits. The rest of the chain (planner / coder / reviewer / merge gate) runs in the background; you find out it’s done via desktop notification. SIGHUP from a cmux pane close won’t kill the background process — the detached child re-installs its own cleanup traps on entry.
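The nohup + disown pattern mentioned above looks roughly like this in isolation. It is a generic sketch of the shell technique, not the supervisor's code: detach is a hypothetical helper, and the cleanup traps the real detached child installs are not shown.

```shell
# Hypothetical sketch of detaching a long-running command from the launching shell.
# Usage: detach <logfile> <command> [args...]
detach() {
  log=$1
  shift
  # nohup makes the child ignore SIGHUP; redirecting stdin, stdout, and stderr
  # releases the terminal so closing the pane does not affect the child.
  nohup "$@" >"$log" 2>&1 </dev/null &
  # disown is a bash/zsh builtin that removes the job from the shell's job
  # table; skip it in shells that do not provide it.
  if command -v disown >/dev/null 2>&1; then
    disown
  fi
}
```

The launching shell returns immediately, and everything the background command prints ends up in the log file.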

Workflow

Walk-away is a hypothesis, not a settled premise

The strategy doc’s claim that you can leave the terminal for >15 minutes during the headless phases is a hypothesis being measured, not a given. If you hover on the pane watching aileron-agent status rather than walking away, that’s a signal the bottleneck is trust (not orchestration) and the next iteration’s investment should go there. The soak ramp is the place this becomes testable across runs — by design, the measurement only starts after you opt into auto-merge.

Concurrency

One container per worktree, per phase

The supervisor’s container-conflict check refuses to launch any phase whose container name is already taken (aileron-agent-<slug>-<phase>). If aileron-agent resume finds a stuck container from a prior attempt, it exits cleanly with “stop the existing container” rather than colliding. Generalizes the older “one container per worktree” rule from handoff’s idempotency check — same Cognition invariant, applied per-phase.
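The naming scheme makes the conflict check easy to reproduce by hand. The following is a sketch under assumptions: the function names are hypothetical, and checking stopped containers as well as running ones (docker ps -a) is a guess at what the supervisor does, not a confirmed detail.

```shell
# Build the per-phase container name described above:
# aileron-agent-<slug>-<phase>
container_name() {
  printf 'aileron-agent-%s-%s\n' "$1" "$2"
}

# Hypothetical check: returns 0 when a container with exactly that name
# exists. Assumption: stopped containers count as conflicts too (-a).
container_exists() {
  name=$(container_name "$1" "$2")
  docker ps -a --format '{{.Names}}' | grep -qxF "$name"
}
```

For example, container_exists fix-readme-typo code succeeds only if a container named aileron-agent-fix-readme-typo-code is present, in which case you would stop or remove it before resuming.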

Verification

Three Resolve-Before-Planning OQs

The brainstorm that produced the supervisor surfaced three high-risk substrate uncertainties that had to be verified before any U6+ code commits could land: (1) interactive Claude + AskUserQuestion renders over a TTY-attached docker run -it; (2) the --model claude-opus-4-7 / claude-sonnet-4-6 strings resolve cleanly; (3) the aileron-pipeline wrapper skill triggers ce-plan’s pipeline mode end-to-end. Verification checklists land in aileron-tools/test/manual/u1-tty-smoke.md and u2-pipeline-verify.md; the U1+U2 gate is enforced in the plan’s Phase A intro.

Ergonomics

Shell pane vs. Claude pane

cmux panes now land in plain shell (the CMUX_SURFACE_ID-gated auto-launch-into-Claude snippet is gone). The supervisor is what brings up Claude — inside a container, only for the brainstormer phase, and only after you type aileron-agent task. If you want interactive Claude on the host (not in a container) for ad-hoc work, run claude directly. The supervisor only manages chain-state Claude; everything else is yours.

Timing

First container build is slow

Expect 5-10 minutes the first time the supervisor or handoff runs on a machine (or after --rebuild-image on the latter). The worker image pulls node:22 + golang:1.23, installs Claude Code + Codex + git + gh + golangci-lint + Task, git-clones the CE plugin, and bundles the four role configs + the aileron-pipeline wrapper plugin. Subsequent runs reuse Docker’s layer cache. Don’t kick off your first chain five minutes before a meeting.

Safety

The chain refuses to commit to main

Per family convention propagated through ~/git/ALRubinger/CLAUDE.md, the in-container Claude session creates a feature branch if the worktree is on main. If you set up the worktree on a named branch up front (git worktree add … -b fix/foo), the chain commits there directly — that’s the recommended flow. The auto-merge gate’s path policy + allowlist + soak threshold are additional belt-and-suspenders on top. Don’t fight any of them.

What’s still deferred

The supervisor closed three of the previously-listed deferrals (per-role configs, reviewer-on-PR, auto-merge for safe path classes). What remains:

Tracked

Passive watcher daemon

Drop a plan into a watched directory, the chain kicks off without typing aileron-agent task. Issue #30 on the ingress side; ties into the broader Ingress Manifests product strategy. v0 is explicit-command-only on purpose — the kickoff is the place the human declares intent.

Phase 5

Librarian + role-specific memory

After each merge, a librarian agent reconciles learnings from completed tasks back into docs/solutions/ (the institutional-memory directory CE expects). Per-role memory files (planner_memory, coder_memory, reviewer_memory) so each role only loads what it needs. This is where the stack starts to compound.

Plan deferral

Cost-ceiling enforcement

The supervisor’s state file tracks cost_tokens per phase + cumulative, but only observationally. Held-for-review on overrun (strategy invariant 6) needs a default-value mechanism, override semantics, and per-phase vs cumulative semantics designed up front. Deferred to a follow-up; invariant 6 stays open with the other v0 trust gaps until then.

Plan deferral

Per-repo auto-merge policy file

v0 ships ONE hardcoded path-policy glob list across all repos. A connector repo wanting different rules from a docs repo would need a per-repo override file. Decision punted: see what real per-repo divergence looks like across the first few months of soak-period merges before designing the override.

Quick reference

Steady-state, soak enabled:

# Once per task — kick off + walk away:
cd <worktree>
aileron-agent task <slug> --idea "<one-liner>"
# converse with brainstormer; chain detaches on first artifact
# (notification at terminal state — auto-merged or held-for-review)

During the soak period (before opting into auto-merge):

# Same kickoff command — but expect a held-for-review terminal state
# every time. Then manually:
gh pr merge <N> --squash --admin --delete-branch

Checking on / resuming a chain:

aileron-agent status            # current worktree's state
aileron-agent resume            # validate artifacts + re-dispatch from current phase

Headless kickoff (no brainstorm dialogue):

aileron-agent task <slug> --non-interactive --idea "<one-liner>"

That’s the whole loop. The strategy doc explains why we built it this way; this runbook explains how you actually use it. Pair them.

References

Multi-Agent Autonomous Development — the strategy doc that explains why
aileron-tools — the repo containing the supervisor, the role configs, the wrapper plugin, and the worker image
aileron-tools/README — install, full usage, token scope, v0 trust gaps, testing
aileron-tools#1 — initial handoff wrapper implementation
aileron-tools#2 — Go toolchain added to worker image
aileron-tools#3 — install.sh + symlink-safe bin/handoff
aileron-tools#4 — aileron-agent chain supervisor (U1–U16 of the multi-agent autonomous chain plan)
compound-engineering-plugin — the upstream plugin bundled in the worker image
manaflow-ai/cmux — the macOS terminal we use as the surface

This document, like its parent strategy doc, was produced via the loop it describes.