Multi-Agent Autonomous Development: Aileron's Internal Build Stack

The thesis

Aileron’s build velocity is bottlenecked by humans gating routine decisions. The fix is many specialized agents running in parallel: Claude plans and reviews, Codex writes, every agent runs inside a microVM with the safety prompts already turned off because the VM is the trust boundary. A typed task envelope crosses each handoff; the test suite is the only blocking gate. cmux is the macOS-native pane and notification surface we use to see what’s happening. Containers are the perimeter. Roles are how each agent knows its job. The constraint that makes it safe to run unattended — writes are single-threaded per worktree, full traces cross every handoff — comes from Cognition’s June 2025 post-mortem of the parallel-agent failures everyone else learned the hard way.

For the operational walkthrough — install, the per-task loop, parallel workspaces, honest gotchas — see the Multi-Agent Autonomous Development Runbook.

The role architecture

Six roles, three running by default. Each role is a CLI invocation + a role file + a microVM template.

Default role · Claude

Planner

Model: Claude Opus 4.7 at xhigh reasoning, fast-mode off (slow, deliberate, expensive). Tools: read-only by default. Output: a CE plan with Implementation Units, per-unit files, test scenarios, and verification — produced by /ce-plan, written to docs/plans/YYYY-MM-DD-NNN-<type>-<slug>-plan.md. Never edits source files. The role file is a tight CLAUDE.md override that strips coding affordances and amplifies architectural reasoning.

Default role · Codex

Coder

Model: GPT-5-Codex (or GPT-5.5) at model_reasoning_effort = "xhigh", fast-mode on. Invocation: /ce-work-beta <plan-path>, which runs codex exec through our $SANDBOX_FLAG wrapper script (the microVM substitutes for --dangerously-bypass-approvals-and-sandbox) while preserving CE’s JSON output schema. Loop: read plan → edit files → run tests → open PR. Role file: AGENTS.md with strict guidance: “you receive a CE plan from a planner; do not redesign; if the plan is wrong, mark it plan-violation and exit.” Cheap tokens, fast diffs.

Default role · Claude

Reviewer (cross-provider, async)

Model: Claude Sonnet 4.6 (cheaper than Opus, still strong at audit). Mode: reads the diff in a fresh microVM with no write capability. Cross-provider review (Claude reviewing Codex output) is empirically harder to fool with sycophancy bias than same-provider self-review. Output: comments on the PR, structured severity ratings. Soft gate — does not block merge. Tests block merge.

Hard gate

Validator (the test suite)

The only blocking gate. task test in connector repos, go test ./... + project-specific checks elsewhere. Runs inside the coder’s microVM after the diff is staged, then again in CI on the pushed branch. A failing test means the coder loops back to its plan or escalates plan-violation to the planner. The reviewer is advisory; the validator is law.

Phase 5 role

Librarian (memory custodian)

Owns .claude/projects/<repo>/memory/ and per-role memory files. Reconciles new learnings from completed tasks back into project memory, dedupes, prunes stale entries. Read-mostly; writes only to memory paths. Runs after merge, not during the build loop. Justification: without this, every parallel agent re-learns the same lessons.

Phase 4 role

Integrator (merge wrangler)

When two coders’ branches conflict, the integrator rebases one onto the other, regenerates generated files (rather than hand-merging them), re-runs tests, and force-pushes. Same patterns as ~/.claude/CLAUDE.md’s shepherd loop, generalized across parallel branches. Phase 4 because we need to first prove a single planner/coder/reviewer triplet works before adding a fourth role.

Topology — cmux is the surface, containers are the boundary

cmux gives us the macOS-native pane UX. Containers (microVMs by default for strongest isolation; OCI-compatible alternatives for portability) give us the trust boundary. They are different layers and we wire them together explicitly through a wrapper script in aileron-tools.

Layer 1

cmux workspaces

One cmux workspace per active development task. Inside the workspace, each role is a pane: planner pane, coder pane, reviewer pane, log/notification pane. The sidebar surfaces the git branch, PR status, and last cmux notify per workspace — so we glance once and see whether any agent wants attention. We do not rely on cmux for orchestration; it’s the surface.

Layer 2

git worktree per task

Each task gets its own worktree under .claude/worktrees/<task-slug> on a per-task branch. Single-threaded writes per worktree is the Cognition invariant: only one coder agent touches a given worktree. Reviewers and planners read across worktrees; writers stay inside theirs.
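The single-writer rule can be enforced mechanically rather than by convention. The following is a minimal sketch, assuming a lock file inside the worktree; the source does not say how the invariant is enforced, and the lock filename, the exception, and both function names are hypothetical.

```python
import os
from pathlib import Path

LOCK_NAME = ".writer.lock"  # hypothetical filename, not from the source


class WorktreeBusy(Exception):
    """Raised when another coder already holds the worktree."""


def claim_worktree(worktree: Path, agent_id: str) -> Path:
    """Atomically claim write access; fail if a writer is already present."""
    lock = worktree / LOCK_NAME
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise WorktreeBusy(f"{worktree} already has a writer") from None
    with os.fdopen(fd, "w") as fh:
        fh.write(agent_id)
    return lock


def release_worktree(lock: Path) -> None:
    """Give up write access so the next coder can claim the worktree."""
    lock.unlink()
```

Readers (planner, reviewer) never call claim_worktree; only the coder does, which matches the read-across, write-inside split described above.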

Layer 3

microVM per worktree

Each worktree mounts into a containerized environment chosen via the wrapper’s --container-runtime selector: Docker sbx microVMs (default, strongest local isolation), Dagger container-use, Podman, or any other backend the wrapper supports. Default-deny egress with an allowlist (registry.npmjs.org, api.anthropic.com, api.openai.com, github.com IPs, our own infra). Read-only host mounts except for the worktree path. No host credentials inside the container; OAuth tokens and API keys live on the host and are injected via the same boundary-injection pattern we sell in v4. Eventually the wrapper’s --mediation aileron selector loads Aileron-runtime as the in-container mediation layer: the same shape we ship to customers, closing the customer-zero loop literally.
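Default-deny egress reduces to one rule: a destination is reachable only if it is on the list. The sketch below illustrates that rule at the hostname level. It is a simplification, since the source allowlists github.com by IP and also includes our own infra, neither of which is modeled here. The function name is hypothetical, and real enforcement would sit in the container's network layer rather than in Python.

```python
from urllib.parse import urlsplit

# Hostnames taken from the allowlist above; github.com is listed by IP in the
# source and appears here as a hostname only for illustration.
ALLOWED_HOSTS = frozenset({
    "registry.npmjs.org",
    "api.anthropic.com",
    "api.openai.com",
    "github.com",
})


def egress_allowed(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny: only exact hostname matches pass."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    return host is not None and host in allowed
```

Exact matching is deliberate: a suffix match would let api.openai.com.evil.example through.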

Layer 4

CLI in skip-permissions mode

Inside the container, agents run claude --dangerously-skip-permissions or codex exec --yolo --json. We turn the CLI’s own sandbox off on purpose because the container is the perimeter; double-sandboxing produces opaque permission errors (Codex’s docs are explicit on this). Outside a container, this would be unsafe. Inside, it’s the right setting.

The handoff protocol

Three handoffs in the default loop: planner → coder, coder → reviewer, reviewer → human (or auto-merge). All three use a plan-as-leash — CE’s plan.md written to disk and read in full, not summarized. Cognition’s failure mode — “lossy summaries across boundaries” — is what we are explicitly designing against.

Handoff 1

Planner → Coder

Planner runs /ce-plan, which writes a CE plan to docs/plans/YYYY-MM-DD-NNN-<type>-<slug>-plan.md with Implementation Units, per-unit files, test scenarios, and verification. Coder runs /ce-work-beta <plan-path>, which reads the plan in full and dispatches its units (with our microVM at $SANDBOX_FLAG). The full plan crosses the boundary — never a summary; CE polls its JSON output schema rather than a marker file.
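The plan filename convention can be sketched as a small builder and parser. This is an illustration, not CE's implementation: the function names are hypothetical, NNN is assumed to be a zero-padded three-digit sequence number, and <type> is assumed to be a single lowercase word such as feat or fix.

```python
import re
from datetime import date

# Assumed shape of docs/plans/YYYY-MM-DD-NNN-<type>-<slug>-plan.md.
PLAN_RE = re.compile(
    r"^docs/plans/(?P<day>\d{4}-\d{2}-\d{2})-(?P<seq>\d{3})"
    r"-(?P<kind>[a-z]+)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)-plan\.md$"
)


def plan_path(day: date, seq: int, kind: str, slug: str) -> str:
    """Build the plan path the planner writes and the coder reads."""
    if not 0 <= seq <= 999:
        raise ValueError("seq must fit in three digits")
    path = f"docs/plans/{day.isoformat()}-{seq:03d}-{kind}-{slug}-plan.md"
    if PLAN_RE.match(path) is None:
        raise ValueError(f"kind or slug has unexpected characters: {path}")
    return path


def parse_plan_path(path: str) -> dict:
    """Split a plan path back into its day, seq, kind, and slug parts."""
    match = PLAN_RE.match(path)
    if match is None:
        raise ValueError(f"not a plan path: {path}")
    return match.groupdict()
```

A coder-side wrapper could call parse_plan_path before reading the file, so a malformed or summarized handoff fails loudly instead of being executed.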

Handoff 2

Coder → Reviewer

Coder opens the PR with the conventional-commit title and Summary/Test plan body. The reviewer subscribes to pull_request events on our repos (GitHub webhook → small dispatcher) and spawns a fresh review microVM with read-only mounts. Reviewer posts findings via gh pr review. Async, non-blocking on tests.

Handoff 3

Reviewer → Merge

Tests green + reviewer findings ≤ medium severity → coder auto-merges with gh pr merge --squash --admin --delete-branch (our family convention). Findings ≥ high → cmux notification to the human; merge held. Phase 0 keeps every merge under human review; phases below describe when we relax that.

Autonomy invariants — what makes it safe to run unattended

Invariant 1

The microVM is the perimeter

Every documented Claude Code skip-permissions incident (the Oct 2025 rm -rf / and the Dec 2025 home-directory wipe) happened outside a real isolation boundary. Inside a microVM, the worst case is a corrupted ephemeral filesystem we discard. Skip-permissions is safe only because of the layer below it.

Invariant 2

Single-threaded writes per worktree

Cognition’s rule, learned the expensive way. Multiple coders never write to the same worktree. The integrator role (phase 4) is the only place merges happen, and it serializes them.

Invariant 3

Full traces cross every handoff

Typed envelopes carry the planner’s complete reasoning, not a summary. The coder sees what the planner saw. The reviewer sees the planner’s envelope and the coder’s diff. The 17× error-amplification reports from bag-of-agents systems are downstream of summarization-at-boundaries; we avoid the failure by not summarizing.
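One way to make "not a summary" checkable is to carry the full plan text together with a digest of it, so the receiver can detect truncation. The sketch below assumes a JSON envelope. Apart from budget_usd, which the cost-ceiling invariant names, every field and function name here is hypothetical, and the source does not specify the envelope schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskEnvelope:
    task_slug: str      # hypothetical field
    plan_path: str      # hypothetical field
    plan_text: str      # the full plan, never a summary
    plan_sha256: str    # digest of plan_text, checked on receipt
    budget_usd: float   # per-task ceiling named in the source


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def seal(task_slug: str, plan_path: str, plan_text: str, budget_usd: float) -> str:
    """Serialize an envelope for the next role in the chain."""
    env = TaskEnvelope(task_slug, plan_path, plan_text, _digest(plan_text), budget_usd)
    return json.dumps(asdict(env))


def open_envelope(raw: str) -> TaskEnvelope:
    """Parse an envelope and reject it if the plan text was altered or cut."""
    env = TaskEnvelope(**json.loads(raw))
    if _digest(env.plan_text) != env.plan_sha256:
        raise ValueError("plan text does not match its digest")
    return env
```

The digest does not prove the planner wrote a good plan; it only proves that what the coder reads is what the planner wrote.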

Invariant 4

Tests are the only hard gate

Reviewer agents are advisory by design. A reviewer-as-blocker introduces a rubber-stamping risk (the reviewer learns to pass everything because nothing else does) and a deadlock risk (two LLMs disagree, no escalation path). Tests are the law. Reviewer findings inform human attention prioritization. How the supervisor operationalizes this: the reviewer phase calls gh pr review --comment (never --approve or --request-changes); the comment body shows severity + count only (never finding titles, to prevent the reviewer’s output from steering the human’s read of the PR); the auto-merge gate evaluates tests-green via gh api check-suites independently of any reviewer signal. The reviewer cannot block a merge; tests can.
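The severity-plus-count rule for the comment body is easy to get subtly wrong, so here is a minimal sketch of a renderer that cannot leak titles. The severity scale, the finding dictionary shape, and the function name are assumptions; the supervisor's actual format is not shown in this document.

```python
from collections import Counter

# Assumed severity scale, highest first; the source names only medium and high.
SEVERITIES = ("critical", "high", "medium", "low")


def render_review_comment(findings: list) -> str:
    """Return one 'severity: count' line per severity; titles are never read."""
    counts = Counter(f["severity"] for f in findings)
    unknown = set(counts) - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severity: {sorted(unknown)}")
    lines = [f"{sev}: {counts[sev]}" for sev in SEVERITIES if counts[sev] > 0]
    return "\n".join(lines) if lines else "no findings"
```

Because the function only ever reads the severity key, finding titles cannot reach the comment even if the reviewer's output format changes.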

Invariant 5

No host credentials inside the VM

The same shape Aileron sells: credentials live at the boundary, injected per request. The microVM holds API keys for the LLM provider it’s using — nothing else. GitHub tokens injected at git push time via the host’s gh CLI proxied into the VM. This is exactly the v4 credential-sealing model applied to our own dev loop.

Invariant 6

Cost ceiling per task

Steve Yegge’s Gas Town reportedly burns ~$100/hour at 12-30 parallel Claude agents. We set a per-task dollar ceiling (envelope field budget_usd) and a per-day workspace ceiling; the coder exits with budget-exhausted when crossed. Cheap to add early, expensive to retrofit.
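A per-task ceiling amounts to a running total checked on every charge. The sketch below assumes costs are reported to the coder as dollar amounts per model call; the class, method, and exception names are hypothetical, and only budget_usd and the budget-exhausted exit come from the text above.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Signals the budget-exhausted exit described above."""


@dataclass
class TaskBudget:
    budget_usd: float        # per-task ceiling from the envelope
    spent_usd: float = 0.0   # running total for this task

    def charge(self, cost_usd: float) -> None:
        """Record a cost and stop the task once the ceiling is crossed."""
        if cost_usd < 0:
            raise ValueError("cost must be non-negative")
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_usd:
            raise BudgetExhausted(
                f"spent {self.spent_usd:.2f} of {self.budget_usd:.2f} USD"
            )
```

The charge that crosses the ceiling is still recorded before the exception is raised, so the per-day workspace total stays accurate.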

Why Aileron is the right customer-zero

This isn’t tool-shopping. It’s eating our own dog food.

The v4 runtime we’re shipping — shell mediation that defines bash inside the container, credential sealing at the TLS boundary, async approvals on a PTY the agent can’t see, action contracts that declare effects and idempotency — is the substrate this entire build stack needs. Today we hand-roll microVMs, allowlists, and credential injection because the off-the-shelf pieces don’t compose cleanly. Every gap we hit while building the autonomous dev stack is a feature the v4 runtime should have. Customer-zero usage tells us which gaps matter most, and the gaps we close ship to customers as upgrades.

This is also the most credible demo we will ever give a Fortune-500 buyer. “We built our own product using our own runtime, running a dozen agents unsupervised, with credential sealing and action contracts as the trust contract.” The strategy doc on Skills lands harder when our own velocity says the runtime is real.

Rollout plan

Phased, with explicit exit criteria at each phase. We do not move to phase N+1 until phase N’s exit criteria are met. Phases 0–3 are now shipped; Phases 4–5 remain.

Phase 0 · shipped (aileron-internal#34)

Run the CE loop end-to-end with the microVM wedge in place

The loop is /last30days <topic> → /ce-plan → /ce-work-beta (with the microVM at $SANDBOX_FLAG) → tests → PR. The microVM is in place from day one — we are not running CE against the host and then “adding containers later,” because the trust boundary is what makes skip-permissions safe. The plan acts as the leash: it forces the coder to commit to an approach with acceptance criteria and stops it from quietly redesigning mid-build. Exit met: three CE-driven PRs landed — this strategy doc’s own Phase 0–2 rewrite (aileron-internal#34), the wrapper itself (aileron-tools#1), and the supervisor (aileron-tools#4).

Phase 1 · shipped (aileron-tools#1)

File-handoff to a containerized worker via handoff <plan>

CE’s /ce-brainstorm and /ce-plan phases ran on the host; handoff <plan-path> dispatched the work to a containerized Claude session running /ce-work-beta. Phase 2 and 3 absorbed and operationalized the host-side phases (see below), so handoff is now an internal primitive the supervisor’s coder phase delegates to, rather than a user-facing entry point. Honest about gaps: v0 is env-var credential injection (not per-request boundary injection); no egress allowlist; no in-container shell mediation. v1 swaps in aileron-runtime as the in-container mediation layer when it’s ready to host arbitrary agents — a backend flip, not a rewrite.

Phase 2 · shipped (aileron-tools#4)

Per-role configs + chain supervisor

The aileron-agent supervisor in aileron-tools/bin/aileron-agent chains four role-tuned containers per task: brainstormer (Opus, interactive TTY), planner (Opus, headless, --disallowedTools AskUserQuestion), coder (delegates to handoff), reviewer (Sonnet, read-only worktree mount, --output-format json). Role configs live at aileron-tools/roles/<role>/{CLAUDE,AGENTS}.md and are mounted read-only into each phase’s container at /opt/aileron-tools-roles/; the read-only mount is the trust boundary preventing in-container tampering. The aileron-pipeline wrapper-skill plugin (aileron-tools/plugins/aileron-pipeline/) routes pipeline-mode invocations to /ce-brainstorm or /ce-plan per AILERON_PHASE, suppressing the post-generation menus that would otherwise block headless runs. Exit met: the kickoff is aileron-agent task <slug> --idea "…". One command opens the brainstorm dialogue, the supervisor detaches after the requirements doc lands, and the rest of the chain runs headless. The old ~/.zshrc cmux launcher snippet was removed (PR #45+#46); cmux panes now land in a plain shell, and the supervisor brings up Claude only inside the brainstormer container.

Phase 3 · shipped (aileron-tools#4)

Reviewer + auto-merge for safe classes

The reviewer phase runs as part of the supervisor’s chain (not a separate webhook subscription): same shape, simpler ops. Auto-merge fires when four deterministic gates pass: tests-green via gh api /repos/…/check-suites (bounded poll on pending; CodeRabbit filtered out of the tests-green set and evaluated separately); CodeRabbit-clean when present; path policy via gh pr diff --name-only against a hardcoded V0 allowlist (docs/*, *.md, README*, CHANGELOG*, Gemfile.lock, package-lock.json, go.sum); the soak threshold (AILERON_AGENT_SOAK_THRESHOLD=0 opts in). The merge token (AILERON_AGENT_MERGE_TOKEN) is host-side and never enters any container; it’s guarded by a default-deny repo allowlist (AILERON_AGENT_MERGE_REPO_ALLOWLIST). Reviewer findings are advisory only (the comment body shows severity + count, never finding titles), consistent with invariant 4 above. Exit met for landing the capability; the operational measurement of “≥ 50% of PRs auto-merged with no production regressions” only starts after the user opts into the soak ramp.
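The path-policy gate can be illustrated as "every changed path matches at least one allowlist pattern". The sketch below uses Python's fnmatch and is not the supervisor's shell implementation. The function name is hypothetical, and fnmatch semantics are an assumption: here * also matches /, so *.md accepts Markdown files at any depth, which may be looser or stricter than the real policy.

```python
from fnmatch import fnmatchcase

# Patterns copied from the V0 allowlist above.
V0_ALLOWLIST = (
    "docs/*", "*.md", "README*", "CHANGELOG*",
    "Gemfile.lock", "package-lock.json", "go.sum",
)


def paths_auto_mergeable(changed_paths: list, allowlist: tuple = V0_ALLOWLIST) -> bool:
    """True only if the diff is non-empty and every path is allowlisted."""
    paths = [p for p in changed_paths if p]  # drop blank lines from CLI output
    if not paths:
        return False  # conservative assumption: an empty diff never auto-merges
    return all(
        any(fnmatchcase(path, pattern) for pattern in allowlist)
        for path in paths
    )
```

The input would be the newline-split output of gh pr diff --name-only; a single path outside the allowlist holds the whole PR for review.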

Current phase · soak ramp

Build trust in the chain before flipping auto-merge on

Auto-merge is disabled by default: AILERON_AGENT_SOAK_THRESHOLD unset → every chain run that would otherwise auto-merge instead holds-for-review at the soak gate, and the operator merges manually. The threshold is a binary opt-in (set to literal 0 to enable, anything else stays off). The design choice is deliberate per lib/policy.sh KTD7: trust comes from observing the chain behave correctly across real runs, not from reading the code. The walks-away hypothesis (below) only becomes testable after the operator opts in. Exit criterion: the operator manually merges through 5–10 chains where they would have approved the same result, then sets AILERON_AGENT_SOAK_THRESHOLD=0; the next measured window is the first 10 chains under auto-merge.
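The binary opt-in and its place behind the other three gates can be sketched as a single decision function. This is an illustration under assumptions: the outcome strings, the GateInputs type, and the evaluation order are hypothetical, and the real logic lives in lib/policy.sh.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateInputs:
    tests_green: bool
    coderabbit_present: bool
    coderabbit_clean: bool
    paths_allowed: bool
    soak_threshold: Optional[str]  # raw env value; None when unset


def soak_opted_in(raw: Optional[str]) -> bool:
    """Only the literal string '0' enables auto-merge; anything else stays off."""
    return raw == "0"


def gate_outcome(g: GateInputs) -> str:
    """Evaluate the four gates; reviewer findings are deliberately not an input."""
    if not g.tests_green:
        return "hold: tests not green"
    if g.coderabbit_present and not g.coderabbit_clean:
        return "hold: CodeRabbit findings"
    if not g.paths_allowed:
        return "hold: path outside allowlist"
    if not soak_opted_in(g.soak_threshold):
        return "hold-for-review: soak gate"
    return "auto-merge"
```

With the variable unset, a chain that passes every other gate still lands at hold-for-review, which is the behavior the soak ramp depends on.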

Phase 4 · weeks 8+

Parallelize across worktrees

Run 4-5 parallel task workspaces, each with its own chain. Add the integrator role when conflicts start showing up. Measure throughput, merge rate, cost per merged PR. Exit: sustained 4-task parallelism for a week with cost under target.

Phase 5 · ongoing

Librarian + role-specific memory

Introduce the librarian after merge to reconcile new learnings into .claude/projects/<repo>/memory/. Split memory files by role (planner_memory, coder_memory, reviewer_memory) so each role only loads what it needs. Exit criterion is open — this is where the stack starts to compound.

Walks-away hypothesis — a premise being measured

The strategy doc’s recurring framing — that we save engineering time because operators can walk away during the headless phases — is a hypothesis being measured, not a settled premise. If operators hover on the pane watching aileron-agent status rather than walking away during the planner / coder / reviewer phases, automating the handoffs doesn’t free attention; it changes what’s hovered over. The signal for that failure mode is the soak ramp’s measurement window: of the first 10 chain runs after AILERON_AGENT_SOAK_THRESHOLD=0, on at least 5 the operator reports having left the terminal for >15 minutes during the headless phases without checking on it. Below 5/10, surface the result as evidence the bottleneck is trust (not orchestration) and redirect the next iteration’s investment toward visibility (better notifications, intermediate-state summaries, replay) rather than more automation. The metric is observer-aware and n=1; treat it as directional rather than statistical, but treat seriously what the direction implies.
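The measurement rule is simple enough to write down exactly, which keeps the verdict from drifting after the fact. A minimal sketch, assuming each chain run is recorded as the operator's self-reported minutes away from the terminal; the function name and verdict strings are hypothetical.

```python
def walks_away_verdict(
    minutes_away: list,
    window: int = 10,            # first 10 chain runs under auto-merge
    min_walkaways: int = 5,      # at least 5 of those runs
    threshold_min: float = 15.0  # operator away for more than 15 minutes
) -> str:
    """Classify the measurement window per the rule described above."""
    if len(minutes_away) < window:
        return "insufficient-data"
    sample = minutes_away[:window]  # later runs are outside the window
    walked = sum(1 for minutes in sample if minutes > threshold_min)
    if walked >= min_walkaways:
        return "supports-walks-away"
    return "redirect-to-visibility"
```

Exactly 15 minutes does not count as walking away, matching the ">15 minutes" wording; with n=1 operator the output is directional only.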

The soak ramp + the walks-away measurement together make the recurring “save engineering time” claim falsifiable. Until that measurement runs, the headline number (“two typed commands per task in steady state”) describes mechanical compression, not freed attention — and the difference matters.

Open questions for review

Local microVMs vs Anthropic Managed Agents

Managed Agents (gVisor isolation, default-deny egress, scoped /workspace) is the production-grade hosted path. Tradeoff: lower setup cost vs. less control and a dependency on Anthropic’s plumbing. Recommendation: local for phase 0-2, evaluate Managed for phase 4+.

How aggressive on auto-merge

Per-repo policy file declares which paths are auto-mergeable (docs, action.md, README, generated files) and which require human review (runtime/, shell-mediation/, credential paths). Reviewer agent surfaces the policy decision in its findings.

Memory contention across parallel agents

If two agents in parallel workspaces both want to update project memory, who wins? Open question: serialize memory writes through the librarian (phase 5), or allow per-workspace memory branches that merge later?

What we are explicitly not doing

Not building cmux features. cmux is a primitive we use; if it’s missing capabilities (worktree management, multi-pane orchestration, cross-workspace cookies), we layer our own thin tooling rather than upstreaming.
Not building a new orchestration framework. AutoGen, MetaGPT, LangGraph, ralph-style bash loops all exist; we use file-based handoffs because they survive process restarts and are inspectable. The orchestration is filesystem + shell, not a framework.
Not centralizing the planner. Each task has its own planner instance with its own envelope; we are not building a “master planner” that decomposes the whole roadmap. Cognition’s Flappy Bird is exactly what happens when one planner fans out to many writers without shared context.
Not relaxing the v4 trust contract. Every invariant the v4 runtime gives our customers — credential boundary, shell mediation, async approvals — applies to our own dev loop. If we cut a corner internally, we’re admitting the corner can be cut for customers too.
Not optimizing for agent-hours. This stack exists to ship Aileron faster, not to maximize how many parallel agents are humming. If six agents are running and the output isn’t moving the product forward, that’s a signal to redirect, not to add a seventh. Agent throughput is a means; merged PRs that customers feel are the end.

Risks and what would invalidate this plan

Cost runs away faster than throughput grows

If Gas Town’s reported ~$100/hour at 12-30 parallel agents is representative, the cost per merged PR could exceed the engineering cost it replaces. Phase 4’s exit criterion is “cost under target”, so we set that target before phase 4 starts (proposal: $5 per merged PR median, $20 ceiling).
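The proposed target can be checked with a few lines once per-PR costs are recorded. The sketch below assumes the $20 ceiling applies to each individual merged PR, which is one reading of the proposal and not something the text settles; the function name is hypothetical.

```python
from statistics import median


def cost_under_target(
    cost_per_merged_pr: list,
    median_target_usd: float = 5.0,  # proposed median per merged PR
    ceiling_usd: float = 20.0        # assumed to be a per-PR maximum
) -> bool:
    """True when the median meets the target and no single PR exceeds the ceiling."""
    if not cost_per_merged_pr:
        return False  # no merged PRs means the target has not been demonstrated
    return (
        median(cost_per_merged_pr) <= median_target_usd
        and max(cost_per_merged_pr) <= ceiling_usd
    )
```

Using the median rather than the mean keeps one runaway task from masking an otherwise healthy week, while the ceiling still catches that runaway task.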

Reviewer agents rubber-stamp

Cross-provider review (Claude reviewing Codex) is more robust than same-provider, but the failure mode is real. Mitigation: periodic adversarial test PRs (intentional bugs) seeded by the human to verify the reviewer catches them. If catch rate drops below 80%, escalate or rotate models.

microVM ergonomics on macOS

Docker sbx is new (March 2026); driver issues on Apple Silicon are still being reported. Fallback plan: Lima + Dagger container-use if sbx proves unstable in phase 1.

cmux instability

~2,400 open issues against ~21k stars; fast-moving, rough edges. Risk to phase 2. Fallback: tmux + scripts; everything we build above the surface layer is cmux-agnostic on purpose, so the swap is mechanical.

Decision

Phases 0–3 shipped (aileron-internal#34, aileron-tools#1, aileron-tools#4). The active decision now is the soak ramp opt-in: the operator manually merges through 5–10 chain runs where they would have approved the same result, then sets AILERON_AGENT_SOAK_THRESHOLD=0 to enable auto-merge. The first 10 chain runs under auto-merge are the walks-away measurement window — if fewer than 5/10 result in the operator leaving the terminal for >15 minutes during the headless phases, the next iteration’s investment redirects toward visibility (better notifications, intermediate-state summaries) rather than additional automation. Phase 4 (parallelization) and Phase 5 (librarian) remain queued behind that measurement.

This document’s Phases 0–2 rewrite, the handoff wrapper (aileron-tools#1), and the chain supervisor (aileron-tools#4) were each produced via the loop they describe — /ce-plan produced the planning artifact in docs/plans/, /ce-work-beta (and later the supervisor itself) executed it. The strategy is its own first deliverable; the supervisor is the second.

References

mvanhorn’s practitioner playbook: Every Agentic Engineering Hack I Know — June 2026 (by @mvanhorn)
Aider Architect/Editor: aider.chat/2024/09/26/architect.html, edit formats
Cognition’s multi-agent post-mortem: Don’t Build Multi-Agents, Multi-Agents Working
Anthropic patterns: Building Effective Agents, Multi-Agent Research System, Auto Mode
cmux: manaflow-ai/cmux, cmux.com
Docker microVMs: Docker Sandboxes (sbx), microVM architecture
Dagger Container Use: dagger.io/blog/agent-container-use
Codex CLI: reference, sandboxing, non-interactive, agent approvals & security
Steve Yegge’s Gas Town: steve-yegge.medium.com/welcome-to-gas-town, github.com/steveyegge/gastown
Geoffrey Huntley’s Ralph: ghuntley.com/ralph, ghuntley.com/loop
Mitchell Hashimoto on agent harnesses: mitchellh.com/writing/my-ai-adoption-journey, Zed: Agentic engineering with Mitchell Hashimoto
Addy Osmani’s Code Agent Orchestra: addyosmani.com/blog/code-agent-orchestra
Simon Willison on Anthropic’s local sandbox: How we contain Claude across products