I Ship Software with 13 AI Agents. Here's What That Actually Looks Like

This is my terminal right now.

[Screenshot: the 13-agent tmux session]

13 Claude Code agents, each in its own tmux pane, working on the same codebase. Not as an experiment. Not as a flex. This is how I ship software every single day.

The project is Beadbox, a real-time dashboard for monitoring AI coding agents. It's built by the very agent fleet it monitors. The agents write the code, test it, review it, package it, and ship it. I coordinate.

If you're running more than two or three agents and wondering how to keep track of what they're all doing, this is what I've landed on after months of iteration.

The Roster

Every agent has a CLAUDE.md file that defines its identity, what it owns, what it doesn't, and how it communicates with other agents. These aren't generic "do anything" assistants. Each one has a narrow job and explicit boundaries.

| Group        | Agents           | What they own                                     |
| ------------ | ---------------- | ------------------------------------------------- |
| Coordination | super, pm, owner | Work dispatch, product specs, business priorities |
| Engineering  | eng1, eng2, arch | Implementation, system design, test suites        |
| Quality      | qa1, qa2         | Independent validation, release gates             |
| Operations   | ops, shipper     | Platform testing, builds, release execution       |
| Growth       | growth, pmm, pmm2 | Analytics, positioning, public content           |

The key word is boundaries. eng2 can't close issues. qa1 doesn't write code. pmm never touches the app source. Super dispatches work but doesn't implement. The boundaries exist because without them, agents drift. They "help" by refactoring code that didn't need refactoring, or closing issues that weren't verified, or making architectural decisions they're not qualified to make.

Every CLAUDE.md starts with an identity paragraph and a boundary section. Here's an abbreviated version of what eng2's looks like:

## Identity
Engineer for Beadbox. You implement features, fix bugs, and write tests.
You own implementation quality: the code you write is correct, tested,
and matches the spec.

## Boundary with QA
QA validates your work independently. You provide QA with executable
verification steps. If your DONE comment doesn't let QA verify without
reading source code, it's incomplete.

This pattern scales. When I started with 3 agents, they could share a single loose prompt. At 13, explicit roles and protocols are the difference between coordination and chaos.

The Coordination Layer

Three tools hold the fleet together.

beads is an open-source, Git-native issue tracker built for exactly this workflow. Every task is a "bead" with a status, priority, dependencies, and a comment thread. Agents read and write to the same local database through a CLI called bd.

bd update bb-viet --claim --actor eng2   # eng2 claims a bug
bd show bb-viet                           # see the full spec + comments
bd comments add bb-viet --author eng2 "PLAN: ..."  # eng2 posts their plan

gn / gp / ga are tmux messaging tools. gn sends a message to another agent's pane. gp peeks at another agent's recent output (without interrupting them). ga queues a non-urgent message.

gn -c -w eng2 "[from super] You have work: bb-viet. P2."  # dispatch
gp eng2 -n 40                                               # check progress
ga -w super "[from eng2] bb-viet complete. Pushed abc123."  # report back

CLAUDE.md protocols define escalation paths, communication format, and completion criteria. Every agent knows: claim the bead, comment your plan before coding, run tests before pushing, comment DONE with verification steps, mark ready for QA, report back to super.
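
That protocol is compact enough to sketch as a single shell function. This is a hedged sketch, not the agents' actual tooling: the bd and ga calls mirror the examples above, but the test runner and the exact QA hand-off mechanism are assumptions, and nothing runs until you call work_bead.

```shell
lifecycle_steps() {            # the protocol steps, in order
  printf '%s\n' claim plan test push done-comment ready-for-qa report
}

work_bead() {                  # work_bead BEAD ACTOR  (sketch only)
  bead=$1; actor=$2
  bd update "$bead" --claim --actor "$actor"                   # 1. claim the bead
  bd comments add "$bead" --author "$actor" "PLAN: ..."        # 2. plan before coding
  # ...implement...
  npm test || return 1                                         # 3. tests before pushing (assumed runner)
  git push                                                     # 4. push
  bd comments add "$bead" --author "$actor" \
    "DONE: pushed $(git rev-parse --short HEAD). Verify: ..."  # 5. DONE + verification steps
  # 5b. mark ready for QA (mechanism depends on your bd setup)
  ga -w super "[from $actor] $bead complete."                  # 6. report back
}
```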

Super runs a patrol loop every 5-10 minutes: peek at each active agent's output, check bead status, verify the pipeline hasn't stalled. It's like a production on-call rotation, except the services are AI agents and the incidents are "eng2 has been suspiciously quiet for 20 minutes."
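
The patrol itself can be sketched in a few lines. gp and gn are the messaging tools from above; stalled_for is pure shell, so the "suspiciously quiet" rule is easy to inspect on its own. The agent names and the 20-minute threshold are illustrative.

```shell
stalled_for() {                # stalled_for LAST_ACTIVE_EPOCH NOW_EPOCH THRESHOLD_MIN
  [ $(( ($2 - $1) / 60 )) -ge "$3" ]
}

patrol() {
  for agent in eng1 eng2 qa1 qa2 ops shipper; do
    gp "$agent" -n 40          # peek at recent output without interrupting
    # if the output hasn't changed and stalled_for says 20+ minutes:
    #   gn -c -w "$agent" "[from super] Status check: are you blocked?"
  done
  bd list                      # scan bead status for a stalled pipeline
}
```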

A Real Day

Here's what actually happened on a Wednesday in late February 2026.

9:14 AM - A GitHub user named ericinfins opens Issue #2: they can't connect Beadbox to their remote Dolt server. The app only supports local connections. Owner sees it and flags it for super.

9:30 AM - Super dispatches the work. Arch designs a connection auth flow (TLS toggle, username/password fields, environment variable passing). PM writes the spec with acceptance criteria. Eng picks it up and starts implementing.

Meanwhile, in parallel:

PM files two bugs discovered during release testing. One is cosmetic: the header badge shows "v0.10.0-rc.7" instead of "v0.10.0" on final builds. The other is platform-specific: the screenshot automation tool returns a blank strip on ARM64 Macs because Apple Silicon renders Tauri's WebView through Metal compositing, which leaves the capture's backing store empty.

Ops root-causes the screenshot bug. The fix is elegant: after capture, check if the image height is suspiciously small (under 50px for a window that should be 800px tall), and fall back to coordinate-based screen capture instead.
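
Here's a hedged sketch of that detection-plus-fallback, assuming macOS's sips and screencapture utilities. app_screenshot and the 1280x800 window geometry are hypothetical stand-ins for the real tool; the height check mirrors the heuristic in the fix.

```shell
suspicious_height() {          # suspicious_height ACTUAL_PX EXPECTED_PX
  [ "$1" -lt 50 ] && [ "$2" -ge 400 ]
}

capture_window() {             # capture_window OUT.png
  out=$1
  app_screenshot "$out"                                   # hypothetical in-app capture
  h=$(sips -g pixelHeight "$out" | awk '/pixelHeight/ {print $2}')
  if suspicious_height "$h" 800; then
    screencapture -R 0,0,1280,800 "$out"                  # coordinate-based fallback
  fi
}
```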

Growth pulls PostHog data and runs an IP correlation analysis. The finding: Reddit ads have generated 96 clicks and zero attributable retained users. GitHub README traffic converts at 15.8%. This very article exists because of that analysis.

Eng1, unblocked by arch's Activity Dashboard design, starts building cross-filter state management and utility functions. 687 tests passing.

QA1 validates the header badge fix: spins up a test server, uses browser automation to verify the badge renders correctly, checks that 665 unit tests pass, marks PASS.

2:45 PM - Shipper merges the release candidate PR, pushes the v0.10.0 tag, and triggers the promote workflow. CI builds artifacts for all 5 platforms (macOS ARM, macOS Intel, Linux AppImage, Linux .deb, Windows .exe). Shipper verifies each artifact, updates release notes on both repos, redeploys the website, and updates the Homebrew cask.

3:12 PM - Owner replies on GitHub Issue #2:

Good news: v0.10.0 just shipped with full Dolt server auth support. Update and you should be unblocked.

Bug reported in the morning. Fix shipped by afternoon. And while that was happening, the next feature was already being designed, a different bug was being root-caused, analytics were being analyzed, and QA was independently verifying a separate fix.

That's not because 13 agents are fast. It's because 13 agents are parallel.

What Goes Wrong

This is the part most "look at my AI setup" posts leave out.

Rate limits hit at high concurrency. When 13 agents are all running on the same API account, you burn through tokens fast. On this particular day, super, eng1, and eng2 all hit the rate limit ceiling simultaneously. Everyone stops. You wait. It's the AI equivalent of everyone in the office trying to use the printer at the same time, except the printer costs money per page and there's a page-per-minute cap.

QA bounces work back. This is by design, but it adds cycles. QA rejected a build because the engineer's "DONE" comment didn't include verification steps. The fix worked, but QA couldn't confirm it without reading source code. Back to eng, rewrite the completion comment, back to QA, re-verify. Twenty minutes for what should have been five. The protocol creates friction, but the friction is load-bearing. Every time I've shortcut QA, something broke in production.
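
For the curious, here's the shape of a DONE comment QA can act on without reading source. done_comment just builds the string; the specific verification steps are illustrative, not a required format.

```shell
done_comment() {               # done_comment SHA
  printf 'DONE: pushed %s. Verify: 1) run the unit suite (all pass) 2) launch the app 3) header badge reads v0.10.0' "$1"
}
# bd comments add bb-viet --author eng2 "$(done_comment abc123)"
```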

Context windows fill up. Agents accumulate context over a session. Super has a protocol to send a "save your work" directive at 65% context usage. If you miss the window, the agent loses track of what it was doing.
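
The 65% rule is simple enough to sketch. over_context_budget is the pure check; the gn message follows the format used earlier, and how super learns the usage percentage is left out here.

```shell
over_context_budget() {        # over_context_budget USED_PCT THRESHOLD_PCT
  [ "$1" -ge "$2" ]
}

checkpoint_if_needed() {       # checkpoint_if_needed AGENT USED_PCT
  if over_context_budget "$2" 65; then
    gn -c -w "$1" "[from super] Context at $2%. Save your work: comment progress on your bead now."
  fi
}
```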

Agents get stuck. Sometimes an agent hits an error loop and just keeps retrying the same failing command. Super's patrol loop catches this, but only if you're checking frequently enough. I've lost 30 minutes to an agent that was politely failing in silence.

The coordination overhead is real. CLAUDE.md files, dispatch protocols, patrol loops, bead comments, completion reports. For a two-agent setup, this is overkill. For 13 agents, it's the minimum viable structure. There's a crossover point around 5 agents where informal coordination stops working and you need explicit protocols or you start losing track of what's happening.

What I've Learned

Specialization beats generalization. 13 focused agents outperform 3 "full-stack" ones. When qa1 only validates and never writes code, it catches things eng missed every single time. When arch only designs and never implements, the designs are cleaner because there's no temptation to shortcut the spec to make implementation easier.

Independent QA is non-negotiable. QA has its own repo clone. It tests the pushed code, not the working tree. It doesn't trust the engineer's self-report. This sounds slow. It catches bugs on every release.

You need visibility or the fleet drifts. At 5+ agents, you can't track state by switching between tmux panes and running bd list in your head. You need a dashboard that shows you the dependency tree, which agents are working on what, and which beads are blocked. This is the problem I built Beadbox to solve.

The recursive loop matters. The agents build Beadbox. Beadbox monitors the agents. When the agents produce a bug in Beadbox, the fleet catches it through the same QA process that caught every other bug. The tool improves because the team that uses it most is the team that builds it. I'm aware this is either brilliant or the most elaborate Rube Goldberg machine ever constructed. The shipped features suggest the former. My token bill suggests the latter.

The Stack

If you want to try this yourself, here's what you need:

  • beads: Open-source Git-native issue tracker. This is the coordination backbone. Every agent reads and writes to it.
  • Claude Code: The agent runtime. Each agent is a Claude Code session in a tmux pane with its own CLAUDE.md identity file.
  • tmux + gn/gp/ga: Terminal multiplexer for running agents side by side. The messaging tools let agents communicate without shared memory.
  • Beadbox: Real-time visual dashboard that shows you what the fleet is doing. This is what you're reading about.

You don't need all 13 agents to start. Two engineers and a QA agent, coordinated through beads, will change how you think about what a single developer can ship.

What's Next

The biggest gap in the current setup is answering three questions at a glance: which agents are active, idle, or stuck? Where is work piling up in the pipeline? And what just happened, filtered by the agent or stage I care about?

Right now that takes a patrol loop and a lot of gp commands. So we're building a coordination dashboard directly into Beadbox: an agent status strip across the top, a pipeline flow showing where beads are accumulating, and a cross-filtered event feed where clicking an agent or pipeline stage filters everything else to match. All three layers share the same real-time data source. All three update live.

[Screenshot: Activity Dashboard preview]

The 13 agents are building it right now. I'll write about it when it ships.

Try it yourself

Start with beads for the coordination layer. Add Beadbox when you need visual oversight.

Free while in beta. No account required. Your data stays local.