The Spec-First Claude Code Development Workflow

There is a widening gap between developers who get reliable output from Claude Code and developers who spend half their day undoing what the agent just built. The difference is not talent, experience, or some secret prompt engineering trick. It is a methodology question. The developers shipping production software with AI agents have converged on a pattern, whether they call it that or not: define what you want before the agent starts writing code.

This article gives that pattern a name. Spec-first development is a methodology for AI-assisted software engineering. Not a vague "best practice." A structured, repeatable lifecycle with defined phases, clear checkpoints, and concrete artifacts at every step. If you have been searching for a way to make Claude Code output predictable enough to bet your release schedule on, this is the framework.

The Vibe Coding Ceiling

"Vibe coding" entered the vocabulary in early 2025. The pitch: describe what you want in natural language, let the AI write it, iterate until it looks right. For prototypes, weekend projects, and one-off scripts, vibe coding works. You get something functional fast, and if it breaks later, the stakes are low.

Production software operates under different constraints. The code must integrate with an existing codebase, satisfy specific requirements, and survive contact with other people who will maintain it. When vibe coding meets these constraints, the failure modes are predictable.

The first failure is drift. You describe a feature loosely, the agent implements its interpretation, you adjust, the agent reimplements its adjusted interpretation. Three iterations later, you have working code that satisfies none of your original requirements because each iteration shifted the target. You are converging on what the agent thinks you want, not on what you actually need.

The second failure is invisible decisions. Every gap in your description is a decision the agent makes silently. Database schema, error handling strategy, API shape, validation rules, library choices. You discover these decisions during code review, or worse, in production. The agent did not make bad decisions. It made uninstructed decisions, and you had no mechanism to catch them before they were baked into the implementation.

The third failure is review paralysis. A 600-line diff where the agent chose the architecture, the data model, the error codes, and the edge case handling is not reviewable in the traditional sense. You are not reviewing code against a spec. You are reverse-engineering the spec from the code, then deciding whether you agree with it. This takes longer than writing the spec would have.

Vibe coding hits a ceiling because it conflates two distinct activities: deciding what to build and building it. Spec-first development separates them.

Spec-First as a Methodology

Spec-first development is a four-phase lifecycle. Each phase produces a concrete artifact. Each transition has a clear gate condition. The methodology works with any AI coding agent, but the examples in this article use Claude Code because that is where the community is iterating fastest.

Phase 1: Brainstorm

You and the agent (or just you) explore the problem space. What are the constraints? What approaches exist? What are the tradeoffs? This is conversational. You are not committing to anything. You are mapping the territory.

The gate condition: you have a preferred approach and you can articulate why this approach over the alternatives.

Brainstorming with Claude Code is valuable because the agent has broad knowledge of patterns and libraries. The mistake is jumping from brainstorm directly to code. The brainstorm surfaces options. It does not choose among them. You do.

Phase 2: Spec

You write down the decision. This is the contract the agent will implement against. A spec is not a user story, not a Jira ticket, not a paragraph of prose. It is a structured document with:

  • Problem statement: what is broken or missing, in concrete terms
  • Proposed approach: the chosen solution from the brainstorm phase
  • Files affected: which files the agent should touch (and implicitly, which it should not)
  • Acceptance criteria: testable conditions that define "done"
  • Out of scope: what the agent should explicitly avoid

The acceptance criteria are the most important element. Each one must be a concrete action with an observable outcome. "Authentication should work" is not a criterion. "Submitting valid credentials returns a 200 with a session token; submitting invalid credentials returns a 401 with no token" is.

The out-of-scope section prevents gold-plating. Without it, agents will "improve" adjacent code, refactor files they noticed were messy, or add features that seem related. Every minute the agent spends on unrequested work is a minute you spend reviewing unrequested work.

The gate condition: someone who was not in the brainstorm could read this spec and build the right thing.
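Concretely, a spec that clears this bar can be short. Here is a hypothetical example for a small feature (the feature mirrors one discussed later in this article; file paths and details are illustrative, not prescriptive):

```markdown
## Problem
Filter selections are lost when a user switches workspaces, forcing
them to rebuild filters on every switch.

## Proposed approach
Persist filter state per workspace, keyed by workspace ID, and
restore it on workspace activation.

## Files affected
- src/state/filters.ts
- src/workspace/switcher.ts

## Acceptance criteria
- Setting filters in workspace A, switching to B, and switching back
  to A restores A's filters exactly.
- A workspace with no saved filter state loads the default filters.

## Out of scope
- Syncing filter state across devices.
- Refactoring the existing filter UI.
```

Every section is a sentence or two. The value is not in the length. It is in the fact that each decision is written down before the agent starts.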

Phase 3: Implement

The agent executes against the spec. Not against a conversation. Not against a memory of what you discussed. Against a concrete document with testable criteria.

Before writing any code, the agent produces a plan: a numbered list of changes it intends to make, which files it will modify, and how it will verify the result. This plan is a two-minute checkpoint. You read it, confirm it matches your intent, and green-light the implementation. Or you catch a misunderstanding and correct it. Either way, you have spent two minutes instead of twenty.

The plan-before-code pattern is not bureaucracy. It is the single highest-leverage intervention in the entire workflow. Most implementation mistakes are not coding errors. They are comprehension errors: the agent misunderstood the spec. Catching a comprehension error in a plan costs two minutes. Catching it in a 400-line diff costs twenty. Catching it in production costs a day.

The gate condition: the agent has posted a completion report with specific claims about what was built and how it was verified.

Phase 4: Verify

You or a QA process confirm the implementation against the spec. Not "does it look right?" but "does it satisfy each acceptance criterion?"

Verification is mechanical. You take each criterion from the spec, execute the test (run a command, open a browser, trigger an event), and record the result: pass or fail. Criteria that fail go back to Phase 3. The verification is documented alongside the implementation so that anyone who reads the task six months from now can see exactly what was tested.

The gate condition: every acceptance criterion has a recorded pass/fail result.
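Mechanically, producing that record is just a loop over the criteria. A minimal sketch in Python, with the checks stubbed out as placeholders (in a real workflow each check would run a command, hit an endpoint, or drive a browser; the criteria here reuse the authentication example from the spec phase):

```python
# Walk the spec's acceptance criteria, run each check, and record a
# pass/fail result for every one. Check functions are hypothetical
# stand-ins for real verification steps.

def check_valid_login():
    # In practice: POST valid credentials, assert 200 and a session token.
    return True  # stand-in result

def check_invalid_login():
    # In practice: POST invalid credentials, assert 401 and no token.
    return True  # stand-in result

CRITERIA = [
    ("Valid credentials return 200 with a session token", check_valid_login),
    ("Invalid credentials return 401 with no token", check_invalid_login),
]

def verify():
    """Return a (criterion, result) record for every acceptance criterion."""
    record = []
    for description, check in CRITERIA:
        record.append((description, "pass" if check() else "fail"))
    return record

for criterion, result in verify():
    print(f"{result.upper():4}  {criterion}")
```

The point of the sketch is the shape of the artifact: one result per criterion, no criterion skipped, the whole record attached to the task.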

That is the complete lifecycle. Four phases, four artifacts (approach rationale, spec, implementation plan, verification record), four gate conditions. The phases are sequential but lightweight. For a medium-sized feature, phases 1 and 2 take 15-20 minutes. Phase 3 takes whatever the implementation takes. Phase 4 takes 5-10 minutes.

Why This Matters More with Agents Than Humans

Every argument for writing specs predates AI. "Write requirements before code" has been advice since before most of us were born. So why frame this as something specific to AI-assisted development?

Because agents change the cost function.

A human developer who receives a vague requirement will stop and ask questions. "Did you mean password auth or SSO?" "Should this work on mobile?" "What happens when the token expires?" Every question is a mini-checkpoint that nudges the implementation toward the correct target. The cost of a vague spec with a human developer is a few Slack threads and maybe an afternoon of rework.

An agent that receives a vague requirement will not stop. It will make every ambiguous decision silently, commit to an approach, and present you with a finished implementation. The cost of a vague spec with an agent is a finished implementation that may be entirely wrong, plus the time you spend discovering it is wrong, plus the time you spend redoing it.

The asymmetry is stark. Agents are faster at execution and worse at judgment than human developers. Every ambiguity in the spec is a judgment call, and every judgment call the agent makes without guidance is a coin flip on whether the result matches your intent. A spec eliminates coin flips.

There is a second, subtler reason. Agents do not push back. A senior engineer who receives a bad spec will say "this doesn't make sense because X." An agent will implement a bad spec faithfully and produce faithfully wrong output. Spec-first development forces you to pressure-test your own thinking before handing it to an entity that will execute it without question. The spec is not just for the agent. It is for you.

The Plan-Before-Code Checkpoint

If you take one practice from this article and ignore the rest, take this one.

Before the agent writes any code, require it to post an implementation plan. Not code. Not a diff. A structured outline of what it intends to do.

A plan looks like this: numbered steps in execution order, files to be modified, logic changes in each file, and the verification approach. The agent produces this in about thirty seconds. You read it in about two minutes. In those two minutes, you can catch:

  • Scope violations: the agent plans to modify files not listed in the spec
  • Architectural mismatches: the agent chose an approach that conflicts with existing patterns
  • Missing steps: the plan does not address an acceptance criterion
  • Overengineering: the agent plans to build abstractions that are not warranted

The 2-minute plan review replaces the 20-minute diff review where you discover these problems after they are already built. It is the cheapest quality gate in software engineering.
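The first of those catches, the scope check, is mechanical enough to express in code. A sketch, assuming the spec's files-affected list and the plan's file list are both available as simple collections (the function name and paths are hypothetical, for illustration):

```python
# Flag any file the plan touches that the spec's "files affected"
# section does not authorize. Paths are illustrative.

def scope_violations(spec_files, planned_files):
    """Return the files the agent plans to modify that the spec omits."""
    return set(planned_files) - set(spec_files)

spec_files = {"src/auth/login.py", "src/auth/session.py"}
planned_files = {"src/auth/login.py", "src/auth/session.py", "src/db/schema.py"}

violations = scope_violations(spec_files, planned_files)
if violations:
    print("Plan exceeds spec scope:", sorted(violations))
# → Plan exceeds spec scope: ['src/db/schema.py']
```

The other three catches (architecture, missing steps, overengineering) still require human judgment, which is exactly why the plan review exists as a human checkpoint rather than a lint rule.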

I wrote a detailed walkthrough of the plan-before-code pattern in Spec-Driven Development with Claude Code, including spec templates and completion report formats. This article focuses on why the pattern works; that one focuses on how to implement it.

Verification as a First-Class Step

The most underinvested phase in most developers' workflows is verification. The agent says "done." The developer glances at the diff. The merge happens. The bug surfaces two days later when a user hits edge case number three from the acceptance criteria.

Spec-first development treats verification as a formal step with its own artifacts. The completion report maps each acceptance criterion to a concrete check:

  • Criterion: "Switching workspaces restores the saved filter state."
  • Check: Open the app, set filters in workspace A, switch to workspace B, switch back to workspace A, observe that filters are restored.
  • Result: Pass.

This is not overhead. This is the step that determines whether the implementation actually satisfies the spec. Without it, the spec is a wishlist and the acceptance criteria are aspirational.

The verification record also solves a downstream problem: code review. When a reviewer opens the pull request, they read the spec, read the verification record, and review the diff with full context. Review time drops because the reviewer is confirming a verified claim, not conducting an investigation.

When you run multiple agents in parallel, each implementing a different spec, verification discipline is the difference between a controlled pipeline and a pile of code that "probably works." Each spec has criteria. Each implementation has a completion report. Each completion report maps criteria to checks. Nothing ships without recorded verification.

Objections and Honest Tradeoffs

Spec-first development is not free. The objections are real and worth addressing head-on.

"Writing specs slows me down." In isolation, yes. Writing a spec for a feature takes 15-20 minutes. But you recover that time (and more) in the implementation and review phases. An agent with a clear spec produces a correct implementation more often than an agent with a vague prompt. Fewer iterations, fewer rewrites, shorter reviews. The net effect for features of any substance is faster delivery, not slower.

For trivial changes (rename a variable, fix a typo, bump a version), specs are unnecessary overhead. Spec-first is for work where the implementation requires decisions. If the change is mechanical and unambiguous, skip the spec.

"My agent is good enough without specs." For some tasks, probably true. Claude Code is remarkably capable at inferring intent from brief descriptions. The question is not whether the agent can produce good output from vague instructions. It is whether it does so reliably. If you are comfortable with occasional rework and unpredictable review times, vibe coding may be sufficient for your use case. Spec-first pays off when consistency and predictability matter: when the feature is complex, when the code ships to production, when someone else will maintain it.

"Specs get stale." Valid concern. A spec written during brainstorming might not survive contact with reality. The fix is not to skip specs. It is to update the spec when the plan reveals new information. If the agent's plan shows that the approach in the spec will not work, revise the spec before proceeding. The spec is a living document for the duration of the implementation. It becomes a historical record after verification.

"This is just waterfall." No. Waterfall's failure was big specs for big projects with long feedback cycles. Spec-first development operates at the task level: one spec per feature or fix, written in 15-20 minutes, implemented in hours, verified the same day. The feedback loop is tight. The investment per spec is small. If the spec is wrong, you find out during the plan review, not six months later.

Tooling the Spec-First Lifecycle

The methodology works with any task system: GitHub Issues, Linear, Notion, plain text files. What matters is that the spec, plan, implementation notes, and verification results all live in one place, attached to one task.

If you are looking for a system designed for this workflow, beads is an open-source, Git-native issue tracker that holds the full lifecycle. Each "bead" carries a description (your spec), a comment thread (plans and completion reports), a status (open, in_progress, ready_for_qa, done), and metadata like dependencies and priorities. The bd CLI operates from the terminal, which means agents can read specs, post plans, and report completions without leaving their working environment.

```shell
# Create the task with the spec as its description
bd create --title "Persist filter state across workspaces" \
  --description "## Problem ..." --type feature --priority p2

# Claim the task and post the implementation plan
bd update bb-a1b2 --claim --actor eng1
bd comments add bb-a1b2 --author eng1 "PLAN: ..."

# After implementation: report completion and hand off to QA
bd comments add bb-a1b2 --author eng1 "DONE: ... Commit: a1b2c3d"
bd update bb-a1b2 --status ready_for_qa
```

The entire lifecycle happens in the CLI. Six months later, bd show bb-a1b2 returns the full history of what was specified, planned, built, and verified.

When you are running one agent through this lifecycle, the CLI is sufficient. When you are running five or ten in parallel, each at a different stage of the spec-implement-verify pipeline, you need to see the state of the pipeline at a glance. Beadbox is a real-time dashboard that shows which specs are open, which have plans waiting for review, which are in progress, which are blocked, and which are ready for verification. It monitors the same beads database the agents write to, updating live as statuses change.

You do not need Beadbox to practice spec-first development. The methodology is tool-agnostic. But when parallel workstreams turn your pipeline into a queue of tasks you cannot track from memory alone, the visual layer changes how fast you can review, unblock, and ship.

The Broader Shift

Spec-first development is not a reaction to AI coding agents being bad. It is a recognition that they are good at the wrong things without guidance. Agents are extraordinarily capable executors. They write correct syntax, follow patterns, handle boilerplate, and produce volume that no human can match. What they lack is the context to make good decisions about what to build. That context comes from you, and the spec is the vehicle.

The developers who will thrive in AI-assisted engineering are not the ones who write the best prompts. They are the ones who write the best specs. Prompts are ephemeral. Specs are durable. Prompts optimize for a single interaction. Specs optimize for a lifecycle: brainstorm, define, implement, verify.

This is not a temporary workaround until agents get smarter. Even as models improve, the fundamental asymmetry remains: the human knows what the business needs; the agent knows how to write code. A spec bridges the two. Better models will execute specs faster, but the need for the spec does not go away. It gets more important as you scale, because more agents running against vague instructions produce more divergent output.

If you have been running Claude Code agents and finding the results inconsistent, or spending too much time on review, or struggling to coordinate parallel workstreams, try this: before the next feature, take 15 minutes to write a spec with testable acceptance criteria, require the agent to post a plan before coding, and verify the output criterion by criterion. One cycle will show you the difference.

If you're building workflows like this, star Beadbox on GitHub.

Try it yourself

Start with beads for the coordination layer. Add Beadbox when you need visual oversight.

Free while in beta. No account required. Your data stays local.
