Spec-Driven Development with Claude Code
The developer who types "add user authentication" into Claude Code gets a different result every time. Maybe it's JWT. Maybe it's session cookies. Maybe it's a full OAuth2 flow with refresh tokens and PKCE. The agent doesn't know what you want because you haven't told it. You told it a direction, not a destination.

The developers I see getting consistent, shippable output from Claude Code share one habit: they write a spec before they hand work to the agent. Not a novel. Not a Jira ticket with three sentences of context. A concrete document that defines what "done" looks like before anyone writes a line of code.

This isn't new wisdom. Spec-first development predates AI by decades. But with agents, the cost of skipping the spec is higher and the payoff of writing one is larger. A human developer can stop mid-implementation and ask "wait, did you mean password auth or SSO?" An agent will pick one silently and keep going. By the time you notice, it's built the wrong thing, and you've spent 20 minutes reviewing code that needs to be thrown away.

This article walks through the spec-driven lifecycle I use with Claude Code every day: how to write specs that agents can execute against, the plan-before-code checkpoint that catches misunderstandings early, and a verification protocol that's more rigorous than "it compiles."

Why "Just Build It" Fails with Agents

Let's be specific about the failure mode. When you give Claude Code a vague instruction, three things go wrong:

Silent assumptions. The agent fills every gap in your instructions with its own assumptions. Sometimes those assumptions are reasonable. Sometimes they're not. You won't know which category you're in until you read the output. With vague instructions, you end up spending more time scrutinizing the output than you would have spent writing a spec.

Non-reproducible results. Run the same vague prompt twice and you get two different implementations. Not just different variable names or formatting. Different architectural decisions. Different libraries. Different error handling strategies. If you can't reproduce the output, you can't build a reliable process around it.

Review becomes the bottleneck. When the agent makes all the decisions, you have to verify all the decisions. A 400-line diff where you understand every choice takes 5 minutes to review. A 400-line diff where the agent chose the database schema, the API shape, the error codes, and the validation logic takes 30 minutes because you're reverse-engineering the spec from the implementation.

The fix isn't better prompts. It's front-loading the decisions that matter into a document the agent can execute against.

The Spec-Driven Lifecycle

The workflow has five phases. Each one has a clear entry condition and a clear exit condition.

Phase 1: Brainstorm. You explore the problem space. What are the constraints? What approaches exist? What have you tried before? This is where you think out loud, either on your own or with Claude Code in conversational mode. The exit condition is: you have a preferred approach and understand the tradeoffs.

Phase 2: Review. You pressure-test the approach. What could go wrong? What edge cases exist? Does this conflict with anything already in the codebase? If you're working with multiple agents, this is where an architecture agent or a second opinion is valuable. The exit condition is: you're confident the approach is sound.

Phase 3: Spec. You write down what you decided. Problem statement, proposed approach, files to modify, acceptance criteria that can be mechanically verified, and a test plan. This is the contract. The exit condition is: someone (human or agent) could read this spec and know exactly what to build and how to verify it.

Phase 4: Implement. The agent executes against the spec. Not against a vague idea. Against a concrete document with testable criteria. The exit condition is: the agent claims it's done and has posted verification evidence.

Phase 5: Verify. You (or a QA agent) confirm the implementation matches the spec. Not "does it look right" but "does it satisfy each acceptance criterion." The exit condition is: every criterion is checked, and the ones that fail get sent back to Phase 4.

The key insight: phases 1-3 are cheap. They take 10-20 minutes for a medium-sized feature. Phase 4 takes however long the implementation takes. Phase 5 takes 5-10 minutes. Skipping phases 1-3 doesn't save 10-20 minutes. It costs you the time to review, debug, and redo work that went in the wrong direction.

What a Good Agent Spec Looks Like

Here's a real spec template. Not a user story. Not a product requirements doc. A working document that tells an agent exactly what to build.

## Problem
The filter bar resets when switching workspaces. Users lose their
filter state and have to re-apply filters every time they switch.

## Approach
Persist filter state per-workspace in localStorage. Key the stored
state by workspace database path so filters don't bleed across
workspaces.

## Files to Modify
- lib/local-storage.ts: Add getWorkspaceFilters / setWorkspaceFilters
- components/filter-bar.tsx: Read initial state from localStorage,
  write on every change
- hooks/use-workspace.ts: Trigger filter restore on workspace switch

## Acceptance Criteria
1. Select workspace A, set filters to status=open + type=bug
2. Switch to workspace B. Filters reset to defaults.
3. Switch back to workspace A. Filters restore to status=open + type=bug.
4. Close the browser tab, reopen. Filters for the active workspace
   are still applied.
5. bd list --status=open --type=bug output matches the filtered table.

## Out of Scope
- Server-side filter persistence
- Filter presets / saved filter combinations
- URL-based filter state (query params)

## Test Plan
- Unit test: getWorkspaceFilters returns stored value for matching
  workspace path
- Unit test: setWorkspaceFilters writes correct key format
- Manual test: steps 1-5 from acceptance criteria above

Notice what this spec does and doesn't contain. It doesn't explain how localStorage works. Claude Code knows that. It doesn't justify why we chose localStorage over URL params. That happened in the brainstorm phase. It does list every file the agent should touch, which means if the agent starts modifying files outside this list, that's a red flag. It does include an out-of-scope section, which prevents the agent from gold-plating.

The acceptance criteria are the most important part. Each one is a concrete action with an observable outcome. Not "filters should persist." That's ambiguous. "Switch to workspace A, verify filters are status=open + type=bug" is testable. An agent can execute that. A QA reviewer can verify it.

The Plan-Before-Code Pattern

Here's a protocol that catches the majority of implementation mistakes before they become code: require the agent to post its implementation plan before writing anything.

In practice, this looks like a structured comment on the task. Before the agent opens any file, it writes out exactly what it intends to do.

PLAN: Persist filter state per-workspace using localStorage.

1. Add workspace-scoped storage helpers to lib/local-storage.ts:
   - getWorkspaceFilters(dbPath: string): FilterState | null
   - setWorkspaceFilters(dbPath: string, filters: FilterState): void
   - Key format: "beadbox:filters:{dbPath}"

2. Update components/filter-bar.tsx:
   - On mount, read stored filters for current workspace
   - On every filter change, write to localStorage
   - Accept databasePath as prop (currently not passed)

3. Update hooks/use-workspace.ts:
   - On workspace switch, emit event that filter-bar listens for
   - Filter-bar reads new workspace's stored filters on event

4. Threading databasePath:
   - filter-bar needs databasePath from the workspace context
   - Will add it to the WorkspaceProvider context value

Files: lib/local-storage.ts, components/filter-bar.tsx,
       hooks/use-workspace.ts, components/workspace-provider.tsx

Test: Unit tests for storage helpers, manual test of all 5
acceptance criteria.

This takes the agent about 30 seconds to produce. It takes you about 2 minutes to read. And in those 2 minutes, you can catch problems that would take 20 minutes to fix after implementation:

  • Is the agent touching files outside the spec? (Adding workspace-provider.tsx wasn't in the spec. Is that OK or is it scope creep?)
  • Does the approach make sense? (Using an event emitter for workspace switches might be overengineered. A simpler prop change might work.)
  • Are there missing steps? (What about cleaning up stale localStorage entries when a workspace is removed?)

The plan is a checkpoint. If it looks right, tell the agent to proceed. If it looks wrong, correct the plan. Either way, you've spent 2 minutes instead of 20.
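Step 1 of this plan is concrete enough to sketch. Here's roughly what those storage helpers could look like in TypeScript. This is a sketch, not the actual implementation: the FilterState shape and the injectable store parameter are my assumptions (in the real component you'd pass window.localStorage).

```typescript
// Hypothetical sketch of the workspace-scoped storage helpers from the plan.
// The key format follows the plan comment. The store is injected rather than
// hard-coded to window.localStorage so the helpers stay testable outside a
// browser -- an assumption, not necessarily how the real code is wired.

interface FilterState {
  status?: string;
  type?: string;
}

interface StoreLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const keyFor = (dbPath: string): string => `beadbox:filters:${dbPath}`;

function getWorkspaceFilters(
  dbPath: string,
  store: StoreLike,
): FilterState | null {
  const raw = store.getItem(keyFor(dbPath));
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as FilterState;
  } catch {
    // Corrupted entry: fall back to defaults rather than crashing the UI.
    return null;
  }
}

function setWorkspaceFilters(
  dbPath: string,
  filters: FilterState,
  store: StoreLike,
): void {
  store.setItem(keyFor(dbPath), JSON.stringify(filters));
}
```

Making the store injectable is what lets the spec's unit tests (storage read/write/key format) run without a browser, which is exactly what the test plan asks for.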

Verification Is Not "It Compiles"

The most common failure mode I see with Claude Code isn't that agents write bad code. It's that nobody verified the output against the spec.

The agent says "DONE." The developer glances at the diff, sees it looks reasonable, and merges. Two days later someone discovers the feature doesn't handle edge case #3 from the acceptance criteria. Now you're debugging in production instead of catching it during the 5-minute verification step.

Verification means mechanically checking each acceptance criterion. Not "it seems to work." Not "the tests pass" (tests can be wrong or incomplete). Each criterion from the spec gets a concrete check.

Here's what a proper completion report looks like:

DONE: Filter bar now persists selected filters across workspace
switches using per-workspace localStorage.

Changes:
- lib/local-storage.ts: Added getWorkspaceFilters/setWorkspaceFilters
  with key format "beadbox:filters:{dbPath}"
- components/filter-bar.tsx: Reads stored filters on mount, writes
  on change. Accepts databasePath prop.
- hooks/use-workspace.ts: Triggers filter restore on workspace switch
  via callback prop.

QA Verification:
1. Open http://localhost:41420, select workspace A
2. Set filters to status=open, type=bug
3. Switch to workspace B via header dropdown
4. Switch back to workspace A
5. Verify filters are still status=open, type=bug
   -> Confirmed: filters restore correctly
6. Close tab, reopen. Filters persist.
   -> Confirmed: localStorage key present, filters applied on mount
7. Run: bd list --status=open --type=bug
   -> Output matches filtered table contents (14 beads)

Acceptance criteria:
- [x] Filters persist across workspace switches (steps 2-5)
- [x] Filters survive browser restart (step 6)
- [x] Filtered view matches bd CLI output (step 7)
- [x] Filters don't bleed between workspaces (step 3: workspace B
      shows defaults)

Unit tests: 3 added (storage read/write/key format). All passing.

Commit: a1b2c3d

The difference between this and "DONE: Fixed the filter bar" is the difference between a 5-minute QA pass and a 30-minute investigation. Every claim in the DONE comment is backed by a specific check. Every acceptance criterion is mapped to a verification step. The reviewer knows exactly what was built, how it was verified, and where to look if something seems off.
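The mapping from acceptance criteria to concrete checks can itself be made mechanical. A minimal sketch follows, with every name hypothetical (nothing here comes from beads or Beadbox); the point is only that each criterion gets a function that returns pass or fail, so "verified" means the function ran, not that someone eyeballed the diff.

```typescript
// Hypothetical sketch: pair each acceptance criterion with a concrete check
// so verification is a protocol, not a vibe. The stored map stands in for
// whatever state the real checks would inspect.

type Check = { criterion: string; passed: () => boolean };

function verify(checks: Check[]): { criterion: string; ok: boolean }[] {
  return checks.map((c) => ({ criterion: c.criterion, ok: c.passed() }));
}

// Example run against an in-memory stand-in for persisted filter state.
const stored = new Map<string, string>([
  ["beadbox:filters:/ws/a.db", JSON.stringify({ status: "open", type: "bug" })],
]);

const results = verify([
  {
    criterion: "Filters persist across workspace switches",
    passed: () => stored.has("beadbox:filters:/ws/a.db"),
  },
  {
    criterion: "Filters don't bleed between workspaces",
    passed: () => !stored.has("beadbox:filters:/ws/b.db"),
  },
]);

for (const r of results) {
  console.log(`${r.ok ? "[x]" : "[ ]"} ${r.criterion}`);
}
```

The checkbox output mirrors the completion report format above, which makes it easy to paste the results straight into the DONE comment.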

Beads as a Spec Container

The lifecycle I just described needs a place to live. The spec, the plan comment, the implementation, the completion report, the verification results. All of it, attached to one task, in one place.

This is the problem beads solves. Beads is an open-source, local-first issue tracker designed for exactly this workflow. Each "bead" is a task with a description (your spec), a comment thread (plans and completion reports), a status (open, in_progress, ready_for_qa, closed), and metadata like priority, dependencies, and assignments.

Here's what the spec-driven lifecycle looks like in practice with the bd CLI:

Create the bead with your spec:

bd create --title "Persist filter state across workspace switches" \
  --description "## Problem
The filter bar resets when switching workspaces...

## Acceptance Criteria
1. Select workspace A, set filters...
2. Switch to workspace B..." \
  --type feature --priority p2

Agent claims the work and posts a plan:

bd update bb-a1b2 --claim --actor eng1
bd comments add bb-a1b2 --author eng1 "PLAN: Persist filter state
per-workspace using localStorage.

1. Add workspace-scoped storage helpers...
2. Update filter-bar component...
3. ..."

Agent completes the work and posts a done report:

bd comments add bb-a1b2 --author eng1 "DONE: Filter bar now persists
selected filters across workspace switches.

QA Verification:
1. Open http://localhost:41420...

Acceptance criteria:
- [x] Filters persist across workspace switches
- [x] Filters survive browser restart
...

Commit: a1b2c3d"

bd update bb-a1b2 --status ready_for_qa

QA picks it up and verifies:

bd show bb-a1b2  # Read the spec and the DONE comment
# Run the verification steps
bd comments add bb-a1b2 --author qa1 "QA PASS: All 5 acceptance
criteria verified. Filters persist, restore, and match bd CLI output."

The entire lifecycle is in the bead. The spec is the description. The plan is a comment. The completion report is a comment. The QA result is a comment. Six months from now, if someone asks "how does filter persistence work and why did we choose localStorage over URL params?", the answer is in the bead's comment thread.

When you're running one spec through this pipeline, a terminal and bd show are enough. But this workflow really shows its value when you're running multiple specs in parallel.

Scaling Spec-Driven Development

Picture the real scenario: you have three Claude Code agents, each implementing a different spec. Agent A is building a filter persistence feature. Agent B is adding a new API endpoint for workspace stats. Agent C is fixing a WebSocket reconnection bug. Each one is somewhere in the spec-driven lifecycle.

In the terminal, you'd need to run bd list to see all active beads, then bd show on each one to check its status and read the latest comment. That's four commands to get a snapshot of three parallel workstreams. Multiply that by five or ten agents and you're spending more time checking status than reviewing plans.

This is where Beadbox fits. Beadbox is a real-time dashboard that shows you the state of every bead in your workspace. Which specs are open and waiting for an agent. Which have plans posted that need your review. Which are in progress. Which are ready for QA verification. All updating live as agents write comments and change statuses through the bd CLI.

You don't need Beadbox to do spec-driven development. The CLI handles the entire lifecycle. But when you're running multiple spec-driven workflows in parallel, being able to see the pipeline at a glance rather than polling each agent's status individually changes how fast you can review plans, unblock agents, and catch stalled work.

Beadbox is free during the beta, and the beads CLI it runs on is open-source.

What Stays True Regardless of Tooling

Whether you use beads, GitHub Issues, Linear, or plain text files, the spec-driven pattern works because it addresses a fundamental asymmetry in how agents operate: they're fast at execution and bad at judgment. Every minute you spend writing a clear spec saves multiple minutes of reviewing incorrect output, debugging silent assumptions, and redoing work that went sideways.

The principles:

  1. Define "done" before "start." Acceptance criteria are not optional. They're the only thing that makes verification possible.

  2. Plans are checkpoints, not bureaucracy. A 30-second plan comment saves 20-minute rewrites. Review the plan, not the code.

  3. Verification is a protocol, not a vibe. "Looks good to me" is not verification. Mapping each acceptance criterion to a concrete check is verification.

  4. The spec is the single source of truth. When the implementation and the spec disagree, the implementation is wrong. This rule exists because agents won't question a bad plan. They'll execute it faithfully and produce faithfully wrong output.

  5. Scope boundaries prevent drift. An explicit list of files to modify and an out-of-scope section keep the agent from "improving" things you didn't ask it to improve.

The investment is small: 10-20 minutes writing a spec for a feature that takes an hour to implement. The return is large: consistent results, reviewable output, and a permanent record of what was built and why.

If you're building workflows like this, star Beadbox on GitHub.

Try it yourself

Start with beads for the coordination layer. Add Beadbox when you need visual oversight.

Free while in beta. No account required. Your data stays local.
