
AI Code Review Is Broken. Here's What's Missing.


Let's be honest about something most of us aren't saying out loud: we barely review AI-generated code.

The diff is 600 lines. The agent touched 14 files. You open the pull request, scroll through the changes, squint at a few functions, and merge. Maybe you run the tests first. Maybe you don't. The code looks reasonable. The agent said it's done. Ship it.

This isn't laziness. It's a structural problem. Traditional code review was designed for a world where a human wrote the code and could explain their reasoning when asked. Where diffs were 50-200 lines because that's how much a person writes in a focused session. Where the PR description said "I chose approach X because of Y" and you could trust that context.

AI agents don't work that way. Claude Code can produce 500 lines of working code in two minutes. The PR description is often just "implement feature X." The diff tells you what changed but nothing about why. No record of which alternatives the agent considered. No explanation of which tradeoffs it made. No evidence that it actually tested anything. You're reviewing the output of a black box, and the review tool only shows you the output.

This article breaks down why diff-based review fails for agent output, what the actual missing layer is, and a concrete pattern that makes AI-generated code reviewable without tripling your review time.

The Review Gap

Developers are honest about this in private. In community threads, the pattern repeats: "I mostly skim agent diffs." "I check the tests pass and merge." "If it looks roughly right, I approve it."

This is rational behavior given the constraints. When a human colleague submits a PR, you have context: you know what they were working on, you've seen the ticket, you might have discussed the approach at standup. The diff is supplementary. The real review happened through shared context.

With an agent, you have none of that. The agent claimed a task, went silent for three minutes, and produced a diff. The only context is whatever one-line description the agent left on the PR. Reviewing that diff from scratch, with no context about intent or reasoning, takes 5-10x longer than reviewing a human's PR of the same size. So people don't. They spot-check a few critical areas and approve.

The result is predictable. Bugs slip through that would have been caught with context. Architectural drift accumulates as agents make small decisions that compound. Code quality degrades in subtle ways: not broken, but not quite right either. Inconsistent error handling in one file. A database query that works but scales poorly. A new utility function that duplicates an existing one because the agent didn't know it existed.

None of these failures are visible in a diff review. They're only visible when you understand what the agent was trying to do and can compare its intent against its execution.

Why Diffs Aren't Enough

A diff is a record of textual changes. That's it. For human code review, diffs work because the reviewer can infer intent from pattern recognition and shared context. You see a colleague add a try/catch block and you know they're handling the error case from last week's bug report. You see them rename a function and you know they're following the naming convention the team agreed on.

With agent-generated code, you can't infer intent because you weren't part of the reasoning process. Here's what a 500-line agent diff actually tells you:

  • Which files were modified
  • What lines were added, changed, or removed
  • The syntactic structure of the new code

Here's what it doesn't tell you:

Why this approach was chosen. The agent might have considered three different implementations. It picked one. You don't know why. Maybe the one it picked is optimal. Maybe it's the first thing it tried and it worked well enough. You can't tell from the diff.

What alternatives were discarded. If the agent chose a polling strategy over WebSocket subscriptions, was that a deliberate architectural decision or an accident? A diff doesn't say.

Whether the implementation matches the spec. You'd need to open the spec in one window and the diff in another and manually cross-reference each acceptance criterion. Most people don't.

What was tested and how. The diff might include new test files. But did the agent run them? Did they pass? Do they cover the edge cases from the spec? You'd need to check out the branch and run them yourself to know.

Whether the agent stayed in scope. Maybe the task was "fix the login bug" and the agent also refactored the auth middleware, renamed two utility functions, and updated the config schema. All of those changes look fine in isolation. But they weren't asked for, they weren't spec'd, and they weren't tested against the original acceptance criteria.

This isn't a problem with any particular diff tool. GitHub's review interface, GitLab's merge requests, Gerrit, git diff in the terminal. They all show you the same thing: what changed. For agent output, what changed is the least important question. The important question is: does this change do what it was supposed to do, and nothing else?

The Missing Layer: Implementation Narrative

What reviewers actually need is the agent's reasoning trail. Not the diff. The story of the implementation: what the agent planned to do, what it actually did, and how it verified the result. Call it an implementation narrative.

A good implementation narrative answers five questions:

  1. What was the plan? Before writing code, what did the agent intend to do? Which files, which approach, which order of operations?
  2. What happened during implementation? Did the plan survive contact with the codebase? Were there surprises, pivots, or scope changes?
  3. What was the final result? Not the diff. A plain-language summary of what changed and why.
  4. How was it verified? Specific steps the agent took to confirm the implementation works. Not "tests pass" but "I ran acceptance criterion #3 by doing X and observed Y."
  5. What should the reviewer check? The agent's own recommendation for what deserves human attention. Maybe there's a design decision that could go either way, or a performance tradeoff worth a second opinion.
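One way to make that narrative enforceable rather than aspirational is to check it mechanically before a human ever reads it. Here's a minimal sketch in TypeScript; the section names and the `missingSections` function are illustrative assumptions, not part of any existing tool:

```typescript
// Sketch: verify an agent's done comment contains the sections a
// reviewer needs. The required section names are assumptions.
const REQUIRED_SECTIONS = [
  "DONE:",
  "Changes:",
  "QA Verification:",
  "Acceptance criteria:",
];

// Returns the list of required sections absent from the comment.
function missingSections(doneComment: string): string[] {
  return REQUIRED_SECTIONS.filter((s) => !doneComment.includes(s));
}

const report = `DONE: buffered messages delivered on reconnect.
Changes:
- hooks/use-websocket.ts: added message buffer
QA Verification:
1. killed WS server, restarted, verified delivery
Acceptance criteria:
- [x] no message loss`;

console.log(missingSections(report)); // []
console.log(missingSections("DONE: implemented the feature."));
// ["Changes:", "QA Verification:", "Acceptance criteria:"]
```

A check like this could run as a pre-review gate: if the report is missing sections, bounce it back to the agent before a human spends any time on it.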

None of this exists in a standard PR workflow. The PR description field is free text that nobody enforces. Agent PRs default to minimal descriptions because the agent was told to implement, not to document its reasoning.

The gap isn't tooling. It's process. The review infrastructure exists. What's missing is a structured record of agent intent that the reviewer can check against the diff.


The Plan-Comment-Done Pattern

Here's a pattern that closes the gap without adding significant overhead. It has three parts: the agent posts a plan comment before it writes code, it implements, and it posts a structured done report when it finishes.

Step 1: The Plan Comment

Before the agent opens any file, it writes out what it intends to do. Numbered steps, files it will touch, and the approach it will take.

PLAN: Fix WebSocket reconnection dropping messages during
server restart.

1. Add a message buffer to hooks/use-websocket.ts that queues
   outbound messages while the connection is in CONNECTING state
2. On successful reconnection, flush the buffer in order
3. Add a 30-second timeout: if reconnection hasn't succeeded,
   surface an error to the user via the toast system
4. Update the existing reconnection test to verify buffer
   behavior

Files: hooks/use-websocket.ts, components/connection-status.tsx

Test: Unit test for buffer queueing/flushing, manual test by
killing the WS server mid-operation and verifying no messages
are lost on reconnect.

This takes the agent about 30 seconds to produce. The reviewer reads it in under 2 minutes. In those 2 minutes, you catch problems that would take 20 minutes to find in a diff review:

  • Is the scope right? (The plan says 2 files. If the diff touches 6, something drifted.)
  • Is the approach sound? (A message buffer with a flush-on-reconnect is reasonable. If the agent had proposed re-fetching all state from the server, you'd want to discuss that first.)
  • Are there missing steps? (What about messages that were sent during the timeout window? Should they be dropped or retried?)

The plan is a lightweight checkpoint. Review it, approve it, and the agent proceeds with shared understanding of the intent.
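For concreteness, steps 1 and 2 of the plan above might look something like this. This is a hypothetical sketch, not the actual hooks/use-websocket.ts; the `SocketLike` interface and class names are assumptions for illustration:

```typescript
// Sketch of the plan's buffer-and-flush idea.
interface SocketLike {
  readyState: number; // 1 === OPEN, matching the WebSocket constant
  send(data: string): void;
}

class BufferedSender {
  private buffer: string[] = [];

  constructor(private socket: SocketLike) {}

  // Step 1 of the plan: queue outbound messages while not OPEN.
  send(data: string): void {
    if (this.socket.readyState !== 1) {
      this.buffer.push(data);
      return;
    }
    this.socket.send(data);
  }

  // Step 2 of the plan: flush the buffer in FIFO order on reconnect.
  flush(): void {
    while (this.buffer.length > 0 && this.socket.readyState === 1) {
      this.socket.send(this.buffer.shift()!);
    }
  }
}
```

Having even a rough shape like this in mind makes the plan's third gap (the missing-steps question above) easier to spot: nothing here decides what happens to the buffer when the 30-second timeout fires.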

Step 2: Implementation

The agent writes the code. Nothing changes here. But because the plan exists, the agent has a roadmap it can follow rather than making ad hoc decisions as it goes.

Step 3: The Done Comment

When the agent finishes, it posts a structured completion report. Not "done." Not "implemented the feature." A detailed account of what it actually did.

DONE: WebSocket reconnection now buffers outbound messages
during CONNECTING state and flushes on successful reconnect.

Changes:
- hooks/use-websocket.ts: Added messageBuffer ref, queue
  messages when readyState !== OPEN, flush in FIFO order on
  'open' event. Added 30s reconnection timeout.
- components/connection-status.tsx: Shows "Reconnecting..."
  badge during buffer state, switches to error toast after
  timeout.

QA Verification:
1. Start dev server and WebSocket server
2. Open app, verify connection dot is green
3. Kill WebSocket server process
4. Trigger an action that sends a WS message (e.g., update
   a bead status)
5. Restart WebSocket server within 30 seconds
6. Verify: the buffered message is delivered, bead status
   updates in the UI
7. Repeat step 3, but wait >30 seconds before restart
8. Verify: error toast appears after timeout

Acceptance criteria:
- [x] Messages sent during reconnection are not lost (step 6)
- [x] Timeout surfaces user-visible error (step 8)
- [x] No behavior change when connection is stable (step 2)

Commit: f4e2a1b

Now the reviewer has everything they need. They read the plan (what was intended), read the done comment (what was actually built and how it was verified), and then look at the diff with full context. The diff review goes from "what is all this?" to "let me confirm this matches what the agent said it did."

Review Checklists for Agent Output

Even with the implementation narrative, you need a systematic approach. Here's a checklist I use when reviewing Claude Code output. It takes 5-10 minutes per review and catches the categories of bugs that diffs alone miss.

Spec alignment:

  • Does the implementation address every acceptance criterion from the spec?
  • Are there changes that go beyond what the spec asked for?
  • Does the done comment map each criterion to a verification step?

Scope containment:

  • Did the agent only modify files listed in its plan?
  • If it touched additional files, is there a stated reason?
  • Are there "cleanup" changes (renames, reformats, refactors) that weren't part of the task?

Test coverage:

  • Do new tests exist for new behavior?
  • Are the tests actually testing the right thing? (Agents sometimes write tests that pass trivially because they test the mock, not the implementation.)
  • Did the agent claim it ran the tests? Is there evidence?

Architectural consistency:

  • Do the changes follow existing patterns in the codebase?
  • Are there new abstractions that duplicate existing ones?
  • Does the error handling strategy match the rest of the project?

Dependency awareness:

  • If the agent added dependencies, are they justified?
  • Do the changes break any existing functionality? (Check files that import the modified modules.)
  • If the task has dependencies on other tasks, are those dependencies resolved?

This checklist works with any code review tool. Print it on a sticky note, keep it in your PR template, or build it into your CLAUDE.md so the agent knows what standard it's being held to. The point isn't the specific items. It's having a structured protocol instead of "looks good to me."
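Of these categories, scope containment is the easiest to automate: compare the files the agent planned to touch against the files the diff actually touched. A sketch, assuming the changed-file list comes from something like `git diff --name-only` and the planned list is parsed from the plan comment (`outOfScope` is a hypothetical helper, not an existing tool):

```typescript
// Sketch: flag files the agent modified outside its stated plan.
function outOfScope(planned: string[], changed: string[]): string[] {
  const allowed = new Set(planned);
  return changed.filter((f) => !allowed.has(f));
}

const planned = [
  "hooks/use-websocket.ts",
  "components/connection-status.tsx",
];
const changed = [
  "hooks/use-websocket.ts",
  "lib/auth-middleware.ts", // not in the plan
];

console.log(outOfScope(planned, changed)); // ["lib/auth-middleware.ts"]
```

A non-empty result doesn't automatically mean the change is wrong, but it means the extra files were never reviewed as part of the plan, which is exactly the flag a reviewer wants raised.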

Beads as a Review Surface

The plan-comment-done pattern needs a place to live. If plans and done comments are scattered across Slack messages, PR descriptions, and terminal output, you lose the connection between the spec, the plan, the implementation, and the verification.

This is the problem beads solves. Beads is an open-source, Git-native issue tracker where each "bead" carries the entire lifecycle of a task: the spec as the description, agent plans as comments, done reports as comments, and QA results as comments. All attached to one entity, searchable, and permanent.

Here's what the review workflow looks like with the bd CLI:

Create the task with the spec:

bd create --title "Fix WebSocket reconnection message loss" \
  --description "## Problem
Messages sent during WebSocket reconnection are silently
dropped...

## Acceptance Criteria
1. Messages queued during CONNECTING state are delivered
   on reconnect
2. 30-second timeout surfaces error to user
3. No behavior change when connection is stable" \
  --type bug --priority p1

Agent claims the work and posts a plan:

bd update bb-f4e2 --claim --actor eng1
bd comments add bb-f4e2 --author eng1 "PLAN: Add message
buffer to WebSocket hook...

1. Queue outbound messages when readyState !== OPEN
2. Flush buffer in FIFO order on 'open' event
3. Add 30s timeout with error toast
4. Update reconnection test

Files: hooks/use-websocket.ts, components/connection-status.tsx"

You review the plan in 2 minutes:

bd show bb-f4e2  # Read spec + plan comment

If the plan looks right, the agent proceeds. If it doesn't, you comment back with corrections before any code is written.

Agent completes and posts a done report:

bd comments add bb-f4e2 --author eng1 "DONE: WebSocket
reconnection now buffers outbound messages...

QA Verification:
1. Kill WS server, trigger action, restart within 30s...

Acceptance criteria:
- [x] Buffered messages delivered on reconnect
- [x] Timeout error visible
- [x] No regression on stable connection

Commit: f4e2a1b"

bd update bb-f4e2 --status ready_for_qa

QA verifies independently:

bd show bb-f4e2  # Read the done comment's verification steps
# Execute each step
bd comments add bb-f4e2 --author qa1 "QA PASS: All 3 criteria
verified. Buffer flushes correctly, timeout fires at 30s,
stable connections unaffected."

The entire review trail is in one place. Six months later, when someone asks "why does the WebSocket buffer messages during reconnection?", the answer is in the bead: the spec explains the problem, the plan explains the approach, the done comment explains what was built, and the QA comment confirms it works.

When Terminal Review Hits Its Ceiling

Running bd show on one task gives you everything. But when you're reviewing multiple agents' output across several parallel workstreams, the CLI workflow scales linearly: one bd show per task, one bd list to see what's ready for review, one bd show per plan you need to approve.

This is where Beadbox fits. Beadbox is a real-time dashboard that shows you every task in your workspace with its current status, latest comment, and position in the review pipeline. You see which agents have posted plans that need your approval. Which have posted done reports ready for your review. Which are still in progress. All updating live as agents write comments and change statuses through the bd CLI.

You don't need Beadbox to use the plan-comment-done pattern. The CLI handles the full workflow. But when you have five agents producing reviewable output simultaneously, being able to see the review queue at a glance instead of polling each task individually changes how fast you move through the pipeline.

Beadbox is free during the beta, and the beads CLI it runs on is open-source.

The Review Problem Won't Solve Itself

The volume of AI-generated code is growing faster than our ability to review it. The tools we have were built for a different scale and a different workflow. GitHub PRs, IDE diffs, even sophisticated static analysis: none of them address the fundamental problem, which is that reviewing code without knowing the author's intent is dramatically harder than reviewing code with it.

The fix isn't better diff tools. It's structured intent: a record of what the agent planned to do, what it actually did, and how it verified the result. The plan-comment-done pattern gives you that record without adding significant overhead. The agent spends 30 seconds writing a plan. You spend 2 minutes reviewing it. The agent spends 60 seconds writing a done report. You review the diff with full context instead of from scratch.

Five principles to take away:

  1. Require plans before code. A 30-second plan comment saves 20-minute review sessions. If the plan is wrong, fix it before the code exists.

  2. Demand structured done reports. "Done" is not a done report. Verification steps, acceptance criteria mapping, and commit hashes are a done report.

  3. Review against the spec, not the diff. The diff shows what changed. The spec says what should have changed. Cross-reference them.

  4. Enforce scope boundaries. If the agent touched files outside its plan, that's a flag. Unplanned changes are unreviewed changes.

  5. Treat review as a protocol, not a judgment call. A checklist catches more bugs than intuition. "Looks good to me" is not a review.

The agents will keep getting faster. The diffs will keep getting larger. The question is whether your review process scales with them, or whether you're still squinting at 600-line diffs and hoping for the best.

If you're building workflows like this, star Beadbox on GitHub.

Try it yourself

Start with beads for the coordination layer. Add Beadbox when you need visual oversight.

Free while in beta. No account required. Your data stays local.
