AI Code Review
From AISApedia, the AI skills & terms encyclopedia
AI code review protocols are structured approaches to using language models for source code analysis, where focused review passes — each targeting a specific bug category — replace generic 'check this code' requests. By directing the AI to examine code through a single lens at a time (security vulnerabilities, logic errors, documentation accuracy, or performance issues), teams catch defects that unfocused review consistently misses.
Why does a generic 'review this code' request miss real bugs?
When asked to review code without a specific focus, language models distribute attention across style, structure, naming, documentation, logic, performance, and security simultaneously. The result is typically a shallow pass across all categories — suggestions to rename variables, add comments, or use more descriptive function names — while genuine defects in logic or security go unmentioned because the model allocated insufficient attention to any single concern.
This mirrors how human code review works under time pressure. A reviewer asked to 'look over this PR' will notice formatting inconsistencies and comment on naming conventions because those are cognitively cheap observations. Finding a race condition or an injection vulnerability requires sustained focus on a single concern, which is harder to maintain when attention is split across everything simultaneously.
Focused review passes solve this by constraining the model's attention to one category per pass. A security-focused pass that asks 'identify injection risks and input validation gaps' will examine every user input pathway, every database query, and every external API call — the specific code paths where injection vulnerabilities live. The model doesn't waste attention on variable naming because the prompt doesn't ask about it.
One plausible explanation, by analogy with how attention works in transformer models, is that a narrowly defined task gives the model a clear retrieval target, which makes the relevant code patterns more salient.
How should review passes be structured?
A practical protocol uses three to five passes, similar to prompt chaining, each with a specific lens and clear evaluation criteria. A security pass examines input validation, authentication checks, data sanitisation, and access control. A logic pass traces execution paths, looking for off-by-one errors, unhandled edge cases, null reference risks, and incorrect boundary conditions. A consistency pass compares the code's actual behaviour against its documentation, type signatures, and test assertions.
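The three passes described above can be sketched as plain data plus a prompt builder. This is a minimal illustration rather than a prescribed format: the `ReviewPass` class, the `build_prompt` helper, and the prompt wording are all hypothetical, and the step that sends the prompt to a model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPass:
    """One focused review lens (hypothetical structure for illustration)."""
    name: str
    checks: tuple[str, ...]


# The three passes named in the text; the check lists mirror the prose.
PASSES = (
    ReviewPass("security", ("input validation", "authentication checks",
                            "data sanitisation", "access control")),
    ReviewPass("logic", ("off-by-one errors", "unhandled edge cases",
                         "null reference risks", "incorrect boundary conditions")),
    ReviewPass("consistency", ("documentation", "type signatures", "test assertions")),
)


def build_prompt(review_pass: ReviewPass, code: str) -> str:
    """Build a single-lens prompt that names only this pass's checks."""
    checks = "\n".join(f"- {check}" for check in review_pass.checks)
    return (
        f"Review the code below for {review_pass.name} issues only.\n"
        f"Focus on:\n{checks}\n"
        "Ignore anything outside this list.\n\n"
        f"{code}"
    )


# One prompt per pass for the same piece of code.
prompts = [build_prompt(p, "def add(a, b):\n    return a + b") for p in PASSES]
```

Keeping the passes as data makes it easy to add a fourth or fifth lens without touching the prompt-building logic.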
The order matters. Security should typically come first because its findings are highest-severity and most time-sensitive. Logic errors come next because they affect correctness for every user. Consistency and style passes come last because their findings, while valuable for maintainability, are lower-risk in production. Each pass should produce structured output (specific line references, severity ratings, and suggested fixes) rather than paragraph-form prose that the developer must parse.
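One way to represent that structured output, and to present findings in the priority order described here, is sketched below. The `Finding` fields, the severity names, and the `prioritise` helper are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass

# Lower rank sorts first; the severity names are assumed for this sketch.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}
# Pass order from the text: security, then logic, then consistency and style.
PASS_RANK = {"security": 0, "logic": 1, "consistency": 2, "style": 3}


@dataclass
class Finding:
    line: int            # line the finding refers to
    severity: str        # one of SEVERITY_RANK's keys
    review_pass: str     # one of PASS_RANK's keys
    message: str
    suggested_fix: str


def prioritise(findings):
    """Order findings by pass priority, then severity, then line number."""
    return sorted(
        findings,
        key=lambda f: (PASS_RANK[f.review_pass], SEVERITY_RANK[f.severity], f.line),
    )


findings = [
    Finding(40, "low", "consistency", "Docstring is out of date", "Update the docstring"),
    Finding(12, "critical", "security", "Query built from raw input", "Use a parameterised query"),
    Finding(27, "high", "logic", "Loop skips the last element", "Iterate over the full range"),
]
ordered = prioritise(findings)
```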
Customising passes for the project's technology stack is important. A security pass for a web application should check for XSS, CSRF, and SQL injection. A security pass for a mobile app should check for insecure data storage, improper certificate validation, and hardcoded secrets. Generic security checklists produce noisy results; technology-specific passes produce actionable ones.
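A stack-specific security pass could be driven by a small lookup table, as in the sketch below. The checklists come from the examples in this section; the stack labels, function names, and prompt wording are hypothetical.

```python
# Checks per stack, taken from the examples in the text; the keys are hypothetical labels.
STACK_SECURITY_CHECKS = {
    "web": ["XSS", "CSRF", "SQL injection"],
    "mobile": ["insecure data storage", "improper certificate validation", "hardcoded secrets"],
}


def security_checks_for(stack: str) -> list[str]:
    """Return the stack-specific checklist, failing loudly on an unknown stack."""
    try:
        return STACK_SECURITY_CHECKS[stack]
    except KeyError:
        raise ValueError(f"No security checklist defined for stack {stack!r}") from None


def security_prompt(stack: str) -> str:
    """Build a security-only prompt that names the checks for this stack."""
    checks = ", ".join(security_checks_for(stack))
    return f"Review this {stack} code for security issues only. Check for: {checks}."
```

Raising an error for an unknown stack, rather than falling back to a generic checklist, avoids silently producing the noisy results the text warns about.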
How does AI code review integrate with existing development workflows?
The most effective integration point is the pull request, often as part of a CI/CD pipeline. AI review runs automatically when a PR is opened, posting findings as inline comments on the relevant lines. This keeps the feedback in the same context where human reviewers work, so the AI's findings and human insights are visible side by side. Teams can configure the review to block merging on critical findings (security vulnerabilities) while treating other findings as advisory.
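The blocking-versus-advisory rule can be expressed as a small gate function. This is a sketch under stated assumptions: each finding is a dict with `category` and `severity` keys, and only critical security findings block the merge. How the result is posted to the pull request depends on the hosting platform and is not shown.

```python
def is_blocking(finding):
    """Assumed policy: only critical security findings block a merge."""
    return finding["category"] == "security" and finding["severity"] == "critical"


def merge_decision(findings):
    """Split findings into blocking and advisory, and say whether to block the merge."""
    blocking = [f for f in findings if is_blocking(f)]
    advisory = [f for f in findings if not is_blocking(f)]
    return {"block_merge": bool(blocking), "blocking": blocking, "advisory": advisory}


decision = merge_decision([
    {"category": "security", "severity": "critical", "line": 12},
    {"category": "logic", "severity": "high", "line": 27},
    {"category": "security", "severity": "low", "line": 31},
])
```

In a CI job, a step could exit with a non-zero status when `block_merge` is true and post the advisory findings as ordinary comments.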
Teams that have adopted AI code review in practice report that it works best as a pre-review filter, not a replacement for human reviewers. The AI catches mechanical defects — missing null checks, unsanitised inputs, documentation drift — which frees human reviewers to focus on architectural decisions, business logic correctness, and design patterns that require understanding the broader system context. This division of labour makes both AI and human review more effective.
Configuring the review to match team conventions is essential for adoption. A security pass that flags every inline SQL query is noisy if the team uses a query builder that handles parameterisation internally. Review prompts should be calibrated to the project's actual patterns, frameworks, and conventions. Teams that invest time in this calibration report much higher signal-to-noise ratios and better developer acceptance of AI review feedback.
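One lightweight way to calibrate a pass is to prepend the team's conventions to the prompt so that accepted patterns are not reported. The `calibrate` helper and the convention text below are hypothetical; real calibration usually takes several rounds of adjustment against actual review output.

```python
def calibrate(base_prompt: str, conventions: list[str]) -> str:
    """Prepend project conventions so the model does not flag accepted patterns."""
    if not conventions:
        return base_prompt
    notes = "\n".join(f"- {c}" for c in conventions)
    return (
        "Project conventions (do not report code that follows these):\n"
        f"{notes}\n\n{base_prompt}"
    )


prompt = calibrate(
    "Identify injection risks and input validation gaps.",
    ["All SQL goes through the team's query builder, which parameterises values internally."],
)
```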
How do you measure whether AI code review is actually catching bugs?
Track two categories of metrics: detection metrics (what the AI finds) and prevention metrics (what the AI prevents from reaching production). Detection metrics include the number of findings per review, the severity distribution, and the false positive rate. Prevention metrics require comparing defect rates before and after AI review adoption — a longer-term measurement but ultimately more meaningful.
The false positive rate is the most important metric for adoption. Developers quickly lose trust in a review tool that frequently flags correct code as problematic. A false positive rate above 20-30% typically leads to developers ignoring the tool entirely. Invest in reducing false positives (by improving prompts and calibrating to the codebase) before increasing detection sensitivity.
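The detection metrics can be computed from triaged findings with a few lines of arithmetic. In this sketch the false positive rate is taken to mean the share of AI findings that developers marked as incorrect. The data shape, the function name, and the 20% warning threshold (the lower end of the range mentioned here) are assumptions.

```python
from collections import Counter


def detection_metrics(reviews, fp_warning_threshold=0.20):
    """Summarise detection metrics.

    reviews: list of reviews, each a list of finding dicts with
    'severity' and 'false_positive' (set by developer triage) keys.
    """
    all_findings = [f for review in reviews for f in review]
    total = len(all_findings)
    false_positives = sum(1 for f in all_findings if f["false_positive"])
    fp_rate = false_positives / total if total else 0.0
    return {
        "findings_per_review": total / len(reviews) if reviews else 0.0,
        "severity_distribution": dict(Counter(f["severity"] for f in all_findings)),
        "false_positive_rate": fp_rate,
        "needs_tuning": fp_rate > fp_warning_threshold,
    }


metrics = detection_metrics([
    [{"severity": "high", "false_positive": False},
     {"severity": "low", "false_positive": True}],
    [{"severity": "low", "false_positive": False},
     {"severity": "low", "false_positive": False}],
])
```

Here one of four findings was a false positive, a rate of 0.25, which exceeds the assumed threshold and marks the configuration as needing tuning.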
Track which categories of bugs the AI catches that humans miss, and vice versa. This data informs both the AI review configuration (focus the AI on categories where it outperforms humans) and the human review checklist (ensure humans focus on categories where they outperform the AI). Over time, this data builds a clear picture of where AI code review adds the most value for your specific codebase.
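A simple tally per bug category is enough to start this comparison. The sketch below assumes each confirmed defect is recorded with a `category` and a `caught_by` value of `ai`, `human`, or `both`; both field names are hypothetical.

```python
from collections import defaultdict


def catches_by_category(defects):
    """Count, per category, defects caught only by AI, only by humans, or by both."""
    column = {"ai": "ai_only", "human": "human_only", "both": "both"}
    table = defaultdict(lambda: {"ai_only": 0, "human_only": 0, "both": 0})
    for defect in defects:
        table[defect["category"]][column[defect["caught_by"]]] += 1
    return dict(table)


tally = catches_by_category([
    {"category": "null checks", "caught_by": "ai"},
    {"category": "null checks", "caught_by": "ai"},
    {"category": "null checks", "caught_by": "both"},
    {"category": "architecture", "caught_by": "human"},
])
```

Categories dominated by `ai_only` are candidates for the AI configuration; categories dominated by `human_only` belong on the human review checklist.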
How should review protocols evolve as your codebase matures?
Review protocols should be living documents that adapt to the patterns, vulnerabilities, and architectural decisions specific to your codebase. A newly adopted framework introduces new vulnerability classes that the security pass should target. A refactor that introduces a new abstraction layer creates new consistency requirements that the consistency pass should verify. Teams that treat their review protocol as static miss the drift between what the protocol checks and what the codebase actually needs checked.
Feedback loops from production incidents are particularly valuable for evolving review protocols. When a bug reaches production, trace it back to the pull request where it was introduced and ask: which review pass should have caught this, and why didn't it? The answer informs either a new check to add to an existing pass or a new pass category entirely. Over time, this incident-driven evolution makes the review protocol increasingly tailored to the specific failure modes your codebase is prone to.
Periodic review of false positive patterns is equally important. If a particular check consistently flags code that is actually correct, it erodes developer trust and wastes review time. Tuning the check to be more precise — or removing it and replacing it with a better-targeted alternative — keeps the signal-to-noise ratio high enough that developers take the review output seriously.
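A periodic review of this kind can start from a per-check false positive rate. The sketch below assumes triaged findings carry a `check` identifier and a `false_positive` flag. The 30% cut-off and the minimum sample size are illustrative defaults, not recommendations from the source.

```python
from collections import defaultdict


def checks_to_tune(triaged_findings, max_fp_rate=0.3, min_samples=5):
    """Return checks whose false positive rate exceeds max_fp_rate.

    Checks with fewer than min_samples findings are skipped because
    their rate is too noisy to act on.
    """
    stats = defaultdict(lambda: [0, 0])  # check -> [false positives, total]
    for finding in triaged_findings:
        stats[finding["check"]][1] += 1
        if finding["false_positive"]:
            stats[finding["check"]][0] += 1
    return sorted(
        check
        for check, (fp, total) in stats.items()
        if total >= min_samples and fp / total > max_fp_rate
    )


history = (
    [{"check": "inline-sql", "false_positive": True}] * 4
    + [{"check": "inline-sql", "false_positive": False}] * 2
    + [{"check": "null-check", "false_positive": True}] * 1
    + [{"check": "null-check", "false_positive": False}] * 4
    + [{"check": "rare-check", "false_positive": True}] * 2
)
noisy = checks_to_tune(history)
```

In this made-up history only `inline-sql` is flagged: four of its six findings were false positives, `null-check` stays under the cut-off, and `rare-check` has too few samples to judge.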
Try this yourself
Pick a function from your codebase (50-100 lines). Paste it into Claude with three separate review prompts: (1) 'Find security vulnerabilities, especially injection risks and input validation gaps,' (2) 'Identify logic errors, edge cases, and off-by-one bugs,' (3) 'Check if the code matches its docstring/comments.' Compare what each focused pass catches versus a single 'review this code' request.
Real-world example
A single 'review this code' request on a user registration function returned generic advice: 'Consider adding more comments.' Three focused passes found: (1) SQL injection via an unsanitised email field, (2) a race condition in which two users could claim the same username, and (3) a docstring stating 'returns user ID' although the function returns the full user object. The focused approach found three real bugs that the generic review missed entirely.
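The first finding can be illustrated with a self-contained sketch. The code below is not the registration function from the example, which is not shown in the source. It is a hypothetical lookup using Python's built-in `sqlite3` module that reproduces the same class of bug and its usual fix.

```python
import sqlite3


def find_user_unsafe(conn, email):
    # Vulnerable: the email is spliced into the SQL text, so quotes in it change the query.
    query = f"SELECT id FROM users WHERE email = '{email}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, email):
    # Parameterised: the driver passes the email as data, never as SQL.
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()


# In-memory database with two hypothetical users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)

payload = "' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # condition is always true, so every row matches
blocked = find_user_safe(conn, payload)    # treated as a literal email, so nothing matches
```

A security-focused pass is prompted to look at exactly this kind of query construction, which is why it surfaces the problem where a generic request tends not to.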
See also
- Statistical Validation with AI (Advanced)
- UX Research Synthesis (Intermediate)
- Verification Checklists (Foundational)
- AI Code Generation (Intermediate)
- Feature Engineering with AI (Advanced)
- Roadmap AI Analysis (Advanced)
- Stakes-Based Review (Foundational)
- AI Output Categorisation (Intermediate)
