How to Do Code Review with AI

From AISApedia, the AI skills & terms encyclopedia

Code review with AI uses language models to supplement human code review by scanning for security vulnerabilities, inconsistent error handling, documentation gaps, and common bug patterns. AI reviewers excel at tireless enforcement of mechanical checks across large changesets, while human reviewers focus on architectural decisions, business logic correctness, and design intent — creating a complementary review process that catches more issues than either alone.

What does AI catch that human reviewers typically miss?

Human reviewers are subject to attention fatigue, recency bias, and focus narrowing — after reviewing several files of complex business logic, they are less likely to notice a SQL injection vulnerability in a utility function or inconsistent error handling across API endpoints. AI reviewers do not fatigue and apply the same scrutiny to the last file in a changeset as the first. This consistency is the tool's primary advantage.

Categories where AI code review is particularly strong include: input validation gaps (user-controlled data flowing into unsafe operations without sanitisation, a key concern in prompt security), error message information leakage (stack traces or system paths exposed to users in error responses), inconsistent null checking across related functions, missing edge case handling in conditional logic, and style or convention violations that individually seem minor but accumulate into technical debt. These are mechanical checks that benefit from exhaustive, pattern-based scanning across the entire changeset.

The corollary is that AI code review is weaker at evaluating whether the code solves the right problem, whether the abstraction boundaries are in the right place, and whether the implementation will be maintainable by the team six months from now. These judgments require understanding of team context, product direction, and organisational conventions that models do not reliably have. The strongest review process uses AI for mechanical coverage and humans for strategic judgment.

How should you structure prompts for effective AI code review?

Generic prompts like "review this code" produce generic feedback — a mix of obvious style comments and vague suggestions that experienced developers already know. Effective AI code review uses focused review passes, each with a specific lens, similar to how domain prompt templates encode expert checklists. A security pass might instruct: "Check every point where user input enters the system and trace it to where it is consumed. Flag any path where input reaches a database query, file operation, or log statement without sanitisation." A separate maintainability pass might focus on function length, naming clarity, and test coverage gaps.

Including project context improves review quality significantly. Tell the model what the code is supposed to do, what framework conventions it should follow, and what the deployment environment looks like. A review that knows the code runs in a multi-tenant environment will flag tenant isolation issues that a context-free review would miss. A review that knows the team uses a specific error handling convention will flag deviations that a generic review would accept. <a href="/aisapedia/domain-prompt-templates">Domain prompt templates</a> for code review encode this project-specific context in a reusable format.
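As a rough illustration of the focused passes and project context described above, the following Python sketch assembles one prompt per review lens. The names `ReviewPass`, `PASSES`, and `build_prompt` are hypothetical, invented for this example. The security instruction is the one quoted earlier, and the call that sends the prompt to a model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPass:
    """One review lens: a short name plus the instructions for that lens."""
    name: str
    instructions: str


SECURITY = ReviewPass(
    name="security",
    instructions=(
        "Check every point where user input enters the system and trace it "
        "to where it is consumed. Flag any path where input reaches a "
        "database query, file operation, or log statement without sanitisation."
    ),
)

MAINTAINABILITY = ReviewPass(
    name="maintainability",
    instructions="Focus on function length, naming clarity, and test coverage gaps.",
)

# Hypothetical ordering: run the security lens first, then maintainability.
PASSES = [SECURITY, MAINTAINABILITY]


def build_prompt(review_pass: ReviewPass, project_context: str, diff: str) -> str:
    """Combine one review lens, the project context, and the diff into a prompt."""
    return (
        f"You are performing a {review_pass.name} review.\n\n"
        f"Project context:\n{project_context.strip()}\n\n"
        f"Instructions:\n{review_pass.instructions}\n\n"
        f"Diff to review:\n{diff.strip()}\n"
    )
```

Calling `build_prompt` once per entry in `PASSES` yields one focused prompt per lens, each carrying the same project context (for example, a note that the code runs in a multi-tenant environment).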

For large changesets, break the review into logical units rather than submitting everything at once. AI models handle focused, single-concern reviews better than sprawling multi-file diffs, where the connections between changes are implicit. This is a practical application of task decomposition. A 500-line diff across 20 files is reviewed more effectively as several focused passes, one per logical concern, than as a single monolithic prompt.
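One possible way to break a large changeset into smaller units is sketched below. It splits `git diff` output on the standard `diff --git a/... b/...` file headers and then groups files by top-level directory. Grouping by directory is only a crude stand-in for "logical concern" and is an assumption of this example, as are the function names. Paths containing spaces are not handled.

```python
import re
from collections import defaultdict

# Matches the per-file header line that `git diff` emits.
FILE_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def split_diff_by_file(diff_text: str) -> dict:
    """Split `git diff` output into one chunk per file path."""
    chunks = {}
    current = None
    for line in diff_text.splitlines():
        match = FILE_HEADER.match(line)
        if match:
            current = match.group(2)  # path on the "b/" (post-change) side
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}


def group_by_top_level_dir(chunks: dict) -> dict:
    """Group per-file chunks by top-level directory (a rough proxy for concern)."""
    groups = defaultdict(list)
    for path, chunk in chunks.items():
        top = path.split("/", 1)[0] if "/" in path else "(root)"
        groups[top].append(chunk)
    return {top: "\n".join(parts) for top, parts in groups.items()}
```

Each resulting group can then be submitted as its own review request. A team would likely replace the directory heuristic with a grouping that reflects its own module boundaries.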

Where does AI review fit in the development workflow?

The most effective placement is as a pre-human-review gate. The AI review runs automatically when a pull request is opened, flagging mechanical issues before a human reviewer spends time on them. This means the human reviewer arrives to a changeset where the obvious issues are already annotated, allowing them to focus their attention on design, logic, and architecture — work that is both more valuable and more engaging.

Some teams use AI-integrated development environments like <a href="/aisapedia/cursor-ide">Cursor</a> or <a href="/aisapedia/github-copilot">GitHub Copilot</a> to catch issues during writing rather than during review. This shifts feedback even earlier in the development cycle, reducing the volume of issues that reach the review stage at all. The trade-off is that in-editor AI review operates with less context about the overall changeset and project state than a PR-level review.

A common antipattern is treating AI review as a replacement for human review. Teams that eliminate human reviewers in favour of AI-only review tend to accumulate architectural drift, business logic errors, and design inconsistencies that AI cannot detect. The tool works best as a force multiplier for human reviewers — handling the mechanical checks that are important but tedious, so human attention can be directed where it creates the most value.

How do you measure whether AI code review is adding value?

Track two categories of metrics: issue detection (what the AI found) and workflow impact (how the review process changed). For issue detection, log the number and severity of issues flagged by AI per review, the false positive rate (flags that were not actual issues), and the "escape rate" (issues that made it to production despite AI review). The false positive rate is critical — if reviewers learn to ignore AI comments because most are noise, the tool's value collapses even if it occasionally catches real issues.
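The issue detection metrics above can be computed from a few counts, as in the sketch below. The `ReviewStats` fields and both formulas are assumptions made for illustration. In particular, "escape rate" is interpreted here as the share of known real issues that reached production; a team may prefer a different denominator.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    flags_raised: int      # comments the AI reviewer posted
    flags_confirmed: int   # flags a human judged to be real issues
    escaped_issues: int    # issues found in production despite AI review


def false_positive_rate(stats: ReviewStats) -> float:
    """Share of AI flags that turned out not to be real issues."""
    if stats.flags_raised == 0:
        return 0.0
    return (stats.flags_raised - stats.flags_confirmed) / stats.flags_raised


def escape_rate(stats: ReviewStats) -> float:
    """Share of known real issues that the AI review did not catch."""
    known_real = stats.flags_confirmed + stats.escaped_issues
    if known_real == 0:
        return 0.0
    return stats.escaped_issues / known_real
```

With 50 flags raised, 30 confirmed, and 10 escaped issues, these definitions give a false positive rate of 0.4 and an escape rate of 0.25.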

For workflow impact, measure review cycle time (does AI pre-review reduce the time human reviewers spend?), rework rate (do PRs require fewer revision rounds?), and reviewer satisfaction (do human reviewers find the AI annotations useful?). These metrics ensure that the AI review is not just finding issues in theory but actually improving the team's development velocity and code quality in practice.

How should teams handle AI review false positives without losing trust in the tool?

False positives — flags on code that is actually correct — are the primary threat to AI code review adoption. When developers encounter repeated false flags, they begin dismissing AI comments reflexively, which means genuine issues get ignored alongside the noise. The solution is active false positive management: track which categories of AI comments are most frequently dismissed, and tune the review prompts to suppress those categories or increase their specificity.

A feedback mechanism where developers can mark AI comments as helpful or unhelpful creates the data needed for systematic tuning. Aggregate these ratings weekly and adjust the review prompt: remove checks that consistently produce false positives, refine checks that flag too broadly, and double down on checks that consistently surface real issues. This iterative improvement mirrors the broader <a href="/aisapedia/feedback-loop-design">feedback loop design</a> principle: the review system should get more precise over time, not remain static.
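A minimal sketch of the aggregation step follows, assuming each rating is stored as a `(category, helpful)` pair. The function names, the 0.5 threshold, and the minimum sample size of 5 are illustrative choices, not recommended values.

```python
from collections import Counter


def dismissal_rates(ratings):
    """Return {category: share of comments marked unhelpful}."""
    total, unhelpful = Counter(), Counter()
    for category, helpful in ratings:
        total[category] += 1
        if not helpful:
            unhelpful[category] += 1
    return {category: unhelpful[category] / total[category] for category in total}


def checks_to_revisit(ratings, threshold=0.5, min_samples=5):
    """List categories dismissed more often than `threshold`.

    Categories with fewer than `min_samples` ratings are skipped so that
    one or two dismissals do not trigger a prompt change.
    """
    ratings = list(ratings)
    counts = Counter(category for category, _ in ratings)
    rates = dismissal_rates(ratings)
    return sorted(
        category
        for category, rate in rates.items()
        if counts[category] >= min_samples and rate > threshold
    )
```

The categories returned by `checks_to_revisit` are candidates for removal or tighter wording in the review prompt. Categories with low dismissal rates are the ones worth keeping and extending.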

Try this yourself

Paste the diff from your last commit into Claude or ChatGPT with this prompt: "Review this code for security vulnerabilities, particularly around user input handling and data validation." Fix at least one issue it finds before pushing.

Real-world example

A senior developer's PR looks clean and passes human review. The AI review then spots user-controlled data flowing into a log statement without sanitisation, a potential log injection attack. It also finds three instances of error messages that leak system information. Both were missed because the human reviewer focused on the clever algorithm.
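To make the log injection finding concrete, here is a minimal Python sketch of one common mitigation: escaping carriage returns and line feeds before user input reaches a log call. The function names and the "auth" logger are hypothetical. This is not the code from the example above, and real deployments may need stricter filtering or structured logging.

```python
import logging

logger = logging.getLogger("auth")


def sanitise_for_log(value: str) -> str:
    """Escape CR and LF so user input cannot forge extra log lines."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def record_failed_login(username: str) -> None:
    # Without sanitisation, a username such as
    # "alice\nINFO admin logged in" would appear as two log entries.
    logger.warning("Failed login for user=%s", sanitise_for_log(username))
```

After sanitisation, the forged line break is written as the literal characters `\n`, so the attacker's text stays on the same log line as the genuine entry.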