AI-Assisted Code Review: What Works, What Doesn't

Ravinder · 8 min read

Two Years In

We rolled out AI-assisted code review to our engineering organisation roughly two years ago. I can now give you an honest assessment that is neither vendor marketing nor dismissive scepticism.

The short version: AI code review tools are genuinely useful for a narrow but high-value set of tasks, genuinely poor at another set, and quietly corrosive if you do not manage how they integrate with your human review process. The teams that got the most value were the ones that were deliberate about the division of labour. The teams that got burned were the ones that turned it on and hoped for the best.

This post is what I wish someone had written before we started.


The Review Taxonomy

To evaluate AI review tools honestly, you need to separate code review into categories. AI performs very differently across them.

mindmap
  root((Code Review))
    Correctness
      Logic errors
      Edge case handling
      Algorithm correctness
      Null/undefined handling
    Security
      OWASP vulnerabilities
      Secrets in code
      Input validation
      Auth bypass patterns
    Design
      Architectural fit
      Abstraction quality
      Coupling and cohesion
      Naming and clarity
    Intent
      Does this solve the right problem
      Is the test meaningful
      Is this feature even needed
    Operational
      Performance implications
      Error handling completeness
      Observability hooks
      Configuration correctness

AI tools are excellent at Security and mechanical Correctness (null checks, resource leaks, common anti-patterns). They are decent at Operational concerns. They are poor at Design and essentially useless at Intent.

The teams that were disappointed had expected AI to perform across all categories. The teams that were satisfied had targeted it at Security and Correctness and kept humans responsible for Design and Intent.


What AI Does Well

Security vulnerability detection

This is the strongest category. The underlying models have been trained on a vast corpus of CVEs and security advisories, and they pattern-match against known vulnerability classes reliably.

// AI correctly flags this as SQL injection risk
public List<User> searchUsers(String query) {
    String sql = "SELECT * FROM users WHERE name LIKE '%" + query + "%'";
    return jdbcTemplate.query(sql, userRowMapper);
}
 
// AI suggests:
// SQL injection vulnerability. User input directly concatenated into query.
// Fix: use parameterized query with PreparedStatement
// OWASP A03:2021 – Injection
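
The suggested fix is mechanical. A minimal sketch of the parameterized version, assuming the same Spring JdbcTemplate setup as the snippet above:

// Parameterized query: the driver binds the input, so it cannot alter the SQL
public List<User> searchUsers(String query) {
    String sql = "SELECT * FROM users WHERE name LIKE ?";
    return jdbcTemplate.query(sql, userRowMapper, "%" + query + "%");
}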

In our benchmarks against a corpus of known-vulnerable code, AI tools caught 83% of OWASP Top 10 instances — higher than the human reviewer baseline for the same corpus. Critically, AI catches these consistently regardless of reviewer experience or how late in the sprint the review happens.

Hardcoded secrets and credentials

# AI flags immediately
DB_PASSWORD = "Sup3rS3cretP@ssw0rd!"
API_KEY = "sk-proj-abc123xyz"
 
# AI comment:
# Hardcoded credentials detected. This will be committed to version history.
# Move to environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault).
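
The fix is equally mechanical. A sketch, assuming the values are supplied as environment variables (the variable names are illustrative):

import os

# Read at startup; raises KeyError (fail fast) if the variable is missing
DB_PASSWORD = os.environ["DB_PASSWORD"]
API_KEY = os.environ["API_KEY"]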

This class of finding is pure pattern matching. AI is faster and more reliable than humans for it.

Resource leak detection

// AI catches the missing close()
public String readConfig(String path) throws IOException {
    FileReader reader = new FileReader(path);  // Never closed
    BufferedReader br = new BufferedReader(reader);
    return br.readLine();
    // AI: FileReader not closed. Use try-with-resources.
}
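
The fix the tool points at, sketched with try-with-resources so both readers are closed even if readLine() throws:

public String readConfig(String path) throws IOException {
    // try-with-resources closes the BufferedReader and the underlying FileReader
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        return br.readLine();
    }
}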

What AI Does Poorly

Business logic correctness

AI does not know your domain. It cannot know whether the 15% discount rule applies to all users or only premium users unless it has that context. Even with context injection, it often gets domain rules wrong.

# AI sees no bug here
def calculate_refund(order, days_since_purchase):
    if days_since_purchase <= 30:
        return order.total * 0.9  # 10% restocking fee
    return 0
 
# But the product requirement is:
# Premium users get 100% refund within 30 days
# Regular users get 90% refund within 30 days
# AI has no way to know this rule exists

Business logic bugs require reading the ticket, understanding the domain model, and knowing the product requirements. These are human concerns.
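
For contrast, here is roughly what the requirement-compliant version looks like, assuming a hypothetical is_premium flag on the ordering user. Nothing in the diff tells the AI this branch should exist:

def calculate_refund(order, days_since_purchase):
    if days_since_purchase > 30:
        return 0
    # Premium users get a full refund; regular users pay the 10% restocking fee
    if order.user.is_premium:  # hypothetical flag; not inferable from the code
        return order.total
    return order.total * 0.9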

Test quality assessment

This is the area where AI feedback is most misleading. AI will approve tests that pass but do not protect.

# AI: "Good test coverage for the calculate_discount function."
def test_calculate_discount():
    result = calculate_discount(100, 10)
    assert result is not None  # Passes. Means nothing.
 
# The meaningful test AI missed:
def test_calculate_discount_with_zero_price():
    with pytest.raises(ValueError):
        calculate_discount(0, 10)  # Should it raise? Return 0? AI doesn't know.

AI can count tests. It cannot judge whether the tests are actually testing the right things. A high-coverage test suite that only tests happy paths passes AI review with flying colours.
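
A test that would actually protect the function asserts on behaviour, not existence. A sketch, assuming the second argument is a percentage off and that negative percentages should be rejected:

import pytest
from pricing import calculate_discount  # hypothetical module path

def test_calculate_discount_applies_percentage():
    # Asserting the value catches logic regressions; `is not None` does not
    assert calculate_discount(100, 10) == 90

def test_calculate_discount_rejects_negative_percentage():
    with pytest.raises(ValueError):
        calculate_discount(100, -5)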

Architectural impact

AI reviews one PR at a time. It does not have a mental model of how the codebase has evolved, what the intended architecture is, or how this change fits into the larger system trajectory.

graph TD
    AI["AI Review Context"]
    Human["Human Review Context"]
    AI --> A1["This PR's diff"]
    AI --> A2["Surrounding code (limited context window)"]
    Human --> H1["This PR's diff"]
    Human --> H2["6 months of architectural decisions"]
    Human --> H3["3 related PRs from last week"]
    Human --> H4["Product roadmap context"]
    Human --> H5["Known technical debt in this area"]
    Human --> H6["Team coding standards not in writing"]
    style AI fill:#FEE2E2,stroke:#EF4444
    style Human fill:#D1FAE5,stroke:#10B981

A change that looks perfectly reasonable in isolation might be adding a new pattern to a module that was supposed to be deprecated, or duplicating logic that was recently centralised elsewhere. AI misses this entirely.


The False Positive Problem

This is the most practically damaging issue with AI code review: false positives erode trust.

A false positive is a comment that flags something as a bug or concern when there is nothing wrong. Early in our rollout, our AI tool was generating 4-6 false positive comments per PR. Within three months, engineers had started dismissing all AI comments without reading them.

flowchart LR
    FP["High false positive rate"] --> Ignore["Engineers ignore AI comments"]
    Ignore --> Miss["Real bugs missed\n(signal lost in noise)"]
    Miss --> Worse["Worse outcome than no AI review"]
    style FP fill:#FEE2E2,stroke:#EF4444
    style Worse fill:#FEE2E2,stroke:#EF4444

To avoid this:

  1. Tune confidence thresholds: Only surface comments above a confidence threshold. Better to miss a few bugs than to flood reviewers with noise (see the sketch after this list).
  2. Categorise by severity: CRITICAL and HIGH findings are blocking. MEDIUM and LOW are non-blocking suggestions. Engineers pay attention to blocking comments.
  3. Measure false positive rate monthly: Track it. Set an SLA. If the false positive rate exceeds 15% of comments, the tool needs tuning.
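
A minimal sketch of that gating logic, assuming the tool exposes per-finding severity and confidence (the field names are hypothetical):

BLOCKING = {"CRITICAL", "HIGH"}
CONFIDENCE_FLOOR = 0.8  # tune against your measured false positive rate

def triage(findings):
    """Split findings into posted comments and silently dropped noise."""
    post, dropped = [], []
    for f in findings:
        if f["confidence"] < CONFIDENCE_FLOOR:
            dropped.append(f)  # below the floor: never surfaced to reviewers
        else:
            f["blocking"] = f["severity"] in BLOCKING
            post.append(f)     # surfaced; blocking only for CRITICAL/HIGH
    return post, dropped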

Integration Architecture

The pattern that works best is AI as a pre-screen before human review, not as a replacement.

sequenceDiagram
    participant Dev as Developer
    participant CI as CI Pipeline
    participant AI as AI Review Bot
    participant Human as Human Reviewer
    Dev->>CI: Open PR
    CI->>AI: Trigger AI analysis
    AI-->>Dev: Post findings as PR comments\n(security, patterns, obvious bugs)
    Dev->>Dev: Address AI findings\n(or dismiss with reason)
    Note over Dev: PR is cleaner before human eyes see it
    Dev->>Human: Request human review
    Human->>Human: Focus on intent, design, domain logic
    Human-->>Dev: Approval or change requests
    Note over AI,Human: AI handles scale; human handles judgement

The key insight: AI review should make human review better, not replace it. When AI has already caught the obvious issues, human reviewers spend their cognitive budget on the things that actually require a human.

Configuration that matters

# .github/ai-review.yml
rules:
  security:
    severity_threshold: HIGH          # Only post HIGH and CRITICAL security findings
    categories:
      - OWASP_INJECTION
      - HARDCODED_SECRETS
      - INSECURE_DESERIALIZATION
      - PATH_TRAVERSAL
 
  quality:
    severity_threshold: CRITICAL      # Only block on critical quality issues
    
  suggestions:
    post_as: NON_BLOCKING             # Style and minor improvements — visible but not blocking
    max_per_pr: 5                     # Cap suggestions to prevent noise
 
exclude_paths:
  - "**/*.generated.java"            # Don't review generated code
  - "**/vendor/**"
  - "**/__tests__/**"                # Human reviews test quality

Less is more. Start with security findings only. Add more categories once you have calibrated the false positive rate.
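
Concretely, a phase-one configuration can be as small as this (same hypothetical schema as above):

# .github/ai-review.yml (phase 1: security findings only)
rules:
  security:
    severity_threshold: HIGH
    categories:
      - OWASP_INJECTION
      - HARDCODED_SECRETS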


Measuring Impact

Track these metrics from week one:

AI Code Review Health Dashboard
═══════════════════════════════════════════════
Signal quality
  True positive rate:         87%  (target: >80%)
  False positive rate:        9%   (target: <15%)
  Engineer dismissal rate:    12%  (rising → investigate)
 
Impact
  Security bugs caught pre-merge:    +83%
  Review cycle time:                 -38%
  Human review comments per PR:      -29% (focused on intent)
 
Coverage
  PRs with AI review:          98%
  Findings acted on:           73%
  Findings marked false +ve:   9%
  Findings dismissed no reason: 18% ← investigate this
═══════════════════════════════════════════════

The "dismissed without reason" metric is the canary. When it rises, engineers are ignoring AI comments. When engineers ignore AI comments, you are paying for a tool that produces noise. Investigate and tune before trust collapses.
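
If your tool does not report the metric directly, it is easy to derive from the findings log. A sketch, assuming each finding record carries a resolution field (the schema is hypothetical):

# Possible resolutions: "fixed", "dismissed_with_reason", "dismissed_no_reason"
def dismissal_without_reason_rate(findings):
    if not findings:
        return 0.0
    silent = sum(1 for f in findings if f["resolution"] == "dismissed_no_reason")
    return silent / len(findings)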


The Honest Summary

AI code review is a force multiplier for security and pattern-based quality concerns. It is not a replacement for human judgement on design, intent, and domain correctness. The teams that understand this distinction get real value. The teams that do not end up either over-relying on AI (and shipping logic bugs) or dismissing it entirely (and losing the security benefits).

The division of labour is simple: let AI handle the things it is reliably good at (security, resource management, obvious anti-patterns) and give human reviewers the space to focus on the things only humans can do (product correctness, architectural fit, test quality).

That combination is genuinely better than either alone.