Code Review at Scale: What Changed When the Team Hit 50
We had a beautiful review culture when we were twelve engineers. Every PR got two thoughtful reviewers, discussion was spirited, and nothing shipped without someone catching the logic bug in the edge case. Then we hired. A lot. By the time we crossed fifty engineers, the review queue had become the single biggest drag on cycle time — and the engineers doing most of the reviewing were quietly burning out.
The problem wasn't that people stopped caring. The problem was that the process designed for a room of twelve was now being stretched across six time zones, four product squads, and a codebase that no single person understood end-to-end anymore. This post is about what we changed, what we tried that failed, and the specific mechanics that made a real difference.
Why Review Bottlenecks Happen at Scale
The math is uncomfortable. At 10 engineers with 2 required reviewers per PR and roughly one PR per engineer per day, the average engineer reviews about 2 PRs a day. Add 40 more engineers, keep the same policy, and PR volume grows fivefold, but the reviews don't spread evenly: they land on whoever has context, and that is disproportionately the senior engineers, whose load climbs past 8 PRs a day. The seniors you most want shipping features end up spending half their day in review queues.
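To make that concrete, here is a back-of-the-envelope sketch; the one-PR-per-engineer-per-day rate and the size of the high-context reviewer pool are illustrative assumptions, not measurements from our repos.

```python
# Rough model of per-reviewer load: total review slots per day, divided
# across the engineers who actually have the context to fill them.
def reviews_per_reviewer(engineers: int, prs_per_engineer_per_day: float,
                         required_reviewers: int, context_holders: int) -> float:
    review_slots = engineers * prs_per_engineer_per_day * required_reviewers
    return review_slots / context_holders

# 10 engineers, everyone has context: about 2 reviews per person per day.
print(reviews_per_reviewer(10, 1.0, 2, 10))   # 2.0
# 50 engineers, but only ~12 people with deep context on the hot paths:
# their load passes 8 reviews per day.
print(reviews_per_reviewer(50, 1.0, 2, 12))   # ~8.3
```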
There's also a cognitive load dimension. Reviewing a PR well requires context: what the feature is supposed to do, what the existing abstractions look like, which invariants matter. When a reviewer lacks that context, they either rubber-stamp (high risk) or deep-dive to rebuild it (high cost). At small scale you share enough context passively. At scale, that ambient knowledge disappears.
The Async-First Shift
The most impactful change was not technical — it was cultural: we made asynchronous review the default, not a fallback.
Synchronous review (pinging someone on Slack, scheduling a 30-min walk-through) feels faster in the moment. It isn't. It depends on both people being free at the same time, it leaves no written record for future readers, and it creates a dependency on individual availability that breaks the moment someone is on leave.
We wrote an explicit norm: you may not block a PR on synchronous discussion unless the PR is emergency-critical. Everything else gets resolved in PR comments with a 24-hour SLA. We also mandated that comment threads be closed (either addressed or explicitly deferred) before merge — no more "we discussed this in Slack" as a justification for removing a comment.
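A rule like this only holds if it is checked mechanically. One way is a CI step against GitHub's GraphQL API, which exposes an `isResolved` flag on review threads; here is a minimal sketch, with the environment variable names standing in for whatever your CI provides.

```python
# Fail the check if the PR still has unresolved review threads.
import os
import sys

import requests

QUERY = """
query($owner: String!, $repo: String!, $number: Int!) {
  repository(owner: $owner, name: $repo) {
    pullRequest(number: $number) {
      reviewThreads(first: 100) { nodes { isResolved } }
    }
  }
}
"""

def unresolved_thread_count(owner: str, repo: str, number: int, token: str) -> int:
    resp = requests.post(
        "https://api.github.com/graphql",
        headers={"Authorization": f"bearer {token}"},
        json={"query": QUERY, "variables": {"owner": owner, "repo": repo, "number": number}},
    )
    resp.raise_for_status()
    threads = resp.json()["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
    return sum(1 for t in threads if not t["isResolved"])

if __name__ == "__main__":
    count = unresolved_thread_count(
        os.environ["REPO_OWNER"], os.environ["REPO_NAME"],
        int(os.environ["PR_NUMBER"]), os.environ["GITHUB_TOKEN"],
    )
    if count:
        sys.exit(f"{count} unresolved review thread(s); resolve or explicitly defer before merge.")
```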
The discipline around written review threads paid dividends that weren't obvious at first. Six months later, when a junior engineer asked why a particular abstraction existed, the answer was a PR comment — not a tribal memory that had left the company.
CODEOWNERS as Accountability Infrastructure
GitHub's CODEOWNERS file is underused at most companies. When configured properly, it does two things: it auto-assigns reviewers who have context, and it makes ownership explicit and auditable.
Our initial mistake was treating CODEOWNERS as a giant list of "who can review anything." It needs to be granular.
```
# Payment processing: all payment-related changes need both a payment team member
# and a security reviewer
/services/payments/       @arika/payments-team @arika/security
/services/payments/api/   @arika/payment-api @arika/security

# Infrastructure changes always go to platform
/infra/                   @arika/platform-team

# Shared libraries need library maintainers
/packages/shared-ui/      @arika/ui-foundation-team

# Generated files — no review required
/generated/               @arika/no-review-bots
```

The key discipline: ownership is assigned at the team level, not the individual level. Individuals rotate in and out of team membership; assigning to individuals means CODEOWNERS entries go stale every time someone changes roles. Order matters too: the last matching pattern wins, which is why the more specific /services/payments/api/ entry repeats @arika/security rather than silently dropping the security requirement.
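A small CI lint keeps that discipline honest by rejecting owners that aren't team slugs. A rough sketch, assuming the @arika org namespace from the example above:

```python
# Flag CODEOWNERS entries that point at individuals instead of "@org/team" slugs.
import re
import sys

TEAM_PATTERN = re.compile(r"^@arika/[\w.-]+$")  # assumes the @arika org

def individual_owners(codeowners_path: str = "CODEOWNERS") -> list[str]:
    offenders = []
    with open(codeowners_path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            _path, *owners = line.split()
            offenders += [o for o in owners if not TEAM_PATTERN.match(o)]
    return offenders

if __name__ == "__main__":
    bad = sorted(set(individual_owners()))
    if bad:
        sys.exit(f"Individually-scoped CODEOWNERS entries found: {bad}")
```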
We also introduced ownership rotation. Every quarter, one engineer from each team rotates into the ownership of an adjacent team's critical path — payment engineers rotate through the API gateway, API engineers rotate through the data pipeline. The goal isn't deep mastery; it's enough familiarity to catch "this shouldn't be touching that" problems and reduce the knowledge silos that make reviews shallow.
AI Pre-Review: What It Actually Catches (and Misses)
We integrated an AI pre-review step: a self-hosted model runs on every PR before human reviewers are notified. After six months of use, here's a candid assessment.
What AI pre-review catches reliably:
- Obvious null dereference and missing error handling
- Security anti-patterns (hardcoded secrets, SQL string concatenation, missing input validation)
- Naming inconsistencies with the rest of the codebase
- Missing or incorrect test coverage for obvious branches
- Drift from documented API contracts
What it consistently misses:
- Whether the abstraction chosen is appropriate for future requirements
- Whether the PR is solving the right problem
- Subtle race conditions in distributed systems
- Business logic correctness that requires domain knowledge
- Whether a change creates problematic coupling between services
We treat AI pre-review as a first-pass filter, not a gatekeeper. The output gets posted as a bot comment before any human sees the PR. If the AI pre-review is clean, we reduce required human reviewers from 2 to 1 for low-risk changes. If it flags something, that flag is either addressed or explicitly overridden with a comment before merge.
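The gating step itself is just a policy function over the pre-review output. A sketch of that decision; `PreReviewResult` and the risk label are stand-ins for whatever your pre-review bot produces, not a real interface.

```python
# Decide how many human reviewers a PR needs, given the AI pre-review output.
from dataclasses import dataclass

@dataclass
class PreReviewResult:
    findings: list[str]   # issues the model flagged, empty if the pre-review is clean
    risk_label: str       # "low" | "medium" | "high", from path- and size-based rules

def required_human_reviewers(result: PreReviewResult) -> int:
    # Clean pre-review on a low-risk change: one human reviewer is enough.
    if not result.findings and result.risk_label == "low":
        return 1
    # Anything flagged, or anything risky, keeps the default of two.
    return 2
```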
The measurable outcome: average time-to-first-human-review dropped from 18 hours to 11 hours, because the AI comment gives reviewers a starting point instead of a blank page.
PR Size as a First-Class Metric
Large PRs are the root cause of most review quality problems. A 1,200-line PR gets rubber-stamped. A 120-line PR gets a real review.
We track median PR size per team as an engineering metric, reviewed quarterly. Not as a punitive measure — the goal is to surface habits and patterns, not shame individuals. Teams with consistently large PRs typically have one of three problems: they lack feature-flag infrastructure so they can't ship incrementally, they have a test setup that makes small changes expensive, or they have implicit norms that treat "big refactor in one PR" as a sign of thoroughness.
Our target: 80% of PRs under 400 lines changed, 95% under 800. This is looser than the "200-line PR" guidance you'll read in most engineering blogs, because we found that very strict limits created gaming behavior (splitting a logical change across five PRs to hit the number).
```python
# Example: compute the PR size distribution for recently merged PRs via the GitHub API.
# The list endpoint omits additions/deletions, so each PR is fetched individually;
# fine for a periodic report, add pagination for busier repos.
from datetime import datetime, timedelta, timezone

import requests


def get_pr_size_distribution(repo: str, token: str, days: int = 30) -> dict:
    headers = {"Authorization": f"token {token}"}
    base = f"https://api.github.com/repos/{repo}/pulls"
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)

    params = {"state": "closed", "sort": "updated", "direction": "desc", "per_page": 100}
    response = requests.get(base, headers=headers, params=params)
    response.raise_for_status()

    sizes = []
    for pr in response.json():
        merged_at = pr.get("merged_at")
        if not merged_at or datetime.fromisoformat(merged_at.replace("Z", "+00:00")) < cutoff:
            continue  # skip unmerged PRs and anything outside the window
        # The list response has no size fields; fetch the full PR to get them.
        detail = requests.get(f"{base}/{pr['number']}", headers=headers).json()
        sizes.append(detail.get("additions", 0) + detail.get("deletions", 0))

    sizes.sort()
    n = len(sizes)
    if n == 0:
        return {"count": 0}
    return {
        "p50": sizes[n // 2],
        "p80": sizes[int(n * 0.8)],
        "p95": sizes[int(n * 0.95)],
        "mean": sum(sizes) / n,
        "count": n,
    }
```

Review Checklists That Teams Own
The worst checklists are corporate mandates nobody reads. The best checklists are written by the team that uses them and reviewed every quarter.
We moved to team-owned review checklists stored in the same repository as the code. Each team's CONTRIBUTING.md includes a short PR checklist — typically 6–10 items — that reflects the failure modes their codebase has actually experienced.
The payments team checklist includes: "Is idempotency key handling covered by a test?" The data pipeline team includes: "Does this change affect backfill behavior, and is that documented?" These are not generic best practices — they're hard-won lessons encoded as reminders.
The discipline of writing the checklist is as valuable as the checklist itself. When a team writes "does this change require a migration?" they're acknowledging that they've been burned by undocumented migrations. The artifact forces the conversation.
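One way to keep those checklists in front of reviewers is a bot step that maps the paths a PR touches to the owning team's checklist and posts it as a comment. A sketch, assuming each team keeps its checklist under a "## PR checklist" heading in its own CONTRIBUTING.md; the paths and heading are illustrative, not our exact layout.

```python
# Build a comment containing the checklists of every team whose paths a PR touches.
from pathlib import Path

# Path prefix -> the CONTRIBUTING.md holding that team's checklist.
CHECKLIST_INDEX = {
    "services/payments/": "services/payments/CONTRIBUTING.md",
    "infra/": "infra/CONTRIBUTING.md",
    "packages/shared-ui/": "packages/shared-ui/CONTRIBUTING.md",
}

def checklist_comment(changed_files: list[str]) -> str:
    sections = []
    for prefix, contributing in CHECKLIST_INDEX.items():
        if not any(f.startswith(prefix) for f in changed_files):
            continue
        text = Path(contributing).read_text()
        if "## PR checklist" in text:
            # Keep only the checklist section, up to the next second-level heading.
            body = text.split("## PR checklist", 1)[1].split("\n## ", 1)[0]
            sections.append(f"**Checklist for {prefix}**\n{body.strip()}")
    return "\n\n".join(sections)
```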
Reviewer Fatigue and the Rotation Model
We tried a lot of things to address reviewer fatigue before landing on what actually worked: a formal reviewer rotation, tracked and enforced by automation.
Every squad has a "review rotation" — a queue of engineers who take the first review assignment for incoming PRs. The rotation advances daily. Engineers in the rotation are expected to do their assigned reviews within 4 business hours. Engineers not in the rotation are not expected to review anything unless specifically requested.
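The assignment itself needs very little machinery; a deterministic daily rotation that anyone can recompute is enough. A minimal sketch, with an illustrative squad roster.

```python
# Deterministic daily rotation: everyone can work out who is "on" today.
from datetime import date

ROTATION_EPOCH = date(2024, 1, 1)  # arbitrary fixed start date

def on_rotation(squad: list[str], day: date | None = None) -> str:
    day = day or date.today()
    return squad[(day - ROTATION_EPOCH).days % len(squad)]

# First review assignment for today's incoming PRs in this (made-up) squad:
print(on_rotation(["mira", "deshawn", "kenji", "priya", "tomas"]))
```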
This sounds rigid. In practice, it's liberating. Engineers know exactly when they're "on" and when they're not. When you're not in the rotation, you can get into deep work without Slack notifications pulling you out every 45 minutes.
We track rotation health with two metrics: average time-to-first-review from rotation members (target: under 4 hours) and review completion rate (target: 95% of assigned reviews completed without reassignment). Both are reviewed monthly by engineering managers.
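Both numbers can be approximated straight from the GitHub API. A sketch of the time-to-first-review piece; note it measures wall-clock hours from PR creation to the earliest submitted review, which is a rougher proxy than the business-hours, rotation-scoped metric described above.

```python
# Hours from PR creation to the first submitted review, via the REST API.
from datetime import datetime

import requests

def hours_to_first_review(repo: str, number: int, token: str) -> float | None:
    headers = {"Authorization": f"token {token}"}
    base = f"https://api.github.com/repos/{repo}/pulls/{number}"
    pr = requests.get(base, headers=headers).json()
    reviews = requests.get(f"{base}/reviews", headers=headers).json()
    submitted = [r["submitted_at"] for r in reviews if r.get("submitted_at")]
    if not submitted:
        return None  # no review yet
    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    first = min(datetime.fromisoformat(t.replace("Z", "+00:00")) for t in submitted)
    return (first - opened).total_seconds() / 3600
```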
Key Takeaways
- Synchronous review is a trap at scale — async with written, closed threads is faster and produces searchable institutional memory.
- CODEOWNERS should be team-scoped, not individual-scoped, and paired with explicit ownership rotation to prevent knowledge silos.
- AI pre-review is a useful first-pass filter for mechanical issues and security patterns, but cannot replace domain-aware human judgment for architecture and business logic.
- PR size is a lagging indicator of engineering health — teams with consistently large PRs usually have a tooling or norm problem, not a discipline problem.
- Team-owned review checklists encoding actual past failures outperform generic best-practice lists from corporate policy.
- Formal reviewer rotations with 4-hour SLAs give engineers predictability and protect deep work time without sacrificing review quality.