Open-source benchmark from droid-code-review-evals measuring how well AI models catch real bugs in code review. Evaluates 13 frontier and open-source models across 50 pull requests from 5 large open-source codebases (Sentry, Grafana, Keycloak, Discourse, Cal.com), scored against a manually curated golden set of 167 validated bugs.

Cost vs. Quality

| Model | Mean F1 | Cost / PR |
| --- | --- | --- |
| GPT-5.2 | 60.5% | $1.25 |
| Claude Opus 4.6 | 59.8% | $3.11 |
| Claude Sonnet 4.6 | 57.9% | $1.15 |
| Claude Opus 4.7 | 56.4% | $4.18 |
| GLM-5.1 | 56.3% | $1.06 |
| GPT-5.3 Codex | 56.2% | $1.69 |
| Gemini 3.1 Pro | 52.6% | $2.04 |
| GPT-5.4 Mini | 52.0% | $0.68 |
| Kimi K2.5 | 51.9% | $0.41 |
| Gemini 3 Flash | 50.0% | $0.34 |
| GPT-5.5 | 47.9% | $5.63 |
| GPT-5.4 | 47.5% | $2.01 |
| MiniMax M2.7 | 45.6% | $0.15 |
Last updated: April 2026.

GPT-5.2 leads on quality at about 40% of the cost of Claude Opus 4.6. Open-source models like Kimi K2.5 and MiniMax M2.7 deliver roughly 75–86% of GPT-5.2's quality at about 3–8× lower cost per PR, opening the door to multi-pass and ensemble review strategies.
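One simple way to read the tradeoff is F1 points per dollar. The snippet below is a minimal illustrative sketch using a few rows from the table above; it is not part of the benchmark's scoring scripts, and quality-per-dollar is just one possible lens on the data.

```python
# Illustrative only: rank a few leaderboard rows by F1 points per dollar.
# Values are taken from the Cost vs. Quality table above.
leaderboard = [
    ("GPT-5.2", 60.5, 1.25),
    ("Claude Opus 4.6", 59.8, 3.11),
    ("Kimi K2.5", 51.9, 0.41),
    ("MiniMax M2.7", 45.6, 0.15),
]

for name, f1, cost in sorted(leaderboard, key=lambda row: row[1] / row[2], reverse=True):
    print(f"{name:<16} {f1:4.1f}% F1   ${cost:.2f}/PR   {f1 / cost:6.1f} F1 pts per $")
```

By this measure MiniMax M2.7 delivers roughly 300 F1 points per dollar versus about 48 for GPT-5.2, which is what makes multi-pass or ensemble strategies built on cheaper models attractive.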

Methodology

| Stage | Description |
| --- | --- |
| Test set | 50 PRs across 5 repos (Sentry, Grafana, Keycloak, Discourse, Cal.com) covering Python, Go, Java, Ruby, and TypeScript |
| Golden set | 167 manually validated bugs (v3) with exact file/line locations and bug-type classifications |
| Model evaluation | Each model reviews every PR via Droid Action using a standardized prompt |
| LLM judge | An independent LLM matches model comments to golden comments by semantic equivalence |
| Cross-judge validation | Matches are spot-checked with a second judge to control for grading bias |
| F1 calculation | F1 combines precision (fraction of comments that are real bugs) and recall (fraction of golden bugs caught); see the sketch after this table |
| Multiple runs | Each model is evaluated over multiple runs to measure consistency |
| Outlier exclusion | Runs that error out or hit token limits are excluded |
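As a concrete illustration of the F1 row above, the sketch below computes precision, recall, and F1 for a single hypothetical run. The function and the counts are invented for illustration; this is not the benchmark's actual scoring code.

```python
# Hypothetical example of the F1 calculation described above.
def f1_score(matched: int, total_comments: int, golden_bugs: int) -> float:
    """F1 from matched comments (true positives), total model comments, and golden-set size."""
    precision = matched / total_comments  # fraction of comments that are real bugs
    recall = matched / golden_bugs        # fraction of golden bugs caught
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. a run that leaves 120 comments, 90 of which the judge matches to the 167 golden bugs
print(f"F1 = {f1_score(matched=90, total_comments=120, golden_bugs=167):.1%}")
```

The leaderboard's Mean F1 presumably averages this per-run score across each model's runs, with errored or token-limited runs excluded as noted above.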

Review Droid Benchmark

View the full methodology, raw results, and scoring scripts on GitHub

Read the writeup: Which Model Reviews Code Best?