Review Droid Benchmark: Which Model Reviews Code Best?

Cost vs. Quality
| Model | Mean F1 | Cost / PR |
|---|---|---|
| GPT-5.2 | 60.5% | $1.25 |
| Claude Opus 4.6 | 59.8% | $3.11 |
| Claude Sonnet 4.6 | 57.9% | $1.15 |
| Claude Opus 4.7 | 56.4% | $4.18 |
| GLM-5.1 | 56.3% | $1.06 |
| GPT-5.3 Codex | 56.2% | $1.69 |
| Gemini 3.1 Pro | 52.6% | $2.04 |
| GPT-5.4 Mini | 52.0% | $0.68 |
| Kimi K2.5 | 51.9% | $0.41 |
| Gemini 3 Flash | 50.0% | $0.34 |
| GPT-5.5 | 47.9% | $5.63 |
| GPT-5.4 | 47.5% | $2.01 |
| MiniMax M2.7 | 45.6% | $0.15 |
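
One way to read this table is quality per dollar. Below is a minimal sketch in Python, not part of the benchmark's scoring scripts, that ranks the models by mean F1 points per dollar of review cost; the numbers are copied directly from the table above.

```python
# Rank models by mean F1 points per dollar spent per PR.
results = {
    # model: (mean F1 in %, cost per PR in USD), copied from the table above
    "GPT-5.2": (60.5, 1.25),
    "Claude Opus 4.6": (59.8, 3.11),
    "Claude Sonnet 4.6": (57.9, 1.15),
    "Claude Opus 4.7": (56.4, 4.18),
    "GLM-5.1": (56.3, 1.06),
    "GPT-5.3 Codex": (56.2, 1.69),
    "Gemini 3.1 Pro": (52.6, 2.04),
    "GPT-5.4 Mini": (52.0, 0.68),
    "Kimi K2.5": (51.9, 0.41),
    "Gemini 3 Flash": (50.0, 0.34),
    "GPT-5.5": (47.9, 5.63),
    "GPT-5.4": (47.5, 2.01),
    "MiniMax M2.7": (45.6, 0.15),
}

# Sort by F1 points per dollar, descending.
ranked = sorted(results.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for model, (f1, cost) in ranked:
    print(f"{model:18s} {f1 / cost:7.1f} F1 points per dollar")
```

By this measure the cheapest models dominate: MiniMax M2.7 returns roughly 304 F1 points per dollar versus roughly 48 for the top-scoring GPT-5.2.
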
Methodology
| Stage | Description |
|---|---|
| Test set | 50 PRs across 5 repos (Sentry, Grafana, Keycloak, Discourse, Cal.com) covering Python, Go, Java, Ruby, and TypeScript |
| Golden set | 167 manually validated bugs (v3) with exact file/line locations and bug-type classifications |
| Model evaluation | Each model reviews every PR via Droid Action using a standardized prompt |
| LLM judge | An independent LLM matches model comments to golden bugs by semantic equivalence (sketched in code after this table) |
| Cross-judge validation | Matches are spot-checked with a second judge to control for grading bias |
| F1 calculation | F1 is the harmonic mean of precision (the fraction of a model's comments that flag real bugs) and recall (the fraction of golden bugs the model caught); see the sketch after this table |
| Multiple runs | Each model is evaluated over multiple runs to measure consistency |
| Outlier exclusion | Runs that error out or hit token limits are excluded |
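
To make the judging stage concrete, here is a hypothetical sketch of the matching loop. `JUDGE_PROMPT` and `call_llm` are illustrative placeholders, not the benchmark's actual judge harness or prompt, and golden bugs are represented as plain strings for simplicity.

```python
# Hypothetical judge step: for each golden bug, an independent LLM decides
# whether any model comment describes the same underlying bug.
JUDGE_PROMPT = """You are grading a code review benchmark.
Golden bug: {golden}
Model comment: {comment}
Do these describe the same underlying bug? Answer MATCH or NO_MATCH."""

def judge_matches(model_comments, golden_bugs, call_llm):
    """Return the set of golden bugs matched by at least one model comment.

    call_llm -- placeholder for whatever client sends a prompt to the judge
                model and returns its text response.
    """
    matched = set()
    for golden in golden_bugs:
        for comment in model_comments:
            verdict = call_llm(JUDGE_PROMPT.format(golden=golden, comment=comment))
            if verdict.strip() == "MATCH":
                matched.add(golden)
                break  # one matching comment is enough to count this bug as caught
    return matched
```
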
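And here is a minimal sketch of the F1 calculation described in the table, assuming the judge has already produced matched counts. The function and variable names are illustrative, not taken from the benchmark's scoring scripts.

```python
# F1 from judge-matched counts for a single run.
def f1_score(matched: int, total_comments: int, total_golden: int) -> float:
    """Harmonic mean of precision and recall.

    matched        -- model comments the judge matched to a golden bug
    total_comments -- all bug comments the model produced
    total_golden   -- all bugs in the golden set for the PRs reviewed
    """
    if total_comments == 0 or total_golden == 0:
        return 0.0
    precision = matched / total_comments  # fraction of comments that are real bugs
    recall = matched / total_golden       # fraction of golden bugs caught
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 90 matched comments out of 140 produced, against 167 golden bugs.
print(f"{f1_score(90, 140, 167):.3f}")  # 0.586
```
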
View the full methodology, raw results, and scoring scripts on GitHub.