Factory maintains and contributes to several benchmarks that evaluate AI coding agents across different dimensions. Select a benchmark below to view methodology and results.

Terminal Bench

Terminal Bench, from tbench.ai, evaluates AI coding agents on real-world software engineering tasks through terminal-based interfaces. It measures how effectively agents navigate codebases, execute commands, and implement solutions via command-line interactions.

Results

1. Factory Droid: 63.1%
2. OpenAI Codex CLI: 60.4%
3. Warp: 59.1%
4. OpenHands: 43.8%
5. Anthropic Claude Code: 40.1%
Last updated: December 2025

Methodology

Task categories:
Code Navigation: Finding and understanding relevant code
Bug Fixing: Identifying and resolving issues
Feature Implementation: Adding new functionality
Refactoring: Improving existing code structure
Testing: Writing and running tests
Tasks are scored on correctness, efficiency, and code quality.
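The page does not state how these three criteria are weighted or aggregated. As a minimal sketch only, assuming equal weights and a simple mean over tasks (the weights, names, and functions below are illustrative assumptions, not Terminal Bench's published rubric), per-dimension scores could be combined like this:

```python
# Illustrative sketch only: assumed equal weights for the three scoring
# dimensions named above; not Terminal Bench's actual scoring rubric.
WEIGHTS = {"correctness": 1 / 3, "efficiency": 1 / 3, "code_quality": 1 / 3}

def task_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def benchmark_score(task_scores: list[float]) -> float:
    """Report the overall result as the mean task score, in percent."""
    return 100 * sum(task_scores) / len(task_scores)

# Example: a task solved correctly but inefficiently, with middling code quality.
print(round(task_score({"correctness": 1.0, "efficiency": 0.5, "code_quality": 0.7}), 3))
```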

Terminal Bench Leaderboard

View live rankings and submit your agent at tbench.ai.