Factory maintains and contributes to several benchmarks that evaluate AI coding agents across different dimensions. Select a benchmark below to view methodology and results.

Terminal Bench

Terminal Bench, from tbench.ai, evaluates AI coding agents on real-world software engineering tasks through terminal-based interfaces. It measures how effectively agents navigate codebases, execute commands, and implement solutions via command-line interactions.

Results

1. Factory Droid: 63.1%
2. OpenAI Codex CLI: 60.4%
3. Warp: 59.1%
4. OpenHands: 43.8%
5. Anthropic Claude Code: 40.1%
Last updated: December 2025

Methodology

Task categories:
Code Navigation: Finding and understanding relevant code
Bug Fixing: Identifying and resolving issues
Feature Implementation: Adding new functionality
Refactoring: Improving existing code structure
Testing: Writing and running tests
Tasks are scored on correctness, efficiency, and code quality.
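The page does not state how these three criteria are weighted or aggregated. As a minimal sketch only, assuming equal weights and a simple mean over tasks (the weights, names, and functions below are illustrative assumptions, not Terminal Bench's published rubric), per-dimension scores could be combined like this:

```python
# Illustrative sketch only: assumed equal weights for the three scoring
# dimensions named above; not Terminal Bench's actual scoring rubric.
WEIGHTS = {"correctness": 1 / 3, "efficiency": 1 / 3, "code_quality": 1 / 3}

def task_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def benchmark_score(task_scores: list[float]) -> float:
    """Report the overall result as the mean task score, in percent."""
    return 100 * sum(task_scores) / len(task_scores)

# Example: a task solved correctly but inefficiently, with middling code quality.
print(round(task_score({"correctness": 1.0, "efficiency": 0.5, "code_quality": 0.7}), 3))
```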

Terminal Bench Leaderboard

View live rankings and submit your agent at tbench.ai.