Results — Overall Pass Rate
Last updated: April 2026
Methodology
| Stage | Description |
|---|---|
| Task set | Hundreds of tasks across six legacy language families, with ten representative open samples |
| Task format | Natural language instruction, containerized source environment, reference solution, and hidden verification tests |
| Task types | Bug fixing, implementation, migration, and other legacy engineering work |
| Evaluation | Harbor-compatible tasks requiring agents to understand the specification, produce working code, and pass verification |
| Scoring | Pass rate across hidden tests for 12 model-agent combinations |
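The scoring row can be illustrated with a minimal sketch. It assumes a task counts as passed only when its hidden verification succeeds as a whole, which the table does not state explicitly. The record layout, the combination names (`combo-a`, `combo-b`), and the task IDs are hypothetical and are not taken from the benchmark harness.

```python
from collections import defaultdict

# Hypothetical result records: (model-agent combination, task id, passed verification?)
# These values are made up for illustration; they are not benchmark results.
results = [
    ("combo-a", "task-001", True),
    ("combo-a", "task-002", False),
    ("combo-a", "task-003", True),
    ("combo-b", "task-001", False),
    ("combo-b", "task-002", False),
    ("combo-b", "task-003", True),
]


def pass_rates(records):
    """Return {combination: fraction of its tasks that passed verification}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for combo, _task_id, ok in records:
        total[combo] += 1
        if ok:
            passed[combo] += 1
    return {combo: passed[combo] / total[combo] for combo in total}


rates = pass_rates(results)
for combo, rate in sorted(rates.items()):
    print(f"{combo}: {rate:.1%}")
```

With 12 model-agent combinations, the same function would yield one pass rate per combination, which is the quantity an overall pass-rate chart would plot.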
Benchmark Mix
| Language | Share | Example domains |
|---|---|---|
| COBOL | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
| Java 7 | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |
| BASIC | 6% | Business applications, accounting, data processing |
| C89 | 5% | Systems programming, low-level debugging, protocol implementation |
| Fortran | 5% | Scientific computing, numerical methods, physics simulation |
| Assembly | 5% | x86 firmware parsing, protocol decoding, hardware simulation |
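Because COBOL and Java 7 together account for 78% of the tasks, an overall pass rate is dominated by those two languages. The sketch below shows that arithmetic under stated assumptions: the shares come from the table above, while the per-language pass rates are hypothetical placeholders, not measured results. The listed shares sum to 99%, presumably because of rounding, so the sketch normalizes by their sum.

```python
# Shares from the Benchmark Mix table (percent of tasks per language).
shares = {
    "COBOL": 46,
    "Java 7": 32,
    "BASIC": 6,
    "C89": 5,
    "Fortran": 5,
    "Assembly": 5,
}

# Hypothetical per-language pass rates, for illustration only (not benchmark results).
example_rates = {
    "COBOL": 0.5,
    "Java 7": 0.5,
    "BASIC": 1.0,
    "C89": 1.0,
    "Fortran": 1.0,
    "Assembly": 1.0,
}


def weighted_pass_rate(shares, rates):
    """Combine per-language pass rates, weighting each by its share of tasks."""
    total_share = sum(shares.values())  # 99 for the table as published
    return sum(shares[lang] * rates[lang] for lang in shares) / total_share


overall = weighted_pass_rate(shares, example_rates)
print(f"Overall pass rate under these assumed rates: {overall:.1%}")
```

In this made-up case, an agent that solves every task in the four smaller languages but only half of the COBOL and Java 7 tasks still lands near 61% overall, which is why a per-language breakdown is worth reading alongside the headline number.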
Legacy-Bench
View sample tasks and the evaluation harness on GitHub
Read the writeup
Legacy-Bench: Can AI Agents Maintain the World’s Most Critical Software?
