Benchmark from Factory measuring AI agent performance on legacy engineering tasks across COBOL, Java 7, BASIC, C89, Fortran, and Assembly.

Results — Overall Pass Rate

Last updated: April 2026

Methodology

| Stage | Description |
| --- | --- |
| Task set | Hundreds of tasks across six legacy language families, with ten representative open samples |
| Task format | Natural-language instruction, containerized source environment, reference solution, and hidden verification tests |
| Task types | Bug fixing, implementation, migration, and other legacy engineering work |
| Evaluation | Harbor-compatible tasks requiring agents to understand the specification, produce working code, and pass verification |
| Scoring | Pass rate across hidden tests for 12 model-agent combinations |
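The scoring step can be sketched in a few lines: each model-agent combination is graded by the fraction of hidden verification tests it passes. The function and data layout below are illustrative assumptions, not Factory's actual harness.

```python
# Hypothetical sketch of pass-rate scoring: each (model, agent)
# combination gets one boolean per hidden test, and its score is the
# fraction that passed. Names and layout are illustrative only.

def pass_rate(results):
    """results: list of booleans, one per hidden verification test."""
    return sum(results) / len(results) if results else 0.0

# Example: two of the 12 model-agent combinations, with made-up outcomes.
runs = {
    ("model-a", "agent-x"): [True, True, False, True],
    ("model-b", "agent-y"): [True, False, False, False],
}

# Rank combinations by pass rate, highest first.
leaderboard = sorted(
    ((combo, pass_rate(tests)) for combo, tests in runs.items()),
    key=lambda item: item[1],
    reverse=True,
)
for (model, agent), rate in leaderboard:
    print(f"{model}/{agent}: {rate:.0%}")
```

A real harness would aggregate per-task test results from the containerized runs rather than hard-coded booleans, but the ranking logic is the same.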

Benchmark Mix

| Language | Share | Example domains |
| --- | --- | --- |
| COBOL | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
| Java 7 | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |
| BASIC | 6% | Business applications, accounting, data processing |
| C89 | 5% | Systems programming, low-level debugging, protocol implementation |
| Fortran | 5% | Scientific computing, numerical methods, physics simulation |
| Assembly | 5% | x86 firmware parsing, protocol decoding, hardware simulation |
Agents score highest on Java 7 bug fixing, where compiler and runtime feedback expose errors. COBOL remains the hardest: 31 of the 44 tasks that no model solved are COBOL tasks.

Legacy-Bench

View sample tasks and the evaluation harness on GitHub

Read the writeup

Legacy-Bench: Can AI Agents Maintain the World’s Most Critical Software?