Benchmark from Factory measuring AI agent performance on legacy engineering tasks across COBOL, Java 7, BASIC, C89, Fortran, and Assembly.

Results — Overall Pass Rate

Last updated: April 2026

Methodology

| Stage | Description |
| --- | --- |
| Task set | Hundreds of tasks across six legacy language families, with ten representative open samples |
| Task format | Natural-language instruction, containerized source environment, reference solution, and hidden verification tests |
| Task types | Bug fixing, implementation, migration, and other legacy engineering work |
| Evaluation | Harbor-compatible tasks requiring agents to understand the specification, produce working code, and pass verification |
| Scoring | Pass rate across hidden tests for 12 model-agent combinations |
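The scoring step can be sketched in a few lines: each model-agent combination is graded by the fraction of hidden verification tests it passes. The function and data layout below are illustrative assumptions, not Factory's actual harness.

```python
# Hypothetical sketch of pass-rate scoring: each (model, agent)
# combination gets one boolean per hidden test, and its score is the
# fraction that passed. Names and layout are illustrative only.

def pass_rate(results):
    """results: list of booleans, one per hidden verification test."""
    return sum(results) / len(results) if results else 0.0

# Example: two of the 12 model-agent combinations, with made-up outcomes.
runs = {
    ("model-a", "agent-x"): [True, True, False, True],
    ("model-b", "agent-y"): [True, False, False, False],
}

# Rank combinations by pass rate, highest first.
leaderboard = sorted(
    ((combo, pass_rate(tests)) for combo, tests in runs.items()),
    key=lambda item: item[1],
    reverse=True,
)
for (model, agent), rate in leaderboard:
    print(f"{model}/{agent}: {rate:.0%}")
```

A real harness would aggregate per-task test results from the containerized runs rather than hard-coded booleans, but the ranking logic is the same.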

Benchmark Mix

| Language | Share | Example domains |
| --- | --- | --- |
| COBOL | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
| Java 7 | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |
| BASIC | 6% | Business applications, accounting, data processing |
| C89 | 5% | Systems programming, low-level debugging, protocol implementation |
| Fortran | 5% | Scientific computing, numerical methods, physics simulation |
| Assembly | 5% | x86 firmware parsing, protocol decoding, hardware simulation |
Agents score highest on Java 7 bug fixing, where compiler and runtime feedback expose errors. COBOL remains the hardest: 31 of the 44 tasks that no model solved are COBOL tasks.

Legacy-Bench

View sample tasks and the evaluation harness on GitHub

Read the writeup

Legacy-Bench: Can AI Agents Maintain the World’s Most Critical Software?