Claude Opus 4.7, Gemini 3.1 Pro, and Other Models Score 0% on New SWE Benchmark

Models frequently chose to write their solutions in Python, even when the original software was written in C++, Rust, or Go.

Claude Opus 4.7, Gemini 3.1 Pro, GPT 5.4, and other models score 0% on a new benchmark developed by Meta, Harvard, and Stanford.

The evaluation framework, called ProgramBench, tests the ability of software engineering agents to develop complete software projects holistically from scratch. 

In this evaluation, agents receive a compiled software executable and its user documentation without any access to the original source code or the internet. 

They must then write source code and a build script to produce a program that matches the original software’s input-output behaviour. 
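The paper frames correctness purely in terms of observable behaviour. As a rough sketch, that amounts to differential testing: run the reference binary and the rebuilt binary on the same inputs and compare what they print and how they exit. The function below illustrates that idea only; its name, signature, and parameters are assumptions, not ProgramBench's actual harness.

```python
import subprocess

def behaviours_match(reference_bin, candidate_bin, test_inputs, timeout=10):
    """Illustrative differential check: both binaries must produce the same
    stdout and exit code on every test input. Names and signature are
    assumptions, not ProgramBench's actual harness."""
    for stdin_data in test_inputs:
        ref = subprocess.run([reference_bin], input=stdin_data,
                             capture_output=True, text=True, timeout=timeout)
        cand = subprocess.run([candidate_bin], input=stdin_data,
                              capture_output=True, text=True, timeout=timeout)
        if ref.stdout != cand.stdout or ref.returncode != cand.returncode:
            return False
    return True
```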

The evaluation deploys these language models within the mini-SWE-agent framework, a scaffold that gives the models terminal access to execute commands and write files.
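mini-SWE-agent's actual interface is more involved; the loop below is only a schematic of what terminal access means in such a scaffold, with `query_model`, the SUBMIT convention, and the limits all standing in as hypothetical placeholders rather than the project's real API.

```python
import subprocess

def run_agent_episode(query_model, task_prompt, max_turns=250):
    """Schematic agent loop: each turn the model proposes one shell command,
    the harness executes it and appends the output to the transcript.
    `query_model` and the SUBMIT convention are illustrative assumptions,
    not mini-SWE-agent's real API."""
    transcript = [task_prompt]
    for _ in range(max_turns):
        command = query_model(transcript)
        if command.strip() == "SUBMIT":   # the model decides it is finished
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=300)
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return transcript
```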

According to the researchers, technology companies increasingly use these models to “turn ideas expressed in natural language into full-fledged code repositories”, requiring the models to make independent, high-level architectural decisions. To achieve this, a software engineering agent operates as a language model “equipped with an agent scaffold to interact with a terminal environment”.

Researchers constructed 200 task instances from open-source GitHub repositories. These range from compact command-line utilities to widely used applications such as the PHP interpreter, the SQLite database, and the multimedia framework FFmpeg. 

To verify the generated code, the team uses automated behavioural tests generated via agent-driven fuzzing. The creators noted that “existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature.” 
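Agent-driven fuzzing means an agent, rather than a random generator, crafts the test inputs; the sketch below conveys only the simpler underlying idea of recording the reference binary's behaviour as expected test cases. All names, parameters, and the use of random bytes are illustrative assumptions.

```python
import random
import subprocess

def generate_behaviour_tests(reference_bin, n_cases=100, max_len=64, seed=0):
    """Sketch of deriving behavioural tests from the reference binary alone:
    feed it generated inputs and store (input, stdout, exit code) triples as
    the expected behaviour a candidate program must reproduce. ProgramBench
    uses an agent to craft inputs; random bytes are a simplification."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        stdin_data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
        ref = subprocess.run([reference_bin], input=stdin_data,
                             capture_output=True, timeout=10)
        cases.append({"input": stdin_data,
                      "expected_stdout": ref.stdout,
                      "expected_exit": ref.returncode})
    return cases
```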

In contrast, ProgramBench forces the model to choose the programming language, formulate algorithms, and define data structures without a predetermined skeleton.

Out of nine tested models, none passed all tests for any single task. Anthropic’s Claude Opus 4.7 achieved the highest partial success, passing 95% of tests on just 3% of the instances.

The models demonstrated distinct patterns during the development cycle. Despite the complexity of the tasks, agents voluntarily submitted their solutions before reaching the run limits in 98.1% of the 1,800 runs; only the remaining 1.9% exhausted the six-hour time allowance.

Furthermore, models frequently chose to write their solutions in Python, which accounted for 36% of all runs, even when the original software was written in C++, Rust, or Go. 

When researchers conducted a separate test allowing unrestricted web access, the AI models frequently bypassed the reverse-engineering process. Despite explicit system instructions forbidding the behaviour, models looked up existing source code up to 36% of the time.

Analysis also revealed that these models construct software differently from human engineers. The study found that “models favour monolithic, single-file implementations that diverge sharply from human-written code.” 

While human developers separate functions into modular directory structures, the models typically place all logic into a few root-level files. 

The agents also wrote substantially less code overall, generating between 10% and 29% as many functions as the reference implementations and compensating by making individual functions up to 1.62 times longer.

Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
