Claude Opus 4.7, Gemini 3.1 Pro, and Other Models Score 0% on New SWE Benchmark

Models frequently chose to write their solutions in Python, even when the original software was written in C++, Rust, or Go.

Claude Opus 4.7, Gemini 3.1 Pro, GPT 5.4, and other models score 0% on a new benchmark developed by Meta, Harvard, and Stanford.

The evaluation framework, called ProgramBench, tests the ability of software engineering agents to develop complete software projects holistically from scratch. 

In this evaluation, agents receive a compiled software executable and its user documentation without any access to the original source code or the internet. 

They must then write source code and a build script to produce a program that matches the original software’s input-output behaviour. 
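The paper frames correctness purely in terms of observable behaviour. As a rough sketch, that amounts to differential testing: run the reference binary and the rebuilt binary on the same inputs and compare what they print and how they exit. The function below illustrates that idea only; its name, signature, and parameters are assumptions, not ProgramBench's actual harness.

```python
import subprocess

def behaviours_match(reference_bin, candidate_bin, test_inputs, timeout=10):
    """Illustrative differential check: both binaries must produce the same
    stdout and exit code on every test input. Names and signature are
    assumptions, not ProgramBench's actual harness."""
    for stdin_data in test_inputs:
        ref = subprocess.run([reference_bin], input=stdin_data,
                             capture_output=True, text=True, timeout=timeout)
        cand = subprocess.run([candidate_bin], input=stdin_data,
                              capture_output=True, text=True, timeout=timeout)
        if ref.stdout != cand.stdout or ref.returncode != cand.returncode:
            return False
    return True
```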

The evaluation deploys these language models within the mini-SWE-agent framework, a scaffold that gives the models terminal access to execute commands and write files.
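mini-SWE-agent's actual interface is more involved; the loop below is only a schematic of what terminal access means in such a scaffold, with `query_model`, the SUBMIT convention, and the limits all standing in as hypothetical placeholders rather than the project's real API.

```python
import subprocess

def run_agent_episode(query_model, task_prompt, max_turns=250):
    """Schematic agent loop: each turn the model proposes one shell command,
    the harness executes it and appends the output to the transcript.
    `query_model` and the SUBMIT convention are illustrative assumptions,
    not mini-SWE-agent's real API."""
    transcript = [task_prompt]
    for _ in range(max_turns):
        command = query_model(transcript)
        if command.strip() == "SUBMIT":   # the model decides it is finished
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=300)
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return transcript
```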

According to the researchers, technology companies increasingly use these models to “turn ideas expressed in natural language into full-fledged code repositories”, requiring the models to make independent, high-level architectural decisions. To achieve this, a software engineering agent operates as a language model “equipped with an agent scaffold to interact with a terminal environment”.

Researchers constructed 200 task instances from open-source GitHub repositories. These range from compact command-line utilities to widely used applications such as the PHP interpreter, the SQLite database, and the multimedia framework FFmpeg. 

To verify the generated code, the team uses automated behavioural tests generated via agent-driven fuzzing. The creators noted that “existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature.” 
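Agent-driven fuzzing means an agent, rather than a random generator, crafts the test inputs; the sketch below conveys only the simpler underlying idea of recording the reference binary's behaviour as expected test cases. All names, parameters, and the use of random bytes are illustrative assumptions.

```python
import random
import subprocess

def generate_behaviour_tests(reference_bin, n_cases=100, max_len=64, seed=0):
    """Sketch of deriving behavioural tests from the reference binary alone:
    feed it generated inputs and store (input, stdout, exit code) triples as
    the expected behaviour a candidate program must reproduce. ProgramBench
    uses an agent to craft inputs; random bytes are a simplification."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        stdin_data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
        ref = subprocess.run([reference_bin], input=stdin_data,
                             capture_output=True, timeout=10)
        cases.append({"input": stdin_data,
                      "expected_stdout": ref.stdout,
                      "expected_exit": ref.returncode})
    return cases
```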

In contrast, ProgramBench forces the model to choose the programming language, formulate algorithms, and define data structures without a predetermined skeleton.

Out of nine tested models, none passed all tests for any single task. Anthropic’s Claude Opus 4.7 achieved the highest partial success, passing 95% of tests on just 3% of the instances.

The models demonstrated distinct patterns during the development cycle. Despite the complexity of the tasks, agents voluntarily submitted their solutions before reaching the run limits in 98.1% of the 1,800 runs; only the remaining 1.9% exhausted the six-hour time allowance.

Furthermore, models frequently chose to write their solutions in Python, which accounted for 36% of all runs, even when the original software was written in C++, Rust, or Go. 

When researchers conducted a separate test allowing unrestricted web access, the AI models frequently bypassed the reverse-engineering process. Despite explicit system instructions forbidding the behaviour, models looked up existing source code up to 36% of the time.

Analysis also revealed that these models construct software differently from human engineers. The study found that “models favour monolithic, single-file implementations that diverge sharply from human-written code.” 

While human developers separate functions into modular directory structures, the models typically place all logic into a few root-level files. 

The agents also wrote substantially less code overall, generating between 10% and 29% as many functions as the reference implementations and compensating by making individual functions up to 1.62 times longer.

Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
