GPT-5.5 Beats Claude and Gemini in New Coding Benchmark DeepSWE

Claude Opus 4.7 and GPT-5.4 wrote new repository-native tests in over 80% of DeepSWE runs.

OpenAI’s GPT-5.5 has emerged as the top-performing AI coding model on DeepSWE, a new long-horizon software engineering benchmark designed to test whether frontier AI agents can independently complete realistic software development tasks from start to finish.

The benchmark places GPT-5.5 well ahead of competing models from Anthropic and Google. GPT-5.5 achieved the highest score overall at 70%, followed by GPT-5.4 at 56%, Claude Opus 4.7 at 54%, and Gemini 3.1 Pro at just 10%.

DeepSWE was developed by researchers from Datacurve: Co-Founder Charley Lee, CEO Serena Ge, and Founding Engineers Wenqi Huang and Leonard Tng. It evaluates AI models on original software engineering tasks written entirely from scratch.

Unlike traditional coding evaluations that focus on isolated bug fixes or pre-existing GitHub pull requests, DeepSWE tests whether agents can perform full-stack engineering work with minimal guidance.

According to the researchers, DeepSWE was created to better reflect how developers actually use AI coding models today. 

Existing public benchmarks such as SWE-Bench Pro often contain detailed implementation clues, code snippets, and references to existing fixes that models may already have seen during training.

“This makes DeepSWE a cleaner test of whether an agent can solve a novel software engineering problem, rather than recall, retrieve, or rediscover a public fix,” the report stated.

The benchmark consists of 113 tasks spanning 91 open-source repositories across TypeScript, Go, Python, JavaScript, and Rust. AI agents are given short, behaviour-focused prompts and must independently explore repositories, determine architecture, implement functionality, and verify correctness.

The evaluation framework uses a standardised harness called mini-swe-agent, which provides all models with identical tooling and prompts.

The tasks themselves are substantially more complex than earlier benchmarks. On average, DeepSWE solutions require 668 lines of code across seven files, compared to roughly 120 lines across five files in SWE-Bench Pro.
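To put those averages in proportion, the short calculation below works only from the four figures quoted above. The variable names are illustrative, and the per-file densities are a derived back-of-the-envelope figure, not something the report itself states.

```python
# Average solution sizes as quoted in the article.
deepswe_lines, deepswe_files = 668, 7
pro_lines, pro_files = 120, 5  # "roughly 120 lines across five files"

# How much larger a typical DeepSWE solution is than a SWE-Bench Pro one.
line_ratio = deepswe_lines / pro_lines
file_ratio = deepswe_files / pro_files

# Derived figure (not from the report): average lines changed per file.
deepswe_lines_per_file = deepswe_lines / deepswe_files
pro_lines_per_file = pro_lines / pro_files

print(f"Lines of code: {line_ratio:.1f}x")
print(f"Files touched: {file_ratio:.1f}x")
print(f"Lines per file: {deepswe_lines_per_file:.0f} vs {pro_lines_per_file:.0f}")
```

On these numbers, most of the extra size comes from heavier changes within each file rather than from touching many more files.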

Researchers also focused heavily on evaluation reliability. Instead of inheriting tests from historical pull requests, DeepSWE uses custom behavioural verifiers designed specifically for each task. These tests assess observable software behaviour rather than checking for specific implementation patterns.
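The distinction between behavioural and implementation-specific checks can be illustrated with a minimal sketch. Everything in it is hypothetical: the function names, the toy task (removing duplicates while preserving order), and the test cases are assumptions chosen for illustration, not code from DeepSWE.

```python
from typing import Callable, List


def dedupe_with_loop(items: List) -> List:
    """One valid implementation: explicit loop with a 'seen' set."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_with_dict(items: List) -> List:
    """A structurally different implementation: dicts keep insertion order."""
    return list(dict.fromkeys(items))


def behavioural_verifier(fn: Callable[[List], List]) -> bool:
    """Check only observable input/output behaviour, never how fn is written."""
    cases = [
        ([], []),
        ([1, 1, 2], [1, 2]),
        (["b", "a", "b"], ["b", "a"]),
    ]
    return all(fn(list(given)) == expected for given, expected in cases)


print(behavioural_verifier(dedupe_with_loop))
print(behavioural_verifier(dedupe_with_dict))
print(behavioural_verifier(sorted))  # wrong behaviour: keeps duplicates
```

A verifier of this kind accepts both correct implementations and rejects the incorrect one, whereas a test that looked for a specific helper name or code pattern could fail a correct patch simply for being written differently.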

“The analyser disagreed with the SWE-Bench Pro verifier on 32% of trials and with the DeepSWE verifier on 1.4%,” the report noted, adding that “nearly a third of SWE-Bench Pro’s pass/fail decisions appear incorrect to a careful reader of the same trajectory.”

The benchmark also revealed distinct behavioural patterns among model families. Claude models frequently failed to implement multi-part requirements, whereas GPT-based models showed stronger instruction fidelity and consistency across repeated runs.

“GPT reads the prompt and the visible repository contract literally, and produces a patch that honours both,” the researchers wrote. “The behaviour is consistent across runs.”

The report also found that stronger models increasingly verified their own outputs without explicit prompting. Claude Opus 4.7 and GPT-5.4 wrote new repository-native tests in over 80% of DeepSWE runs.

“We notice that models are a lot less likely to write their own tests on SWE-Bench Pro than on DeepSWE,” the authors stated, attributing the difference to SWE-Bench Pro’s prompt design, which discourages modifying test logic.

Beyond accuracy, the benchmark measured efficiency across runtime, token usage, and cost. GPT-5.5 achieved its 70% score with a median runtime of 20 minutes and approximately 47,000 output tokens per trial, making it one of the most efficient high-performing models tested.

The DeepSWE team acknowledged that the benchmark still has limitations, including restricted language coverage and a focus on open-source repositories with at least 500 GitHub stars. Future versions are expected to include additional languages such as C++ and Java, as well as more bug localisation and refactoring tasks.

Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.