Scale AI Introduces New Benchmark for Assessing AI Coding Agents


Scale AI has introduced SWE-Atlas, a new benchmark designed to evaluate how well AI coding agents perform real-world software engineering tasks inside complex codebases rather than simply generating code snippets.

“SWE-Atlas is a benchmark for evaluating AI coding agents across a spectrum of professional software engineering tasks,” the company said in its announcement.

The benchmark includes three complementary leaderboards: Codebase QnA, Test Writing, and Refactoring. 

Of these, Codebase QnA is the first component released publicly, while the other two evaluations are expected to be introduced later.

Codebase QnA focuses on testing how well AI agents understand large software systems before attempting modifications. 

The dataset contains 124 tasks drawn from 11 production repositories written in Go, Python, C, and TypeScript. Agents are placed inside sandboxed Docker environments containing the repositories and must answer technical questions by exploring the codebase, executing commands, and analysing runtime behaviour.

Scale AI said that these tasks require running the software, tracing execution across multiple files, and synthesising findings.

The benchmark covers a variety of real engineering investigations. Tasks include questions about system architecture, debugging root causes of unusual behaviour, onboarding explanations for engineers new to a codebase, security reasoning, and API integration. 

Prompts are intentionally written in natural language and are underspecified, requiring agents to autonomously explore the system and gather evidence.

Each task includes structured evaluation rubrics written by professional software engineers. During scoring, an automated judge evaluates whether the agent’s answer satisfies each rubric criterion. The primary metric is Task Resolve Rate, defined as the percentage of tasks where the agent’s response satisfies all rubric requirements.
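The scoring scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Scale AI's actual judge code: it assumes each task carries a list of per-criterion pass/fail verdicts from the automated judge, and counts a task as resolved only when every criterion passes.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    criteria_passed: list[bool]  # one judge verdict per rubric criterion

def is_resolved(result: TaskResult) -> bool:
    # A task is resolved only if ALL rubric criteria are satisfied.
    return all(result.criteria_passed)

def task_resolve_rate(results: list[TaskResult]) -> float:
    # Percentage of tasks whose responses satisfy every rubric requirement.
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)

results = [
    TaskResult("qna-001", [True, True, True]),   # all criteria met -> resolved
    TaskResult("qna-002", [True, False, True]),  # one miss -> not resolved
]
print(task_resolve_rate(results))  # 50.0
```

Note the all-or-nothing definition: a response that satisfies most, but not all, criteria contributes nothing to the score, which helps explain why headline resolve rates are low.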

“We observe that even the most frontier models (that report >80% on SWE-Bench) score <30% on the benchmark, highlighting the challenging nature of these tasks and the gap in capability in deeply understanding the codebase,” the researchers wrote.

Results on the benchmark show that even the most advanced coding models struggle to deeply understand codebases.

The top score came from Claude Opus 4.6 running on the Claude Code harness, which achieved a 31.5% task-resolve rate, meaning it fully satisfied all rubric requirements on just under a third of the tasks. 

Close behind were GPT-5.2 (high reasoning setting) and Opus 4.6 on the SWE-Agent Scaffold, both scoring 29.03%. GPT-5.3 Codex followed with 27.4% when run through the Codex CLI environment. Mid-tier results included Claude Sonnet 4.5 at 23.39%, while the open model GLM-5 scored 21.77%. 

Performance dropped sharply for several other models: Gemini 3.1 Pro reached 12.1%, while Gemini 3 Flash scored 8.06% and Qwen3-Coder-480B-A35B recorded the lowest score at 4.84%. 

The study also found that coding agents rely heavily on tool usage—searching files, executing commands, and running programmes—to investigate systems. When these capabilities were removed and agents were limited to only viewing and searching code, performance dropped by roughly 40–45%.

“As LLMs and coding agents take on that work, evaluation must evolve to view them more like junior engineers: by how they investigate a system, gather evidence, and explain what they’re observing,” the company wrote.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
