Scale AI Introduces New Benchmark for Assessing AI Coding Agents


Scale AI has introduced SWE-Atlas, a new benchmark designed to evaluate how well AI coding agents perform real-world software engineering tasks inside complex codebases rather than simply generating code snippets.

“SWE-Atlas is a benchmark for evaluating AI coding agents across a spectrum of professional software engineering tasks,” the company said in its announcement.

The benchmark includes three complementary leaderboards: Codebase QnA, Test Writing, and Refactoring. 

Of these, Codebase QnA is the first component released publicly, while the other two evaluations are expected to be introduced later.

Codebase QnA focuses on testing how well AI agents understand large software systems before attempting modifications. 

The dataset contains 124 tasks drawn from 11 production repositories written in Go, Python, C, and TypeScript. Agents are placed inside sandboxed Docker environments containing the repositories and must answer technical questions by exploring the codebase, executing commands, and analysing runtime behaviour.

Scale AI said that these tasks require running the software, tracing execution across multiple files, and synthesising findings.

The benchmark covers a variety of real engineering investigations. Tasks include questions about system architecture, debugging root causes of unusual behaviour, onboarding explanations for engineers new to a codebase, security reasoning, and API integration. 

Prompts are intentionally written in natural language and are underspecified, requiring agents to autonomously explore the system and gather evidence.

Each task includes structured evaluation rubrics written by professional software engineers. During scoring, an automated judge evaluates whether the agent’s answer satisfies each rubric criterion. The primary metric is Task Resolve Rate, defined as the percentage of tasks where the agent’s response satisfies all rubric requirements.
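The scoring scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Scale AI's actual judge code: it assumes each task carries a list of per-criterion pass/fail verdicts from the automated judge, and counts a task as resolved only when every criterion passes.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    criteria_passed: list[bool]  # one judge verdict per rubric criterion

def is_resolved(result: TaskResult) -> bool:
    # A task is resolved only if ALL rubric criteria are satisfied.
    return all(result.criteria_passed)

def task_resolve_rate(results: list[TaskResult]) -> float:
    # Percentage of tasks whose responses satisfy every rubric requirement.
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)

results = [
    TaskResult("qna-001", [True, True, True]),   # all criteria met -> resolved
    TaskResult("qna-002", [True, False, True]),  # one miss -> not resolved
]
print(task_resolve_rate(results))  # 50.0
```

Note the all-or-nothing definition: a response that satisfies most, but not all, criteria contributes nothing to the score, which helps explain why headline resolve rates are low.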

“We observe that even the most frontier models (that report >80% on SWE-Bench) score <30% on the benchmark, highlighting the challenging nature of these tasks and the gap in capability in deeply understanding the codebase,” the researchers wrote.

Results on the benchmark show that even the most advanced coding models struggle to deeply understand codebases.

The top score came from Claude Opus 4.6 running on the Claude Code harness, which achieved a 31.5% task-resolve rate, meaning it fully satisfied all rubric requirements on just under a third of the tasks. 

Close behind were GPT-5.2 (high reasoning setting) and Opus 4.6 on the SWE-Agent Scaffold, both scoring 29.03%. GPT-5.3 Codex followed with 27.4% when run through the Codex CLI environment. Mid-tier results included Claude Sonnet 4.5 at 23.39%, while the open model GLM-5 scored 21.77%. 

Performance dropped sharply for several other models: Gemini 3.1 Pro reached 12.1%, while Gemini 3 Flash scored 8.06% and Qwen3-Coder-480B-A35B recorded the lowest score at 4.84%. 

The study also found that coding agents rely heavily on tool usage—searching files, executing commands, and running programmes—to investigate systems. When these capabilities were removed and agents were limited to only viewing and searching code, performance dropped by roughly 40–45%.

“As LLMs and coding agents take on that work, evaluation must evolve to view them more like junior engineers: by how they investigate a system, gather evidence, and explain what they’re observing,” the company wrote.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
