OpenAI’s New Benchmark Shows Anthropic’s Claude Outperforms All Models

Earlier studies from OpenAI this year also found that the Anthropic models tested yielded the best results.


ChatGPT maker OpenAI released a new benchmark called GDPval, which evaluates AI models on ‘real-world’ economically valuable tasks.

Unlike most other benchmarks that resemble academic tests, GDPval focuses on practical work covering 44 occupations across the top nine sectors contributing to the GDP of the United States. The complete GDPval set includes 1,320 specialised tasks, 220 of which are in the open-sourced gold set.

These tasks were created by industry experts with an average of 14 years of experience, drawn from companies including Google, Goldman Sachs, and Microsoft.

According to the benchmark results, Anthropic’s Claude Opus 4.1 outperformed all other tested models, including OpenAI’s own GPT-5.

“Claude Opus 4.1 was the best performing model on the GDPval gold subset, excelling in particular on aesthetics (e.g., document formatting, slide layout), while GPT-5 excelled in particular on accuracy (e.g., carefully following instructions, performing correct calculations),” read the study. 

Other models tested in the study included OpenAI’s GPT-4o, o4-mini, o3, Google’s Gemini 2.5 Pro and xAI’s Grok 4. The results of these models were evaluated through blind pairwise comparisons conducted by industry experts. 

“We analyse the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts,” said OpenAI. 

Claude Opus 4.1’s deliverables were graded as better than (wins) or as good as (ties) the human experts’ deliverables 47.6% of the time, while GPT-5 (high) scored 38.8% and o3 (high) scored 34.1%.
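To make the metric concrete, here is a minimal sketch of how a “win or tie” rate like the 47.6% figure could be computed from blind pairwise grades. The grade labels and data below are hypothetical illustrations, not drawn from the GDPval dataset or OpenAI’s grading pipeline.

```python
from collections import Counter

# Each entry is an expert's blind judgment of a model deliverable versus the
# human expert deliverable for one task: "win", "tie", or "loss".
# (Hypothetical toy data for illustration only.)
grades = ["win", "tie", "loss", "win", "tie", "loss", "loss", "win"]

counts = Counter(grades)
win_or_tie_rate = (counts["win"] + counts["tie"]) / len(grades)
print(f"Win-or-tie rate: {win_or_tie_rate:.1%}")  # 62.5% for this toy data
```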

Each task includes a request with reference files and requires producing a deliverable that mirrors real work products. 

The benchmark encompasses diverse file formats, including CAD design files, spreadsheets, slide decks, videos, and customer support conversations, with tasks requiring parsing through up to 38 reference files.

The study also analysed the particular strengths and weaknesses of each model.

“Claude, Grok, and Gemini most often lost due to instruction-following failures, while GPT-5 (high) lost mainly from formatting errors and had the fewest instruction-following issues,” read the study. 

“Gemini and Grok frequently promised but failed to provide deliverables, ignored reference data, or used the wrong format. GPT-5 and Grok showed the fewest accuracy errors, though all models sometimes hallucinated data or miscalculated.” 

For further evaluations and details, see OpenAI’s technical report on GDPval.

Earlier studies from OpenAI this year, namely PaperBench and SWE-Lancer, also found that the Anthropic models tested yielded the best results. “Kind of wild that OpenAI released a benchmark that Claude beats them in. Respect,” said Theo Browne, the founder and CEO of T3 Chat, in a post on X.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
