OpenAI’s New Benchmark Shows Anthropic’s Claude Outperforms All Models

Earlier studies from OpenAI this year also found that the Anthropic models tested yielded the best results.


ChatGPT maker OpenAI released a new benchmark called GDPval, which evaluates AI models on ‘real-world’ economically valuable tasks.

Unlike most other benchmarks that resemble academic tests, GDPval focuses on practical work covering 44 occupations across the top nine sectors contributing to the GDP of the United States. The complete GDPval set includes 1,320 specialised tasks, 220 of which are in the open-sourced gold set.

These tasks were created by industry experts with an average of 14 years of experience, drawn from companies including Google, Goldman Sachs, and Microsoft.

According to the benchmark results, Anthropic’s Claude Opus 4.1 outperformed all other tested models, including OpenAI’s own GPT-5.

“Claude Opus 4.1 was the best performing model on the GDPval gold subset, excelling in particular on aesthetics (e.g., document formatting, slide layout), while GPT-5 excelled in particular on accuracy (e.g., carefully following instructions, performing correct calculations),” read the study. 

Other models tested in the study included OpenAI’s GPT-4o, o4-mini, o3, Google’s Gemini 2.5 Pro and xAI’s Grok 4. The results of these models were evaluated through blind pairwise comparisons conducted by industry experts. 

“We analyse the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts,” said OpenAI. 

Claude Opus 4.1’s deliverables were graded as better than (wins) or as good as (ties) the human experts’ deliverables 47.6% of the time, while GPT-5 (high) scored 38.8% and o3 (high) scored 34.1%.
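To make the metric concrete, here is a minimal sketch of how a “win or tie” rate like the 47.6% figure could be computed from blind pairwise grades. The grade labels and data below are hypothetical illustrations, not drawn from the GDPval dataset or OpenAI’s grading pipeline.

```python
from collections import Counter

# Each entry is an expert's blind judgment of a model deliverable versus the
# human expert deliverable for one task: "win", "tie", or "loss".
# (Hypothetical toy data for illustration only.)
grades = ["win", "tie", "loss", "win", "tie", "loss", "loss", "win"]

counts = Counter(grades)
win_or_tie_rate = (counts["win"] + counts["tie"]) / len(grades)
print(f"Win-or-tie rate: {win_or_tie_rate:.1%}")  # 62.5% for this toy data
```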

Each task includes a request with reference files and requires producing a deliverable that mirrors real work products. 

The benchmark encompasses diverse file formats, including CAD design files, spreadsheets, slide decks, videos, and customer support conversations, with tasks requiring parsing through up to 38 reference files.

The study also analysed the particular strengths and weaknesses of each model.

“Claude, Grok, and Gemini most often lost due to instruction-following failures, while GPT-5 (high) lost mainly from formatting errors and had the fewest instruction-following issues,” read the study. 

“Gemini and Grok frequently promised but failed to provide deliverables, ignored reference data, or used the wrong format. GPT-5 and Grok showed the fewest accuracy errors, though all models sometimes hallucinated data or miscalculated.” 

For further evaluations and details, see OpenAI’s technical report on GDPval.

Earlier studies from OpenAI this year, namely PaperBench and SWE-Lancer, also found that the Anthropic models tested yielded the best results. “Kind of wild that OpenAI released a benchmark that Claude beats them in. Respect,” said Theo Browne, the founder and CEO of T3 Chat, in a post on X.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
