Databricks Launches OfficeQA, New Benchmark for Testing AI on Core Enterprise Tasks

Databricks tested several frontier agents, including a GPT-5.1 agent using OpenAI’s File Search and Retrieval API.


Databricks has introduced OfficeQA, a new benchmark designed to assess whether AI agents can handle the grounded, document-heavy reasoning that dominates real enterprise work. 

Databricks argues that existing stress tests such as GDPval, ARC-AGI-2 and Humanity’s Last Exam do not reflect “the kinds of tasks that are important to our customers.”

OfficeQA is meant to fill that gap by evaluating how well AI systems retrieve, parse and reason over sprawling, messy, real-world corpora.

The benchmark is built from the US Treasury Bulletins spanning more than eight decades, a corpus of roughly 89,000 pages of scanned tables, charts and narrative updates about federal finances.

The Mosaic Research team at Databricks describes it as a proxy for “economically valuable tasks performed by Databricks’ enterprise customers,” where accuracy is unforgiving and even “being off by one on a product or invoice number can have catastrophic downstream results.”

OfficeQA contains 246 questions across easy and hard tiers, each requiring information retrieval across multiple documents and grounded analytical reasoning. 

Example questions include retrieving the total U.S. national defense expenditures for the 1940 calendar year, running a linear regression to predict the Department of Agriculture’s 1999 outlays using data from 1990–1998, or interpreting visuals such as counting the number of local maxima on a line plot from the September 1990 Treasury Bulletin.
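The regression-style questions, in particular, amount to a simple least-squares extrapolation once the underlying figures have been extracted from the Bulletins. The sketch below illustrates the calculation; the outlay values are placeholders, not figures from the Treasury data.

```python
# Sketch of the regression step for a question like "predict the Department of
# Agriculture's 1999 outlays from its 1990-1998 outlays". The outlay values
# below are hypothetical placeholders, not real Treasury Bulletin figures.
import numpy as np

years = np.arange(1990, 1999)                  # 1990 through 1998
outlays = np.array([46.0, 48.1, 49.5, 51.2, 52.8,
                    54.0, 55.7, 57.3, 58.9])   # hypothetical values

# Ordinary least-squares linear fit: outlay = slope * year + intercept
slope, intercept = np.polyfit(years, outlays, deg=1)
prediction_1999 = slope * 1999 + intercept
print(f"Predicted 1999 outlays: {prediction_1999:.1f} (hypothetical units)")
```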

Human evaluators needed an average of 50 minutes per question, most of it spent locating data buried across decades of publications. 

Databricks filtered out any item that could be answered with an LLM’s memorised knowledge or through a simple web search, ensuring that “questions require document-grounded retrieval.”

Databricks tested several frontier agents, including a GPT-5.1 agent using OpenAI’s File Search and Retrieval API and a Claude Opus 4.5 agent built with Anthropic’s SDK. Performance was weak when models were asked to work directly from PDFs. 
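Databricks has not published its agent harness, but a minimal file-search setup against OpenAI’s API might look like the sketch below. The vector store ID and question are placeholders, and the model name is taken from the article; the sketch assumes the Treasury Bulletin PDFs have already been uploaded to a vector store.

```python
# Minimal sketch of a file-search query with the OpenAI Python SDK.
# The vector store ID and question are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.1",  # model name as reported in the article
    input="What were total US national defense expenditures in calendar year 1940?",
    tools=[{"type": "file_search",
            "vector_store_ids": ["vs_treasury_bulletins"]}],  # hypothetical ID
)
print(response.output_text)
```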

Without access to the corpus, models answered about 2% of questions correctly. When given only PDFs, accuracy rose but remained below 45%. 

Anthropic’s Claude Opus 4.5 agent solved 37.4% of the full dataset, while OpenAI’s GPT-5.1 agent scored 43.1%. On OfficeQA-Hard, however, a subset of 113 hard examples, the Claude Opus 4.5 agent scored 21.1% and the GPT-5.1 agent 24.8%.

“Despite frontier models performing well on Olympiad-style questions, we find they still struggle on these economically important tasks,” said Databricks. 

Even when the models were given access to the exact document slices containing the answer, raw PDF interpretation caused large errors.

Significant gains emerged only after preprocessing the corpus with Databricks’ own parsing system. “When these same pages are preprocessed using Databricks ai_parse_document, performance jumps significantly,” the researchers write, citing a 32.4-point gain for GPT-5.1. 
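Databricks does not show the preprocessing code, but inside a Databricks notebook (where a `spark` session is already available) the parsing step can be expressed roughly as below. The volume path and table name are assumptions, and the exact structure of the parsed output depends on the current version of the function.

```python
# Rough sketch of preprocessing scanned PDFs with Databricks' ai_parse_document
# SQL function, called from PySpark. The volume path and output table name are
# hypothetical placeholders.
df = (
    spark.read.format("binaryFile")
    .load("/Volumes/main/officeqa/treasury_bulletins/")  # hypothetical path
)

parsed = df.selectExpr(
    "path",
    "ai_parse_document(content) AS parsed_document",  # structured text/tables
)
parsed.write.mode("overwrite").saveAsTable("officeqa_parsed_pages")  # hypothetical table
```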

One visual task — counting local maxima on a 1990 Treasury plot — was not solved by any AI agent.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
