AI Infra Startup Baseten Announces GLM-5.2’s Fastest API Yet

Baseten, a San Francisco-based AI infrastructure startup, said it has built what it describes as the fastest API implementation of Z.ai’s open-source frontier model, GLM-5.2, delivering more than 280 tokens per second, according to measurements published by Artificial Analysis.

The company said the performance comes from a series of inference optimisations spanning model quantisation, cache management, routing, speculative decoding and infrastructure design.

The announcement was detailed in a technical post on X by Philip Kiely, Baseten’s Head of AI Education.

GLM-5.2 is Z.ai’s 744-billion-parameter mixture-of-experts model with 40 billion active parameters, support for up to a one-million-token context window, and an MIT licence.

The model, developed by China-based Z.ai, has drawn attention for benchmark results that place it alongside leading proprietary systems while offering significantly lower inference costs.

According to Baseten, the model is currently served through its model APIs and dedicated deployments, with customers including Notion.

The company said GLM-5.2 can deliver performance comparable to leading frontier models while costing 70–80% less per token.

A key part of the deployment is an in-house quantisation process that converts the model’s original FP8 weights into NVIDIA’s NVFP4 format for execution on Blackwell GPUs.

Baseten said testing showed the quantised model maintained comparable performance on agent-focused evaluations such as the Berkeley Function Calling Leaderboard (BFCL) while improving both throughput and latency.
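To illustrate the general idea behind block-scaled 4-bit quantisation, here is a minimal sketch in Python. It is not Baseten's pipeline: it assumes blocks of 16 values and the eight non-negative magnitudes representable in a 4-bit E2M1 float, and it keeps each block's scale as an ordinary Python float, whereas a real NVFP4 implementation stores scales in reduced precision. The names `quantise_block` and `dequantise_block` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantise_block(block):
    """Map one block of floats to (scale, codes) on the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Choose the scale so the largest magnitude lands on the top grid value, 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantise_block(scale, codes):
    """Recover approximate floats from a scale and its 4-bit codes."""
    return [scale * c for c in codes]


weights = [i / 10 - 0.8 for i in range(BLOCK_SIZE)]
scale, codes = quantise_block(weights)
restored = dequantise_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Because every block carries its own scale, a single outlier only degrades precision for the 16 values around it, which is one reason block-scaled formats can preserve model quality at 4 bits.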

The company also implemented cache-aware routing using NVIDIA’s Dynamo toolkit. Requests are directed to inference replicas that already contain relevant key-value cache data, reducing repeated prefill computation and lowering time-to-first-token latency.
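The routing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Dynamo API: `Replica`, `block_hashes` and `route` are hypothetical names, the block size of four tokens is chosen only to keep the example small, and the router simply picks the replica with the longest cached prefix, breaking ties by load.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (illustrative; real systems use larger blocks)


def block_hashes(tokens):
    """Hash each full block chained with its predecessor, so that a matching
    hash implies (up to hash collisions) a matching prefix."""
    hashes, prev = [], 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    name: str
    cached: set = field(default_factory=set)  # block hashes held in KV cache
    active_requests: int = 0


def cached_prefix_blocks(replica, hashes):
    """Count how many leading blocks of the prompt this replica already holds."""
    n = 0
    for h in hashes:
        if h not in replica.cached:
            break
        n += 1
    return n


def route(replicas, tokens):
    """Pick the replica with the longest cached prefix; prefer the less busy
    replica when the overlap is equal."""
    hashes = block_hashes(tokens)
    best = max(
        replicas,
        key=lambda r: (cached_prefix_blocks(r, hashes), -r.active_requests),
    )
    best.cached.update(hashes)  # the chosen replica now caches this prompt
    best.active_requests += 1
    return best
```

A request that shares a long system prompt or conversation history with an earlier one lands on the same replica, so only the new suffix needs prefill. A production router would also account for cache eviction and would decrement the load counter when a request finishes, both of which this sketch omits.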

Baseten reported an average time-to-first-token performance of roughly 800 milliseconds for GLM-5.2 workloads, as measured by Artificial Analysis.
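As a rough guide to what these figures mean for a single request, the short Python sketch below combines the two reported numbers. It assumes, as a simplification, that response time is the time to first token plus the output length divided by a steady decoding speed; the function name and default values are illustrative and actual latency will vary with prompt length and load.

```python
def estimate_response_seconds(output_tokens, ttft_seconds=0.8, tokens_per_second=280.0):
    """Approximate end-to-end time: wait for the first token, then stream the rest."""
    return ttft_seconds + output_tokens / tokens_per_second


# A 1,000-token answer at the reported figures takes roughly 4.4 seconds.
estimate = estimate_response_seconds(1000)
```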

Another major optimisation involves separating the prefill and decode stages of inference. Rather than processing both workloads on the same GPU node, Baseten runs them on dedicated infrastructure pools.

The company said this “prefill-decode disaggregation” doubled tokens-per-second performance compared with conventional aggregated deployments in internal benchmarks.

Baseten further boosted throughput by supporting GLM-5.2’s Multi-Token Prediction layers, a speculative decoding technique that generates multiple draft tokens in a single forward pass before verification.

The company said the approach provides a lossless speed improvement and that additional optimisation opportunities remain.
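Why speculative decoding can be lossless is easiest to see in a toy example. The Python sketch below is not GLM-5.2's Multi-Token Prediction implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheaper draft predictor, and decoding is assumed to be greedy. Drafted tokens are kept only while they match what the target would have produced, so the output is identical to ordinary decoding.

```python
def target_next(context):
    """Toy 'target model': a deterministic next token for a given context."""
    return (sum(context) * 31 + len(context)) % 7


def draft_next(context):
    """Toy 'draft predictor': agrees with the target except when the context
    length is a multiple of 3, where it deliberately guesses wrong."""
    guess = target_next(context)
    return (guess + 1) % 7 if len(context) % 3 == 0 else guess


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, then verify them against the target in one 'pass'."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # A real system scores all k positions in a single target forward
        # pass; this toy simulates that with per-position calls.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the target's next token comes for free.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], target_passes


tokens, passes = speculative_decode([1, 2], 20)
```

Here `tokens` equals the output of `greedy_decode([1, 2], 20)`, while `passes` is smaller than 20 because each verification step yields at least one token and usually several. The speed-up in practice depends on how often the draft tokens are accepted.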

GLM-5.2 has generated significant attention across the AI industry since its launch earlier this month, with developers and startup founders highlighting its performance on coding and agentic software engineering tasks.

According to Z.ai, the open-weight model matches or exceeds leading proprietary systems on several long-horizon coding benchmarks, edging out OpenAI’s GPT-5.5 on FrontierSWE while trailing Anthropic’s Claude Opus 4.8 by roughly one percentage point.

“In practice, GLM-5.2 meets or exceeds the capabilities suggested by its benchmarks. It’s a genuinely great model for writing code, operating agents, and other frontier language model tasks,” wrote Kiely.

Staff Writer

The AI & Data Insider team works with a staff of in-house writers and industry experts.
