Baseten, a San Francisco-based AI infrastructure startup, said it has built what it describes as the fastest API implementation of Z.ai’s open-source frontier model, GLM-5.2, delivering more than 280 tokens per second, according to measurements published by Artificial Analysis.
The company said the performance comes from a series of inference optimisations spanning model quantisation, cache management, routing, speculative decoding and infrastructure design.
The announcement was detailed in a technical post on X by Philip Kiely, Baseten’s Head of AI Education.
GLM-5.2 is Z.ai’s 744-billion-parameter mixture-of-experts model with 40 billion active parameters, support for up to a one-million-token context window, and an MIT licence.
The model, from China-based Z.ai, has drawn attention for benchmark results that place it alongside leading proprietary systems while offering significantly lower inference costs.
According to Baseten, the model is currently served through its model APIs and dedicated deployments, with customers including Notion.
The company said GLM-5.2 can deliver performance comparable to leading frontier models while costing 70–80% less per token.
A key part of the deployment is an in-house quantisation process that converts the model’s original FP8 weights into NVIDIA’s NVFP4 format for execution on Blackwell GPUs.
Baseten said testing showed the quantised model maintained comparable performance on agent-focused evaluations such as the Berkeley Function Calling Leaderboard (BFCL) while improving both throughput and latency.
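To give a sense of what 4-bit block quantisation involves, the sketch below rounds weights onto the eight magnitudes a 4-bit E2M1 float can represent, with one shared scale per block. It is a simplified illustration, not Baseten's pipeline: the block size of 16 is an assumption, the scale is kept as an ordinary Python float rather than an 8-bit value, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantise_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude, then restore the sign.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantise_block(scale, codes):
    """Recover approximate weights from one quantised block."""
    return [scale * c for c in codes]


def quantise(weights, block_size=BLOCK_SIZE):
    """Split a flat weight list into blocks and quantise each independently."""
    return [quantise_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]


def dequantise(blocks):
    """Concatenate the dequantised blocks back into a flat list."""
    out = []
    for scale, codes in blocks:
        out.extend(dequantise_block(scale, codes))
    return out
```

Because the widest gap on the grid is between 4 and 6, the rounding error for any weight in this sketch is at most one sixth of the largest magnitude in its block, which is why per-block scaling matters for accuracy.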
The company also implemented cache-aware routing using NVIDIA’s Dynamo toolkit. Requests are directed to inference replicas that already contain relevant key-value cache data, reducing repeated prefill computation and lowering time-to-first-token latency.
Baseten reported an average time-to-first-token of roughly 800 milliseconds for GLM-5.2 workloads, as measured by Artificial Analysis.
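The routing idea can be sketched in a few lines. The example below does not use NVIDIA Dynamo's API; `Replica`, `Router` and the block size of four tokens are hypothetical names and values chosen for illustration. Each request is hashed block by block, and the router picks the replica that already holds the longest matching prefix, breaking ties by current load.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (hypothetical value)


def block_hashes(tokens, block=BLOCK):
    """Hash each full block together with everything before it."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % block
    for i in range(0, full, block):
        prev = hash((prev, tuple(tokens[i:i + block])))
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    name: str
    cached: set = field(default_factory=set)  # block hashes held in KV cache
    active: int = 0                           # requests routed here so far

    def overlap(self, hashes):
        """Count leading blocks of the request already cached here."""
        n = 0
        for h in hashes:
            if h not in self.cached:
                break
            n += 1
        return n


class Router:
    def __init__(self, replicas):
        self.replicas = replicas

    def route(self, tokens):
        """Return (replica name, number of prompt tokens that skip prefill)."""
        hashes = block_hashes(tokens)
        # Prefer the longest cached prefix; break ties by lowest load.
        best = max(self.replicas, key=lambda r: (r.overlap(hashes), -r.active))
        reused = best.overlap(hashes) * BLOCK
        best.cached.update(hashes)
        best.active += 1  # a real router would decrement this on completion
        return best.name, reused
```

In this sketch, two requests that share an eight-token system prompt land on the same replica, and the second one skips prefill for those eight tokens. A request with no cached prefix falls through to the least-loaded replica.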
Another major optimisation involves separating the prefill and decode stages of inference. Rather than processing both workloads on the same GPU node, Baseten runs them on dedicated infrastructure pools.
The company said this “prefill-decode disaggregation” doubled tokens-per-second performance compared with conventional aggregated deployments in internal benchmarks.
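The split can be pictured as two worker pools joined by a cache handoff. The toy sketch below is not Baseten's architecture: the class names are hypothetical, the "model" is a simple arithmetic stand-in, and the network transfer of the key-value cache is reduced to passing a Python object. It only shows the control flow, in which a prefill worker processes the prompt once and a separate decode worker generates tokens from the resulting cache.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVCache:
    tokens: list  # stand-in for per-token key/value tensors


class PrefillWorker:
    def __init__(self, name):
        self.name, self.prompts_processed = name, 0

    def prefill(self, prompt):
        # Compute-bound phase: the whole prompt is processed in one pass.
        self.prompts_processed += 1
        return KVCache(tokens=list(prompt))


class DecodeWorker:
    def __init__(self, name):
        self.name, self.steps = name, 0

    def decode(self, cache, max_new_tokens):
        # Memory-bound phase: one token per step, each extending the cache.
        out = []
        for _ in range(max_new_tokens):
            nxt = (sum(cache.tokens) + len(cache.tokens)) % 1000  # toy "model"
            cache.tokens.append(nxt)
            out.append(nxt)
            self.steps += 1
        return out


class DisaggregatedServer:
    def __init__(self, n_prefill, n_decode):
        self.prefill_pool = [PrefillWorker(f"prefill-{i}") for i in range(n_prefill)]
        self.decode_pool = [DecodeWorker(f"decode-{i}") for i in range(n_decode)]
        self._next_prefill = cycle(self.prefill_pool)  # round-robin scheduling
        self._next_decode = cycle(self.decode_pool)

    def generate(self, prompt, max_new_tokens):
        cache = next(self._next_prefill).prefill(prompt)
        # In a real deployment the cache would be sent across the network here.
        return next(self._next_decode).decode(cache, max_new_tokens)
```

Keeping the pools separate means each can be sized independently, for example more decode workers for workloads with short prompts and long outputs.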
Baseten further boosted throughput by supporting GLM-5.2’s Multi-Token Prediction layers, a speculative decoding technique that generates multiple draft tokens in a single forward pass before verification.
The company said the approach provides a lossless speed improvement and that additional optimisation opportunities remain.
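The "lossless" claim follows from how draft-and-verify decoding works under greedy sampling: the main model checks every drafted token and replaces the first one it disagrees with, so the output is identical to what it would have produced alone. The sketch below illustrates that loop with toy functions. `toy_target` and `toy_draft` are hypothetical stand-ins (the draft plays the role of the Multi-Token Prediction layers), and the verification loop calls the target per token, whereas a real system would score all drafted positions in a single forward pass.

```python
def toy_target(seq):
    """Stand-in for the full model: deterministic next token."""
    return (3 * seq[-1] + 1) % 17


def toy_draft(seq):
    """Stand-in for a cheap drafter: usually right, wrong on some inputs."""
    guess = toy_target(seq)
    return (guess + 1) % 17 if seq[-1] % 5 == 0 else guess


def greedy_decode(target, prompt, n):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target, draft, prompt, n, k=3):
    """Return (tokens, verification passes) using draft-and-verify decoding."""
    seq = list(prompt)
    goal = len(prompt) + n
    passes = 0
    while len(seq) < goal:
        # 1. Draft k tokens cheaply.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Verify the proposal against the target model.
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the pass yields one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):goal], passes
```

Each verification pass yields at least one token and up to `k + 1`, so the speed-up depends on how often the drafter agrees with the main model.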
GLM-5.2 has generated significant attention across the AI industry since its launch earlier this month, with developers and startup founders highlighting its performance on coding and agentic software engineering tasks.
According to Z.ai, the open-weight model matches or exceeds leading proprietary systems on several long-horizon coding benchmarks, edging out OpenAI’s GPT-5.5 on FrontierSWE while trailing Anthropic’s Claude Opus 4.8 by roughly one percentage point.
“In practice, GLM-5.2 meets or exceeds the capabilities suggested by its benchmarks. It’s a genuinely great model for writing code, operating agents, and other frontier language model tasks,” wrote Kiely.