Canadian e-commerce giant Shopify has replaced a closed-source frontier model with a fine-tuned version of Alibaba’s Qwen3-32B to power its Shopify Flow store automations.
The custom tool-calling agent, which serves the Sidekick commerce assistant, operates 2.2x faster and costs 68% less than the previous proprietary system, Shopify said.
Shopify Flow is an automation platform that allows merchants to build “if-this-then-that” workflows, such as tagging high-value orders, without writing code.
Sidekick is the platform’s conversational AI assistant designed to execute these tasks through natural language. Previously, Sidekick relied on external “frontier” models to interpret a merchant’s plain-English request and turn it into a functional automation.
To build the new in-house system, Shopify reverse-engineered training data from thousands of existing user workflows.
A key technical hurdle was Flow’s native data format, a complex JSON-based domain-specific language that AI models often struggle to process accurately. The team solved this by translating the data into Python, a language the model already understood from its pretraining phase.
“Moving the target format from out-of-distribution to in-distribution turned the problem from ‘learn a new language and the task’ into ‘learn the task’,” stated Shopify’s engineering team.
This shift improved the model's ability to write valid code by 22 points. The approach draws on prior work such as SPEAC and WorkflowLLM, but distinguishes itself through a full round-trip transpiler that guarantees the model's internal Python representation converts losslessly back to Shopify's production JSON.
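The mechanism can be sketched in miniature. Everything in this example is invented for illustration — the field names (`trigger`, `condition`, `action`), the `@on` decorator, and the `add_order_tag` call are assumptions, since Shopify's actual Flow DSL and internal Python format are not public — but it shows the two pieces the article describes: translating the JSON DSL into readable Python, and a round-trip check that the Python converts back to the original JSON without loss.

```python
import re

def to_python(wf: dict) -> str:
    """Render a (hypothetical) Flow-style JSON workflow as Python-like code,
    the in-distribution target format the model is trained to emit."""
    c, a = wf["condition"], wf["action"]
    return (f"@on('{wf['trigger']}')\n"
            f"if order.{c['field']} {c['op']} {c['value']}: {a['type']}('{a['tag']}')")

def from_python(src: str) -> dict:
    """Transpile the Python representation back into the JSON DSL."""
    m = re.search(r"@on\('(\w+)'\)\nif order\.(\w+) ([<>=!]+) (\d+): (\w+)\('([\w-]+)'\)", src)
    trigger, field, op, value, action, tag = m.groups()
    return {"trigger": trigger,
            "condition": {"field": field, "op": op, "value": int(value)},
            "action": {"type": action, "tag": tag}}

# Example workflow: tag orders over $500 as high-value.
wf = {"trigger": "order_created",
      "condition": {"field": "total_price", "op": ">", "value": 500},
      "action": {"type": "add_order_tag", "tag": "high-value"}}

# The round-trip property: JSON -> Python -> JSON reproduces the original exactly,
# so anything the model emits in Python can be deployed as production JSON.
assert from_python(to_python(wf)) == wf
```

A real transpiler would walk a full grammar rather than a regex, but the invariant is the same: lossless conversion in both directions is what lets the model work entirely in a language it already knows.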
While the model performed well in controlled tests, initial deployment to 1% of traffic revealed a 35% lower workflow activation rate. Precise data logging showed that 25% of failures involved email-specific workflows and 16% involved complex condition patterns.
Furthermore, merchants frequently requested to edit existing flows, a capability entirely missing from the initial synthetic training data.
To close this gap, Shopify implemented a "weekly retraining flywheel." The company noted that while activation rate is a helpful guardrail, it is a "noisy" metric that reflects merchant behaviour rather than pure model quality. Instead, it optimised against an LLM judge, calibrated by domain experts, that grades the accuracy of generated workflows.
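An LLM-judge evaluation of this kind typically looks something like the following sketch. The rubric, prompt wording, and score format here are all assumptions (Shopify has not published its judge), and the call to the judge model is left as an injected function so the loop stays self-contained.

```python
import re

# Hypothetical judge prompt; the 1-5 rubric and "Score: N" format are illustrative.
JUDGE_PROMPT = """You are a Shopify Flow domain expert. Grade the generated
workflow against the merchant request on a 1-5 scale.

Request: {request}
Generated workflow: {workflow}

Reply with a line of the form "Score: N" followed by a short justification."""

def parse_score(judge_reply: str) -> int:
    """Extract the numeric grade from the judge's free-text reply."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if not m:
        raise ValueError("judge reply had no parsable score")
    return int(m.group(1))

def evaluate(samples, call_judge):
    """Average judge score over (request, workflow) pairs. call_judge is any
    function that sends a prompt string to the judge model and returns its reply."""
    scores = [parse_score(call_judge(JUDGE_PROMPT.format(request=r, workflow=w)))
              for r, w in samples]
    return sum(scores) / len(scores)

# Stubbed judge for demonstration: a real deployment would call the judge model.
avg = evaluate([("tag orders over $500 as high-value", "<generated workflow>")],
               lambda prompt: "Score: 4\nCondition and action match the request.")
```

The appeal of this setup over a behavioural metric like activation rate is that every score is attributable to a specific request/output pair, which makes weekly regressions diagnosable.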
The training process is powered by Tangle, Shopify’s open-source ML experimentation platform. The current pipeline runs on two nodes of H200 GPUs using Fully Sharded Data Parallel, allowing a full training run to complete in just 12 hours.
This infrastructure enabled the team to maintain a high iteration velocity, performing multiple experimental runs between weekly production updates.
By moving away from “rented ground,” Shopify aims to build lasting differentiation through its unique merchant interaction data. “The version running right now is already worse than the one retraining behind it,” the company stated, confirming plans to expand this fine-tuning recipe to other Sidekick skills across its ecosystem.
