Canadian e-commerce giant Shopify has replaced a closed-source frontier model with a fine-tuned version of Alibaba’s Qwen3-32B to power its Shopify Flow store automations.
The custom tool-calling agent, which serves the Sidekick commerce assistant, operates 2.2x faster and costs 68% less than the previous proprietary system, Shopify said.
Shopify Flow is an automation platform that allows merchants to build “if-this-then-that” workflows, such as tagging high-value orders, without writing code.
Sidekick is the platform’s conversational AI assistant designed to execute these tasks through natural language. Previously, Sidekick relied on external “frontier” models to interpret a merchant’s plain-English request and turn it into a functional automation.
To build the new in-house system, Shopify reverse-engineered training data from thousands of existing user workflows.
A key technical hurdle was Flow’s native data format, a complex JSON-based domain-specific language that AI models often struggle to process accurately. The team solved this by translating the data into Python, a language the model already understood from its pretraining phase.
“Moving the target format from out-of-distribution to in-distribution turned the problem from ‘learn a new language and the task’ into ‘learn the task’,” stated Shopify’s engineering team.
This shift improved the model's ability to write valid code by 22 points. The approach draws on prior work such as SPEAC and WorkflowLLM, but distinguishes itself through a full round-trip transpiler that guarantees the model's internal Python representation converts losslessly back to Shopify's production JSON.
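The mechanism can be sketched in miniature. Everything in this example is invented for illustration — the field names (`trigger`, `condition`, `action`), the `@on` decorator, and the `add_order_tag` call are assumptions, since Shopify's actual Flow DSL and internal Python format are not public — but it shows the two pieces the article describes: translating the JSON DSL into readable Python, and a round-trip check that the Python converts back to the original JSON without loss.

```python
import re

def to_python(wf: dict) -> str:
    """Render a (hypothetical) Flow-style JSON workflow as Python-like code,
    the in-distribution target format the model is trained to emit."""
    c, a = wf["condition"], wf["action"]
    return (f"@on('{wf['trigger']}')\n"
            f"if order.{c['field']} {c['op']} {c['value']}: {a['type']}('{a['tag']}')")

def from_python(src: str) -> dict:
    """Transpile the Python representation back into the JSON DSL."""
    m = re.search(r"@on\('(\w+)'\)\nif order\.(\w+) ([<>=!]+) (\d+): (\w+)\('([\w-]+)'\)", src)
    trigger, field, op, value, action, tag = m.groups()
    return {"trigger": trigger,
            "condition": {"field": field, "op": op, "value": int(value)},
            "action": {"type": action, "tag": tag}}

# Example workflow: tag orders over $500 as high-value.
wf = {"trigger": "order_created",
      "condition": {"field": "total_price", "op": ">", "value": 500},
      "action": {"type": "add_order_tag", "tag": "high-value"}}

# The round-trip property: JSON -> Python -> JSON reproduces the original exactly,
# so anything the model emits in Python can be deployed as production JSON.
assert from_python(to_python(wf)) == wf
```

A real transpiler would walk a full grammar rather than a regex, but the invariant is the same: lossless conversion in both directions is what lets the model work entirely in a language it already knows.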
While the model performed well in controlled tests, initial deployment to 1% of traffic revealed a 35% lower workflow activation rate. Precise data logging showed that 25% of failures involved email-specific workflows and 16% involved complex condition patterns.
Furthermore, merchants frequently requested to edit existing flows, a capability entirely missing from the initial synthetic training data.
To close this gap, Shopify implemented a "weekly retraining flywheel." The company noted that while activation rate is a helpful guardrail, it is a "noisy" metric that reflects merchant behaviour rather than pure model quality. Instead, it optimised against an LLM judge, calibrated by domain experts, that grades the accuracy of generated workflows.
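An LLM-judge evaluation of this kind typically looks something like the following sketch. The rubric, prompt wording, and score format here are all assumptions (Shopify has not published its judge), and the call to the judge model is left as an injected function so the loop stays self-contained.

```python
import re

# Hypothetical judge prompt; the 1-5 rubric and "Score: N" format are illustrative.
JUDGE_PROMPT = """You are a Shopify Flow domain expert. Grade the generated
workflow against the merchant request on a 1-5 scale.

Request: {request}
Generated workflow: {workflow}

Reply with a line of the form "Score: N" followed by a short justification."""

def parse_score(judge_reply: str) -> int:
    """Extract the numeric grade from the judge's free-text reply."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if not m:
        raise ValueError("judge reply had no parsable score")
    return int(m.group(1))

def evaluate(samples, call_judge):
    """Average judge score over (request, workflow) pairs. call_judge is any
    function that sends a prompt string to the judge model and returns its reply."""
    scores = [parse_score(call_judge(JUDGE_PROMPT.format(request=r, workflow=w)))
              for r, w in samples]
    return sum(scores) / len(scores)

# Stubbed judge for demonstration: a real deployment would call the judge model.
avg = evaluate([("tag orders over $500 as high-value", "<generated workflow>")],
               lambda prompt: "Score: 4\nCondition and action match the request.")
```

The appeal of this setup over a behavioural metric like activation rate is that every score is attributable to a specific request/output pair, which makes weekly regressions diagnosable.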
The training process is powered by Tangle, Shopify’s open-source ML experimentation platform. The current pipeline runs on two nodes of H200 GPUs using Fully Sharded Data Parallel, allowing a full training run to complete in just 12 hours.
This infrastructure enabled the team to maintain a high iteration velocity, performing multiple experimental runs between weekly production updates.
By moving away from “rented ground,” Shopify aims to build lasting differentiation through its unique merchant interaction data. “The version running right now is already worse than the one retraining behind it,” the company stated, confirming plans to expand this fine-tuning recipe to other Sidekick skills across its ecosystem.
