For most practitioners, synthetic data still means anonymising customer records to skirt GDPR and HIPAA. SandboxAQ’s VP of Engineering Stefan Leichenauer draws a sharper line: when modelling the physical world, “privacy is obviously a non-issue,” and the real value is scaling datasets beyond what labs can supply, anchored in physics equations we know how to solve.
SandboxAQ’s SAIR dataset exemplifies this shift. By fusing public assay data with 5.2 million simulated 3D protein–ligand complexes via Large Quantitative Models on NVIDIA DGX Cloud, SAIR created the scale needed to train tools like the latest version of DiffDock, enabling capabilities that “simply would not exist” without synthetic foundations. Leichenauer stresses this is no shortcut: “Organic data is necessary for validation. It’s how you know you’ve done the right thing.”
Yet enterprise AI lacks physics’ ground truth, amplifying risks in areas like financial modelling where synthetic scenarios carry inherent uncertainty. Leichenauer’s playbook demands strict held-out validation sets, data quality metrics, and one non-negotiable red line: synthetic data must never become the final judge of success.
In this exclusive Q&A, Leichenauer reveals why physics-grounded synthetic data is powering the next wave of scientific AI — and draws hard lines on where enterprises must always defer to reality.
SandboxAQ recently released SAIR, a public repository of around 5.2 million synthetic 3D protein–ligand structures with associated binding data. How does that project illustrate the strengths and weaknesses of synthetic data compared to experimental ‘organic’ measurements?
SandboxAQ is busy creating and utilizing a variety of synthetic datasets. SAIR, for protein–ligand structures, is one example; another is AQCat25, for catalysts. These datasets are critical to building accurate computational models of the physical world. For instance, NVIDIA recently incorporated SAIR into the training of the latest version of DiffDock, which is an AI model that predicts binding of ligands and proteins. Such models are invaluable in domains such as drug discovery, and we need scaled datasets like SAIR in order to make them. It’s not practical to achieve that level of scale with ‘organic’ data alone; synthetic data is a necessary part of the process.
Of course, the synthetic data and trained models need to be validated with experimental measurements. The real world is the ultimate judge. But what we’re seeing in scientific domains is that our ability to generate physically accurate synthetic data has reached the point where it is reliable enough to form the foundation of a new generation of tools, tools that simply would not exist without the synthetic data.
Across your LQM and simulation work, where have you seen synthetic data actually improve downstream model performance compared to training only on scarce real-world data?
All of our LQM work at SandboxAQ in the field of simulation is founded on the utility of synthetic data. The playbook is very simple: we generate synthetic data using well-understood, reliable computational methods, and then we train downstream models to extrapolate from that data.
There are two layers of validation. First, the downstream model must achieve comparable accuracy to the original computational methods. Second, we use the downstream model to make predictions about new real-world scenarios that have not yet been tested, and then we validate those predictions in the lab. Real-world data is so scarce that there isn’t even a comparison to make between “real-world only” and “synthetic”: you simply wouldn’t have a downstream model if you didn’t use synthetic data.
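In practice, that two-layer playbook might look something like the sketch below. This is a simplified illustration only; the function names and model interface are placeholders for the sake of the example, not SandboxAQ code.

```python
# Minimal sketch of the two-layer validation loop described above.
# The surrogate-model interface and tolerances are illustrative placeholders.
import numpy as np

def layer_one_check(model, sim_inputs, sim_outputs, tolerance):
    """Layer 1: the trained surrogate must reproduce the physics-based
    solver on a held-out slice of the synthetic data."""
    preds = model.predict(sim_inputs)
    mae = float(np.mean(np.abs(preds - sim_outputs)))
    return mae <= tolerance, mae

def layer_two_check(model, new_inputs, lab_measurements, tolerance):
    """Layer 2: predictions on genuinely new scenarios are compared
    against real experimental measurements once the lab work is done."""
    preds = model.predict(new_inputs)
    mae = float(np.mean(np.abs(preds - np.asarray(lab_measurements))))
    return mae <= tolerance, mae
```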
Research on computer vision and object detection shows that models trained exclusively on synthetic images often underperform on real-world test sets, even when you have far more synthetic data points, due to the simulation-to-reality gap. How do you quantify and manage that gap in your own pipelines?
Our synthetic data generation is based on a first-principles understanding of the physical world. The way we generate synthetic data is by simulating what would actually happen according to the laws of physics, that is, according to fundamental equations. The equations are true, and we have many techniques to solve them. We also know the limitations of those techniques quite well: we can estimate how accurate our synthetic data is. All of this is just the next evolution of computational science, which has been tackling the same kinds of issues for the last hundred years. Doing it properly is why we have so many physicists at SandboxAQ!
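One common way computational scientists put a number on that accuracy is to spot-check the fast solver that generates the bulk of the synthetic data against a slower, higher-fidelity calculation. The sketch below illustrates the idea; the solver wrappers are hypothetical and this is a generic convergence-style check, not a description of SandboxAQ’s actual pipeline.

```python
# Illustrative error estimate for a synthetic-data generator: re-run a random
# sample of inputs at higher fidelity and record the discrepancy.
# solve_fast and solve_high_fidelity are hypothetical solver wrappers.
import numpy as np

def estimate_generator_error(inputs, solve_fast, solve_high_fidelity,
                             sample_size=100, seed=0):
    """Return mean and worst-case absolute error of the fast solver,
    estimated on a random subset of the generated inputs."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(inputs), size=min(sample_size, len(inputs)),
                     replace=False)
    errors = [abs(solve_high_fidelity(inputs[i]) - solve_fast(inputs[i]))
              for i in idx]
    return float(np.mean(errors)), float(np.max(errors))
```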
Given how easy it is to generate huge volumes of synthetic data now, what new governance or evaluation practices do enterprises need so they don’t silently degrade model robustness or fairness?
The most important thing is to define benchmarks and test against them regularly. More data is not always better data, and curating high-quality data is also a critical activity. There should be metrics for the data itself that measure how good it is, whether that is label quality, statistical properties, or some quantitative measure of accuracy. One simple evaluation practice is to maintain a strict validation dataset that is never used for training and regularly test against it for bias and model drift.
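A minimal illustration of those last two practices, a frozen hold-out set plus a periodic drift check, might look like the sketch below. The split fraction, drift threshold, and fingerprinting choice are assumptions made for the example, not a prescribed governance standard.

```python
# Sketch: freeze a validation set once, fingerprint it so accidental reuse or
# modification is detectable, and run a crude drift check against it later.
import hashlib
import pandas as pd

def freeze_holdout(df: pd.DataFrame, frac: float = 0.1, seed: int = 0):
    """Split off a validation set once and fingerprint it."""
    holdout = df.sample(frac=frac, random_state=seed)
    train = df.drop(holdout.index)
    fingerprint = hashlib.sha256(
        pd.util.hash_pandas_object(holdout, index=True).values.tobytes()
    ).hexdigest()
    return train, holdout, fingerprint

def drift_check(reference: pd.Series, current: pd.Series,
                threshold: float = 0.1) -> bool:
    """Flag drift if the mean shifts by more than `threshold` standard
    deviations of the reference column."""
    shift = abs(current.mean() - reference.mean()) / (reference.std() + 1e-9)
    return shift <= threshold
```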
If a CDO or Head of Data Science came to you today asking when they should not use synthetic data, what would be your non-negotiable red lines?
The most important thing to remember is that downstream validation must always be done with real-world data, and that real-world performance is the true measure of success. Synthetic data generation is a useful intermediate step in many cases, but it should not be used as the final judge of success. We should also not use synthetic data as a substitute for fundamental understanding. The methods we use to generate our synthetic data might be incomplete, and the data we create will likewise be incomplete. It is not the ground truth. The reliability of synthetic data must always be questioned, and we have to strive for a foundational understanding of our data and the ways we generate it.
Leichenauer’s red lines cut through the synthetic data hype: real-world validation remains non-negotiable, and simulation must never replace fundamental understanding of the systems it models.