NVIDIA has released Cosmos 3, a new open-source foundation model for physical AI that combines physical reasoning, world generation and action generation within a single architecture.
The company is making Cosmos 3 Nano (8B parameters) and Cosmos 3 Super (32B parameters) available, along with training scripts, deployment tools, model checkpoints, and six synthetic datasets for robotics, autonomous driving, warehouse automation, and other physical AI applications.
The release consolidates capabilities that were previously spread across separate Cosmos models. Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture built around two interconnected components: a vision-language ‘Reasoner’ tower and a diffusion-based ‘Generator’ tower.
The Reasoner processes multimodal inputs, including text, images, videos, audio and actions, to understand physical environments, while the Generator produces future observations and action sequences conditioned on that understanding.
According to NVIDIA, the architecture is designed to eliminate the need for orchestration across multiple models and inference pipelines.
The Reasoner can operate independently for perception and analysis tasks, while generation workloads activate both towers, allowing the model to combine scene understanding with predictive world modelling and action generation.
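The split between Reasoner-only and two-tower workloads can be pictured with a small routing sketch. This is purely illustrative and not NVIDIA's API: the `Task` names, the `REASONER_ONLY` set and the `active_towers` function are hypothetical, chosen only to mirror the behaviour described above.

```python
from enum import Enum


class Task(Enum):
    # Hypothetical task labels; the article names these workload types
    # but does not define an enumeration like this one.
    PERCEPTION = "perception"
    ANALYSIS = "analysis"
    WORLD_GENERATION = "world_generation"
    ACTION_GENERATION = "action_generation"


# Assumption: perception and analysis are the Reasoner-only workloads.
REASONER_ONLY = {Task.PERCEPTION, Task.ANALYSIS}


def active_towers(task: Task) -> tuple[str, ...]:
    """Return which towers a given workload would activate."""
    if task in REASONER_ONLY:
        return ("reasoner",)
    # Generation workloads use the Reasoner's understanding to
    # condition the Generator, so both towers are active.
    return ("reasoner", "generator")


if __name__ == "__main__":
    for task in Task:
        print(task.value, "->", active_towers(task))
```

In this sketch the Reasoner is always present, which reflects the description that generation is conditioned on scene understanding rather than running on its own.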
The company is releasing two model sizes targeting different deployment environments. Cosmos 3 Nano, with 8 billion parameters, is designed for workstation-class systems, including NVIDIA RTX PRO 6000 GPUs, and for real-time robotics applications.
Cosmos 3 Super, at 32 billion parameters, targets datacenter deployments on Hopper and Blackwell GPU platforms for synthetic data generation and large-scale physical reasoning workloads. It supports multiple input-output combinations, covering text-to-image generation, video prediction, video reasoning, action-conditioned world modelling and robot policy learning.
NVIDIA positions the model as a foundation for applications including robotic manipulation, autonomous driving systems, warehouse monitoring, smart spaces and embodied AI agents.
Alongside the models, NVIDIA is open-sourcing six synthetic datasets through Hugging Face. The datasets cover embodied robot scenes, physical interaction simulations, spatial reasoning tasks, digital human environments, autonomous driving scenarios and warehouse operations.
Some datasets include physics annotations, such as object velocities, centre-of-mass displacements, and semantic segmentation labels, intended for post-training and evaluation workflows.
NVIDIA is also releasing post-training recipes covering supervised fine-tuning and action-oriented training. The workflows allow developers to adapt Cosmos 3 to domain-specific datasets and use cases, including forward-dynamics prediction, inverse-dynamics modelling, and policy generation for robotics systems.
For deployment, Cosmos 3 is available through NVIDIA NIM microservices. The initial release includes the Cosmos 3 Reasoner NIM, while a Generator NIM is planned.
NVIDIA also introduced Cosmos Human Evaluation (HUE), an open-source benchmark framework that evaluates generated videos using binary fact-verification questions across semantic alignment, physical laws, geometric reasoning and visual integrity.
The framework is intended to provide a more granular assessment of physical AI video generation systems than existing leaderboard-based evaluations.
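To illustrate how binary fact-verification answers could yield a per-dimension breakdown rather than a single leaderboard score, here is a minimal sketch. NVIDIA has not described HUE's aggregation in this announcement, so the mean pass rate used here is an assumption, and the category identifiers and the `pass_rates` function are hypothetical names.

```python
from collections import defaultdict

# The four evaluation dimensions named in the announcement,
# written as hypothetical identifiers.
CATEGORIES = (
    "semantic_alignment",
    "physical_laws",
    "geometric_reasoning",
    "visual_integrity",
)


def pass_rates(answers):
    """Aggregate (category, passed) pairs into a pass rate per category.

    Assumption: each question has a yes/no outcome and a category's
    score is the fraction of its questions answered "yes".
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in answers:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += 1
        passed[category] += int(bool(ok))
    # Only report categories that received at least one question.
    return {c: passed[c] / totals[c] for c in CATEGORIES if c in totals}


if __name__ == "__main__":
    sample = [
        ("physical_laws", True),
        ("physical_laws", False),
        ("visual_integrity", True),
    ]
    print(pass_rates(sample))
```

Reporting each dimension separately is what makes the assessment more granular: a video can score well on visual integrity while still failing physical-law checks.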
According to NVIDIA, Cosmos 3 leads its parameter classes on VANTAGE-Bench and ranks at or near the top of several public physical AI and video-generation benchmarks, including PAI-Bench, R-Bench, Physics-IQ and RoboLab.
The company also said Cosmos 3 currently ranks as the leading open-source model on selected image and video-generation leaderboards tracked by Artificial Analysis.
