Google DeepMind Unveils Genie 3

Genie 3 turns a single image or text prompt into a playable world, using a built-in latent action representation to translate user inputs into character movement, physics, and environment responses in real time.

Google DeepMind has unveiled Genie 3, a new world model capable of generating interactive 3D environments from inputs as simple as a single image or a short text prompt. The model, trained without supervision or environment labels, lets users control a character in a simulated world derived from the input.

A text prompt describing an environment is enough: the model then simulates it in real time at 24 frames per second, maintaining visual consistency at 720p for several minutes.

Genie 3 is designed as a world model: it predicts future frames, rewards, and actions from video data. Unlike previous models, Genie learns in an unsupervised way, trained purely on internet videos, with no labelled environments or action annotations.

It can generalise to new visual inputs, generating interactive, controllable environments without fine-tuning. The training set includes 30 million video clips, with the paired action traces inferred by the model itself rather than supplied as labels, making it one of the largest unsupervised datasets for world modelling to date.
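
To make that training recipe concrete, here is a minimal PyTorch sketch of the unsupervised idea: infer a latent action from a pair of consecutive frames, then reconstruct the second frame from the first frame plus that action. Every module name, layer size, and shape below is an illustrative assumption, not DeepMind's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Infers a compact latent action code from two consecutive frames."""
    def __init__(self, frame_dim: int = 1024, action_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        # The "action" is whatever compressed signal best explains the transition.
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

class TransitionDecoder(nn.Module):
    """Predicts the next frame from the current frame plus the latent action."""
    def __init__(self, frame_dim: int = 1024, action_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, frame_t, action):
        return self.net(torch.cat([frame_t, action], dim=-1))

# Toy feature vectors standing in for two consecutive video frames.
encoder, decoder = LatentActionEncoder(), TransitionDecoder()
frame_t, frame_t1 = torch.randn(4, 1024), torch.randn(4, 1024)

action = encoder(frame_t, frame_t1)    # inferred, never labelled
pred_t1 = decoder(frame_t, action)     # predict the next frame
loss = F.mse_loss(pred_t1, frame_t1)   # pure reconstruction loss
```

In the real pipeline the inputs would be token grids from the video tokeniser rather than raw feature vectors, but the training signal is the same: reconstruction error alone, with no action labels anywhere in the data.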

“Genie 3 is the first real-time interactive general-purpose world model,” Shlomi Fruchter, a Research Director at DeepMind, said during a press briefing. 

“It goes beyond narrow world models that existed before. It’s not specific to any particular environment. It can generate both photo-realistic and imaginary worlds, and everything in between.”

Users provide a single image (drawn, rendered, or a real-world photo). Genie 3 then:

  • Extracts the spatial layout from the image,
  • Uses its latent action model to understand possible movements,
  • Renders a dynamic, controllable 3D environment where the user can interact as if playing a side-scrolling video game.

According to DeepMind, the model supports motion and interaction via a built-in latent action representation that handles character movement based on user inputs. This allows the system to simulate physics, character control, and environment responses in real time.
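
Put together, the interaction loop described above might look like the following hypothetical sketch. The `world_model` interface and the `read_user_input` and `display` helpers are assumptions for illustration; DeepMind has not published an API for Genie 3.

```python
import time

FPS = 24
FRAME_BUDGET = 1.0 / FPS  # real time at 24 fps leaves ~41.7 ms per frame

def run_session(world_model, first_frame, read_user_input, display):
    """Drive an interactive session against a world model (hypothetical API)."""
    # Extract the spatial layout of the input image into an internal state.
    state = world_model.init_state(first_frame)
    while True:
        start = time.monotonic()
        # Map a raw user input (e.g. an arrow key) to a latent action code.
        action = world_model.to_latent_action(read_user_input())
        # Predict the next state and frame conditioned on that action.
        state, frame = world_model.step(state, action)
        display(frame)
        # Sleep off whatever is left of the frame budget to hold 24 fps.
        time.sleep(max(0.0, FRAME_BUDGET - (time.monotonic() - start)))
```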

Genie 3’s pipeline includes:

  • Spatiotemporal Video Tokeniser: Converts video frames into discrete tokens for efficient learning.
  • Latent Action Model: Learns a compressed representation of actions directly from raw video, without explicit action labels.
  • Dynamics Model: Predicts the next frames and states.
  • Renderer: Converts learned tokens back into realistic 3D-like frames.
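
A structural sketch of how these four components could compose is shown below. Every class and method signature is hypothetical; the point is only the data flow: frames become tokens, tokens plus a latent action become the next tokens, and the renderer turns tokens back into frames.

```python
class VideoTokenizer:
    """Spatiotemporal video tokeniser: raw frames -> discrete tokens."""
    def encode(self, frames):
        ...

class LatentActionModel:
    """Compresses the change between consecutive frames into an action code."""
    def infer(self, tokens_t, tokens_t1):
        ...

class DynamicsModel:
    """Predicts the next token grid from current tokens and an action code."""
    def predict(self, tokens, action):
        ...

class Renderer:
    """Turns predicted tokens back into viewable, 3D-like frames."""
    def decode(self, tokens):
        ...

class GeniePipeline:
    """End-to-end composition: tokenise, act, predict, render."""
    def __init__(self):
        self.tokenizer = VideoTokenizer()
        # The action model is used at training time to infer codes from video.
        self.action_model = LatentActionModel()
        self.dynamics = DynamicsModel()
        self.renderer = Renderer()

    def step(self, frame, action_code):
        tokens = self.tokenizer.encode(frame)                     # tokenise
        next_tokens = self.dynamics.predict(tokens, action_code)  # dynamics
        return self.renderer.decode(next_tokens)                  # render
```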

All components were trained end-to-end, using data from open internet sources, without any game engine involvement. However, Genie 3 isn’t available for public preview yet and will be rolled out to a select group of creators for testing. 

Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
