Researchers from Meta and the University of California, Berkeley have introduced StreamDiT, an AI system that generates live video from text descriptions. StreamDiT produces video in real time at 16 frames per second on a single high-end GPU. The model has 4 billion parameters and outputs video at 512p resolution.
StreamDiT differs from previous approaches in that it generates video in a streaming fashion, frame by frame, rather than producing the entire clip in advance. This allows the system to respond to interactive prompts and change the video mid-stream, as the sketch below illustrates.
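To make the contrast concrete, here is a minimal Python sketch of the two generation styles. The model interface (model.denoise_frame, model.next_frame) and the prompt queue are hypothetical illustrations for this article, not the actual StreamDiT API.

```python
# Sketch only: `model` and its methods are assumed stand-ins,
# not StreamDiT's real interface.
import queue
import time

def generate_offline(model, prompt, num_frames):
    """Conventional approach: the whole clip is computed before playback."""
    return [model.denoise_frame(prompt, i) for i in range(num_frames)]

def generate_streaming(model, prompt_queue, fps=16):
    """Streaming approach: frames are emitted one at a time, and a new
    prompt arriving on the queue redirects the video mid-stream."""
    prompt = prompt_queue.get()                 # initial text prompt
    while True:
        try:
            prompt = prompt_queue.get_nowait()  # interactive prompt update
        except queue.Empty:
            pass
        yield model.next_frame(prompt)          # one new frame of the stream
        time.sleep(1.0 / fps)                   # pace output at ~16 fps
```

The key difference is that the streaming loop checks for a new prompt between frames, which is what makes mid-broadcast edits possible.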
StreamDiT's architecture is built for fast processing: the system uses a moving buffer that lets it work on several frames simultaneously, gradually refining their quality until each frame is ready for output. For versatility, the model was trained on a curated set of 3,000 high-quality videos together with a larger collection of 2.6 million clips. A simplified sketch of the buffer idea follows.
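The buffer can be pictured as a sliding window over frames at staggered noise levels: one model pass refines every buffered frame, the frame that has reached full quality leaves the front, and a fresh noise frame enters at the back. The sketch below is a simplified illustration under assumed names (denoise_step, BUFFER_SIZE, FRAME_SHAPE); it is not the paper's exact algorithm.

```python
import collections
import numpy as np

BUFFER_SIZE = 8              # assumed number of frames denoised together
FRAME_SHAPE = (512, 512, 3)  # assumed 512p frame layout, channels last

def new_noise():
    return np.random.randn(*FRAME_SHAPE)

def stream_with_buffer(denoise_step, num_output_frames):
    """Sliding-buffer sketch: the buffer holds frames at staggered noise
    levels; each pass refines them all by one step, the now-clean front
    frame is emitted, and a fresh noise frame joins at the back.
    `denoise_step(frames, levels)` is a hypothetical stand-in for one
    model forward pass, not StreamDiT's real interface."""
    # Stagger initial noise levels: the front frame is almost clean.
    frames = collections.deque(new_noise() for _ in range(BUFFER_SIZE))
    levels = collections.deque(range(1, BUFFER_SIZE + 1))  # steps remaining
    for _ in range(num_output_frames):
        frames = collections.deque(denoise_step(list(frames), list(levels)))
        levels = collections.deque(l - 1 for l in levels)  # one step cleaner
        yield frames.popleft()          # front frame is fully denoised
        levels.popleft()
        frames.append(new_noise())      # new frame starts at maximum noise
        levels.append(BUFFER_SIZE)
```

Because every pass advances all buffered frames at once, a finished frame leaves the buffer on every step, which is what sustains a steady output rate.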
StreamDiT outperformed other models, including ReuseDiffuse and FIFO-Diffusion, especially on dynamic scenes. Human evaluators rated its motion smoothness, animation completeness, and image quality highest on short videos of up to eight seconds. The team also tested a larger 30-billion-parameter version of the model, which delivered even higher quality but ran more slowly.
StreamDiT can already generate minute-long videos on the fly, respond to user prompts, and edit video in real time. The developers continue to work on improving the model's memory and smoothing transitions between video segments.