Emu Video: Factorizing Text-to-Video Generation by Explicit Image...
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on...
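The two-step factorization described above can be sketched as a small pipeline. This is only an illustrative sketch, not the Emu Video implementation: the names `toy_text_to_image`, `toy_image_to_video`, and `generate_video` are hypothetical, the toy "models" just produce seeded random pixels, and the code assumes the second step is conditioned on both the text and the generated image, since the source sentence is cut off at that point.

```python
import random
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]
Video = List[Image]        # a video is a list of frames


def toy_text_to_image(prompt: str, size: int = 4) -> Image:
    """Hypothetical stand-in for step 1: an image generated from the text."""
    rng = random.Random(prompt)  # seeded by the prompt, so the output is deterministic
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def toy_image_to_video(prompt: str, first_frame: Image, num_frames: int = 8) -> Video:
    """Hypothetical stand-in for step 2: a video generated from text and image.

    Conditioning on both inputs is an assumption of this sketch.
    """
    rng = random.Random(prompt + "/video")
    frames: Video = [first_frame]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        # Each new frame is the previous one plus a small perturbation, clipped to [0, 1].
        frames.append(
            [[min(1.0, max(0.0, p + rng.uniform(-0.05, 0.05))) for p in row] for row in prev]
        )
    return frames


def generate_video(
    prompt: str,
    image_model: Callable[[str], Image] = toy_text_to_image,
    video_model: Callable[[str, Image, int], Video] = toy_image_to_video,
    num_frames: int = 8,
) -> Video:
    """Factorized generation: text -> image, then (text, image) -> video."""
    image = image_model(prompt)                     # step 1
    return video_model(prompt, image, num_frames)   # step 2


if __name__ == "__main__":
    video = generate_video("a dog running on a beach", num_frames=5)
    print(len(video), "frames of size", len(video[0]), "x", len(video[0][0]))
```

Keeping the two stages as separate callables mirrors the factorization: either stage can be swapped for a real model without changing the other.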