Yinghao Ma, a PhD candidate in the Centre for Digital Music at Queen Mary University of London, has helped develop AutoMV, the first open-source AI system capable of generating complete music videos directly from full-length songs.
Music-to-video generation remains a major challenge for generative AI. While recent video models can produce visually impressive short clips, they often struggle with long-form storytelling, musical alignment, and character consistency. AutoMV addresses these limitations by introducing a multi-agent AI system designed specifically for full-length music video production.
Developed through a collaboration between Queen Mary researchers and partners at Beijing University of Posts and Telecommunications, Nanjing University, Hong Kong University of Science and Technology, and the University of Manchester, AutoMV brings together expertise in music information retrieval, multimodal AI, and creative computing. The work was led by Dr Emmanouil Benetos, with contributions from Yinghao Ma, Dr Changjae Oh, and Chaoran Zhu from the Centre for Intelligent Sensing.
AutoMV works like a virtual film production team. First, it analyses a song’s musical structure, beats, and time-aligned lyrics. Then specialised AI agents, taking on roles such as screenwriter, director, and editor, collaborate to plan scenes, maintain character identity, and generate images and video clips. A final quality-control “verifier” agent checks for coherence and consistency, regenerating content where needed.
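To make that workflow concrete, here is a minimal Python sketch of how such an agent pipeline might be wired together. It is illustrative only: every name in it (`SongAnalysis`, `screenwriter`, `director`, `verifier`, `make_music_video`) is a hypothetical stand-in rather than AutoMV’s actual API, which lives in the code repository linked below.

```python
from dataclasses import dataclass

# Hypothetical data structures: AutoMV's real interfaces are in the
# repository linked at the end of this article.

@dataclass
class SongAnalysis:
    structure: list[str]             # e.g. ["intro", "verse", "chorus"]
    beats: list[float]               # beat times, in seconds
    lyrics: list[tuple[float, str]]  # (timestamp, lyric line) pairs

@dataclass
class Scene:
    section: str
    description: str
    characters: list[str]
    clip: str | None = None          # path to a generated video clip

def analyse_song(audio_path: str) -> SongAnalysis:
    """Stand-in for music analysis (structure, beats, aligned lyrics)."""
    return SongAnalysis(
        structure=["intro", "verse", "chorus"],
        beats=[0.0, 0.5, 1.0, 1.5],
        lyrics=[(2.0, "first line"), (6.0, "second line")],
    )

def screenwriter(analysis: SongAnalysis) -> list[Scene]:
    """Plan one scene per musical section, reusing a fixed cast so that
    character identity stays consistent across the whole video."""
    cast = ["lead character"]
    return [Scene(section=s, description=f"{s} scene", characters=cast)
            for s in analysis.structure]

def director(scene: Scene) -> Scene:
    """Stand-in for image and video generation for a single scene."""
    scene.clip = f"clips/{scene.section}.mp4"
    return scene

def verifier(scene: Scene) -> bool:
    """Stand-in quality check for coherence and consistency."""
    return scene.clip is not None

def make_music_video(audio_path: str, max_retries: int = 2) -> list[Scene]:
    analysis = analyse_song(audio_path)
    scenes = screenwriter(analysis)
    for scene in scenes:
        # Regenerate a scene until the verifier accepts it, up to a retry cap.
        for _ in range(max_retries + 1):
            scene = director(scene)
            if verifier(scene):
                break
    return scenes

if __name__ == "__main__":
    for scene in make_music_video("song.wav"):
        print(scene.section, "->", scene.clip)
```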
This approach allows AutoMV to produce music videos that follow a song from beginning to end, maintaining narrative flow and visual identity throughout. Human expert evaluations show that AutoMV significantly outperforms existing commercial tools, narrowing the gap between AI-generated videos and professionally produced music videos.
By lowering the cost of music video production from tens of thousands of pounds to roughly the cost of an API call, AutoMV has the potential to empower independent musicians, educators, and creators who previously lacked access to professional video production. As an open-source project, it also supports transparent, reproducible research and encourages community collaboration.
The team is actively inviting researchers and students to contribute to the codebase, extend the benchmark, and explore future directions for long-form, multimodal AI systems.
- Code: https://github.com/multimodal-art-projection/AutoMV
- Paper: https://arxiv.org/abs/2512.12196
- Project website: https://m-a-p.ai/AutoMV/
