🤖 AI Summary
This work addresses the challenge video creators face in aligning background music with a video's emotional tone and narrative intent, a problem compounded by the lack of efficient ways to evaluate candidate tracks. The authors propose a natural language–guided approach for generating and iteratively refining video soundtracks: diverse musical pieces are synthesized with a text-to-music model, and each track's emotional attributes, specifically valence and energy, are translated into visual cues. Combined with salient visual content extracted from the video, these cues yield contextualized thumbnails that support rapid browsing and comparison of candidate soundtracks. User studies indicate that this approach makes reviewing and comparing soundtracks more efficient, with participants describing the process as playful and creatively enriching.
📝 Abstract
Music shapes the tone of videos, yet creators often struggle to find soundtracks that match their video's mood and narrative. Recent text-to-music models let creators generate music from text prompts, but our formative study (N=8) shows that they find it difficult to construct diverse prompts, quickly review and compare tracks, and understand each track's impact on the video. We present VidTune, a system that supports soundtrack creation by generating diverse music options from a creator's prompt and producing contextual thumbnails for rapid review. VidTune extracts representative video subjects to ground thumbnails in context, maps each track's valence and energy onto visual cues such as color and brightness, and depicts prominent genres and instruments. Creators can refine tracks through natural-language edits, which VidTune expands into new generations. In a controlled user study (N=12) and an exploratory case study (N=6), participants found VidTune helpful for efficiently reviewing and comparing music options and described the process as playful and enriching.
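The abstract does not specify how valence and energy map onto visual cues, so the sketch below is only one plausible reading: valence drives hue (cool for negative, warm for positive) and energy drives brightness. The function name, hue range, and constants are hypothetical illustrations, not taken from the paper.

```python
import colorsys

def mood_to_tint(valence: float, energy: float) -> tuple[int, int, int]:
    """Map a track's valence and energy (both in [0, 1]) to an RGB tint.

    Hypothetical scheme: valence interpolates the hue from cool blue
    (low valence) to warm orange (high valence), while energy sets the
    brightness, so energetic tracks produce brighter thumbnails.
    """
    valence = max(0.0, min(1.0, valence))
    energy = max(0.0, min(1.0, energy))
    hue = 0.60 - 0.52 * valence          # ~0.60 (blue) down to ~0.08 (orange)
    saturation = 0.65                    # fixed, arbitrary choice
    brightness = 0.35 + 0.60 * energy    # dim for calm, bright for energetic
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, brightness)
    return int(r * 255), int(g * 255), int(b * 255)

# A happy, high-energy track yields a bright warm tint;
# a sad, low-energy track yields a dim cool tint.
print(mood_to_tint(valence=0.9, energy=0.8))
print(mood_to_tint(valence=0.2, energy=0.2))
```

In a pipeline like the one described, such a tint could be blended over the extracted video-subject imagery so that scanning a row of thumbnails conveys each candidate track's mood at a glance.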