🤖 AI Summary
Existing text/video-to-audio generation methods suffer from limitations in audio fidelity, cross-modal alignment accuracy, and controllability over duration and loudness; video-driven approaches often require additional alignment-specific training. This paper introduces the first LLM-augmented diffusion-based audio agent framework for stepwise, instruction-guided generation, enabling high-fidelity, long-duration, multi-event audio synthesis and fine-grained editing from either text or video input. Key innovations include: (1) a timestamp-free semantic-temporal joint alignment strategy; and (2) an integrated architecture combining the TTA diffusion model, GPT-4 for instruction decomposition, fine-tuned Gemma2-2B-it, cross-modal conditional encoding, and a multi-stage agent scheduling mechanism. Experiments demonstrate state-of-the-art performance on both text-to-audio (TTA) and video-to-audio (VTA) tasks, with low training overhead, zero-shot editability, and robust variable-length generation capability.
📝 Abstract
We introduce Audio-Agent, a multimodal framework for audio generation, editing, and composition based on text or video inputs. Conventional approaches to text-to-audio (TTA) tasks typically perform single-pass inference from a text description. While straightforward, this design struggles to produce high-quality audio under complex text conditions. In our method, we use a pre-trained TTA diffusion network as the audio-generation agent, working in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. Audio-Agent can thus generate high-quality audio closely aligned with the provided text or video, even when the condition involves multiple complex events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. Instead, we propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Our framework thus offers a comprehensive solution to both TTA and VTA tasks without substantial computational overhead in training.
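The decompose-then-generate loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `decompose` stands in for the GPT-4 call that splits a complex condition into atomic instructions, and `generate_clip` stands in for the pre-trained TTA diffusion agent; all names and the `Instruction` fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    description: str   # one atomic audio event, e.g. "a dog barks twice"
    duration_s: float  # requested clip length (variable-length control)
    gain_db: float     # requested loudness offset (variable-volume control)

def decompose(prompt: str) -> list[Instruction]:
    """Stand-in for the GPT-4 step that decomposes a complex text
    condition into atomic, specific instructions. A real system would
    prompt GPT-4 here; this stub just splits on the word 'then'."""
    parts = [p.strip() for p in prompt.split("then") if p.strip()]
    return [Instruction(p, duration_s=2.0, gain_db=0.0) for p in parts]

def generate_clip(inst: Instruction, sr: int = 16000) -> list[float]:
    """Stand-in for the TTA diffusion agent: returns silence of the
    requested length instead of real audio samples."""
    return [0.0] * int(inst.duration_s * sr)

def audio_agent(prompt: str, sr: int = 16000) -> list[float]:
    """Generate one clip per atomic instruction and assemble them."""
    audio: list[float] = []
    for inst in decompose(prompt):
        audio.extend(generate_clip(inst, sr))  # a real system would mix/crossfade
    return audio

out = audio_agent("rain falls then thunder rumbles")
print(len(out))  # 2 clips x 2.0 s x 16000 Hz = 64000 samples
```

The key design point the sketch captures is that each call to the generation agent receives a simple, single-event condition, so the diffusion model never has to resolve a complex multi-event prompt in one pass.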