🤖 AI Summary
This work proposes a general-purpose approach to speaker-agnostic lip-sync generation that fuses audiovisual features through a residual CBAM attention module embedded within a U-Net architecture. To enhance cross-modal alignment, a semantic alignment module expands the receptive field, enabling precise matching between audio and visual representations. The method also incorporates the LPIPS perceptual loss to improve both the visual realism of the generated faces and the consistency between audio and video. Experimental results show that the proposed method achieves state-of-the-art performance on both subjective evaluations and objective metrics, generalizing well to unseen speakers while delivering high lip-sync accuracy and image quality.
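The core mechanism, a residual CBAM block, applies channel attention followed by spatial attention and adds the result back to the input. A minimal NumPy sketch of that idea is below; the shapes, weight matrices, kernel size, and reduction ratio are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Scale each channel by a weight derived from global avg/max pooling.

    x: (C, H, W); w1: (C//r, C) and w2: (C, C//r) form the shared MLP.
    """
    avg = x.mean(axis=(1, 2))                                  # (C,)
    mx = x.max(axis=(1, 2))                                    # (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))             # (C,)
    return x * att[:, None, None]

def spatial_attention(x, kernel):
    """Weight each spatial location via a conv over channel-pooled maps.

    kernel: (2, k, k) applied to stacked channel-wise avg and max maps.
    """
    feat = np.stack([x.mean(axis=0), x.max(axis=0)])           # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    att = np.zeros((H, W))
    for i in range(H):                                         # naive "same" conv
        for j in range(W):
            att[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(att)[None]

def residual_cbam_block(x, w1, w2, kernel):
    """CBAM refines the features; the skip connection preserves the input path."""
    y = channel_attention(x, w1, w2)
    y = spatial_attention(y, kernel)
    return x + y
```

Because both attention gates lie in (0, 1), the refined branch never overwhelms the skip path, which is what lets the block be dropped into a U-Net without destabilizing training.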