FAME: Fairness-aware Attention-modulated Video Editing

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free video editing (VE) models tend to reinforce gender stereotypes when prompted with occupation-related terms. To address this, we propose a fine-tuning-free, fairness-aware editing method: it injects soft fairness embeddings into the text encoder to introduce debiasing priors, designs temporally decaying, region-constrained attention masks, and integrates fairness-sensitive similarity masks into both the temporal self-attention and cross-attention layers, jointly ensuring semantic alignment, motion consistency, and gender fairness. The method is entirely training-free and compatible with mainstream VE frameworks. Evaluated on our newly constructed fairness-oriented benchmark, FairVE, our approach achieves significant improvements in fairness alignment (+23.6%) and semantic fidelity (+18.4%) over all baselines. This work establishes a scalable, training-free paradigm for fair video editing.

📝 Abstract
Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME (Fairness-aware Attention-modulated Video Editing), which mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self-attention and prompt-to-region cross-attention to counteract the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self-attention, FAME introduces a region-constrained attention mask combined with time-decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross-attention, it reweights token-to-region matching scores by incorporating fairness-sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the correct visual regions and prevent temporal drift across frames. Extensive experiments on the new fairness-oriented VE benchmark FairVE demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.
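The "soft injection" of debiasing tokens described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the interpolation scheme, and the weight `alpha` are hypothetical, not FAME's actual mechanism.

```python
import numpy as np

def inject_fairness_embedding(prompt_emb, debias_emb, alpha=0.3):
    """Softly blend a debiasing prior into text-encoder output.

    prompt_emb: (seq_len, dim) prompt token embeddings.
    debias_emb: (dim,) debiasing-token embedding, e.g. an average of
                minority-representation embeddings (hypothetical choice).
    alpha:      soft-injection weight; the paper's schedule is unknown.
    """
    blended = (1.0 - alpha) * prompt_emb + alpha * debias_emb[None, :]
    # Renormalize rows so downstream attention sees comparable scales.
    norms = np.linalg.norm(blended, axis=-1, keepdims=True)
    return blended / np.clip(norms, 1e-8, None)
```

Because the blend happens on embeddings rather than weights, no fine-tuning is required, which is consistent with the training-free claim.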
Problem

Research questions and friction points this paper is trying to address.

Mitigating gender biases in profession-related video editing prompts
Preserving temporal consistency while integrating fairness embeddings
Balancing fairness alignment with semantic fidelity in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting debiasing tokens into the text encoder
Modulating temporal self-attention with region-constrained masks
Reweighting cross-attention using fairness-sensitive similarity
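The second contribution, a region-constrained temporal mask with time decay, can be sketched as an additive attention bias. This is a simplified per-frame layout (real masks presumably operate per spatial token), and the decay form and scale `tau` are assumptions, not the paper's formulation.

```python
import numpy as np

def temporal_region_bias(num_frames, region_ids, tau=2.0):
    """Additive attention bias: region constraint + exponential time decay.

    region_ids: (num_frames,) integer region label per frame slot.
    tau:        decay scale; attention between frames i and j is damped
                by exp(-|i - j| / tau).
    """
    t = np.arange(num_frames)
    decay = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
    same_region = region_ids[:, None] == region_ids[None, :]
    # Keep decayed intra-region weights, suppress inter-region ones.
    weights = np.where(same_region, decay, 1e-9)
    # Log-space bias to be added to attention logits before softmax.
    return np.log(weights)
```

Adding this bias to the logits enhances intra-region coherence (nearby frames in the same region dominate) while near-zero weights effectively mask inter-region interactions.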
Zhangkai Wu
Macquarie University, Australia
Xuhui Fan
Macquarie University, Australia
Zhongyuan Xie
Macquarie University, Australia
Kaize Shi
University of Southern Queensland, Australia
Zhidong Li
UTS
Machine learning, Data science
Longbing Cao
Distinguished Chair Professor in AI & ARC Future Fellow (Level 3), Macquarie University
Artificial intelligence, Data science, Machine learning, Behavior informatics, Enterprise innovation