FAME: Fairness-aware Attention-modulated Video Editing

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free video editing (VE) models tend to reinforce gender stereotypes when prompted with occupation-related terms. To address this, we propose a fine-tuning-free, fairness-aware editing method: it injects soft fairness embeddings into the text encoder to introduce debiasing priors, designs temporally decaying, region-constrained attention masks, and integrates fairness-sensitive similarity masks into both the temporal self-attention and cross-attention layers, jointly ensuring semantic alignment, motion consistency, and gender fairness. The method is entirely training-free and compatible with mainstream VE frameworks. Evaluated on our newly constructed fairness-oriented benchmark, FairVE, our approach achieves significant improvements in fairness alignment (+23.6%) and semantic fidelity (+18.4%) over all baselines. This work establishes a scalable, training-free paradigm for fair video editing.

📝 Abstract
Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME (Fairness-aware Attention-modulated Video Editing), which mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self-attention and prompt-to-region cross-attention to counteract the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self-attention, FAME introduces a region-constrained attention mask combined with time-decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross-attention, it reweights token-to-region matching scores by incorporating fairness-sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the correct visual regions and prevent temporal drift across frames. Extensive experiments on the new fairness-oriented VE benchmark FairVE demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.
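The "soft injection" of debiasing tokens described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the interpolation scheme, and the weight `alpha` are hypothetical, not FAME's actual mechanism.

```python
import numpy as np

def inject_fairness_embedding(prompt_emb, debias_emb, alpha=0.3):
    """Softly blend a debiasing prior into text-encoder output.

    prompt_emb: (seq_len, dim) prompt token embeddings.
    debias_emb: (dim,) debiasing-token embedding, e.g. an average of
                minority-representation embeddings (hypothetical choice).
    alpha:      soft-injection weight; the paper's schedule is unknown.
    """
    blended = (1.0 - alpha) * prompt_emb + alpha * debias_emb[None, :]
    # Renormalize rows so downstream attention sees comparable scales.
    norms = np.linalg.norm(blended, axis=-1, keepdims=True)
    return blended / np.clip(norms, 1e-8, None)
```

Because the blend happens on embeddings rather than weights, no fine-tuning is required, which is consistent with the training-free claim.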
Problem

Research questions and friction points this paper is trying to address.

Mitigating gender biases in profession-related video editing prompts
Preserving temporal consistency while integrating fairness embeddings
Balancing fairness alignment with semantic fidelity in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting debiasing tokens into the text encoder
Modulating temporal self-attention with region-constrained masks
Reweighting cross-attention using fairness-sensitive similarity
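The second contribution, a region-constrained temporal mask with time decay, can be sketched as an additive attention bias. This is a simplified per-frame layout (real masks presumably operate per spatial token), and the decay form and scale `tau` are assumptions, not the paper's formulation.

```python
import numpy as np

def temporal_region_bias(num_frames, region_ids, tau=2.0):
    """Additive attention bias: region constraint + exponential time decay.

    region_ids: (num_frames,) integer region label per frame slot.
    tau:        decay scale; attention between frames i and j is damped
                by exp(-|i - j| / tau).
    """
    t = np.arange(num_frames)
    decay = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
    same_region = region_ids[:, None] == region_ids[None, :]
    # Keep decayed intra-region weights, suppress inter-region ones.
    weights = np.where(same_region, decay, 1e-9)
    # Log-space bias to be added to attention logits before softmax.
    return np.log(weights)
```

Adding this bias to the logits enhances intra-region coherence (nearby frames in the same region dominate) while near-zero weights effectively mask inter-region interactions.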
Zhangkai Wu
Macquarie University, Australia
Xuhui Fan
Macquarie University, Australia
Zhongyuan Xie
Macquarie University, Australia
Kaize Shi
University of Southern Queensland, Australia
Zhidong Li
UTS
Machine learning, Data science
Longbing Cao
Distinguished Chair Professor in AI & ARC Future Fellow (Level 3), Macquarie University
Artificial intelligence, Data science, Machine learning, Behavior informatics, Enterprise innovation