🤖 AI Summary
This work addresses identity drift and cross-modal inconsistency in controllable multimodal generation, which the authors attribute to the lack of explicit structure in how semantic attributes evolve under inference-time conditioning. They propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors through graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving the reference identity. The authors also construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments report consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses of graph sensitivity, extensibility, and robustness.
📝 Abstract
Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.
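To make the phrase "graph-constrained optimal transport" more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes one possible reading: attribute latents are softly matched to graph-node prototypes with entropic (Sinkhorn) optimal transport, and the transport cost is a feature distance plus a penalty proportional to the shortest-path distance on the prior graph. The 4-node chain graph, the prototypes, the weight `lam`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def graph_shortest_paths(adj):
    """All-pairs hop distances via Floyd-Warshall (assumes a connected graph)."""
    n = adj.shape[0]
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist


def sinkhorn(cost, a, b, eps=0.1, n_iters=500):
    """Entropic OT plan between marginals a and b for a given cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)

# Hypothetical prior graph: a 4-node chain (e.g. ordered attribute levels).
adj = np.zeros((4, 4))
for i in range(3):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
graph_dist = graph_shortest_paths(adj)

# Hypothetical node prototypes and noisy attribute latents with nominal nodes.
prototypes = rng.normal(size=(4, 8))
labels = np.array([0, 0, 1, 2, 3, 3])
latents = prototypes[labels] + 0.1 * rng.normal(size=(6, 8))

# Cost = squared feature distance + lam * graph distance from the nominal node.
lam = 1.0  # assumed trade-off weight
feat_cost = ((latents[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
cost = feat_cost + lam * graph_dist[labels]
cost = cost / cost.max()  # rescale so exp(-cost / eps) stays well-conditioned

a = np.full(6, 1.0 / 6)  # uniform mass over latents
b = np.full(4, 1.0 / 4)  # uniform mass over graph nodes
plan = sinkhorn(cost, a, b)
```

Under these assumptions, each row of `plan` is a soft assignment of one latent to the graph nodes, and the graph term makes assignments to nodes far from the nominal node more expensive. A training loss could, for example, use `(plan * cost).sum()` as an alignment term; whether Controlla does so is not stated in the abstract.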