🤖 AI Summary
Traditional image editing relies on manual text prompts, posing significant accessibility barriers for individuals with motor or speech impairments. To address this, we propose LoongX—the first hands-free image editing framework driven by multimodal neurophysiological signals (EEG, fNIRS, PPG, and head motion). Methodologically, LoongX introduces (1) a cross-scale state-space module with dynamic gated fusion to efficiently decode heterogeneous physiological signals, and (2) the first neural-signal-driven diffusion Transformer (DiT), enabled by contrastive pretraining and cross-modal feature alignment for semantically controllable editing. Experiments demonstrate that LoongX achieves performance on par with text-based methods on CLIP-I and DINO metrics; when augmented with speech input, it further surpasses them on CLIP-T. These results demonstrate the feasibility and promise of neurosignal-driven generative models for accessible, creative human–AI interaction.
📝 Abstract
Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state-space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module then aggregates these features into a unified latent space, which is aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders with contrastive learning to align cognitive states with the semantic intent of embedded natural-language instructions. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
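The core DGF idea described above — weighting each modality's encoded features before combining them into one unified latent — can be sketched as follows. This is a minimal illustrative sketch: the feature dimension, the linear gating projection, and the softmax gating form are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared feature dimension (illustrative assumption)

# One pre-encoded feature vector per modality, as if produced by the
# modality-specific CS3 encoders: EEG, fNIRS, PPG, head motion.
feats = rng.standard_normal((4, d))

# Hypothetical gating projection: scores each modality's usefulness.
Wg = rng.standard_normal(d)
scores = feats @ Wg                   # (4,) per-modality gating logits
gates = np.exp(scores - scores.max())
gates /= gates.sum()                  # softmax over the four modalities

# Gated sum yields a single unified latent vector for downstream DiT use.
fused = gates @ feats                 # shape (d,)
```

A learned fusion of this shape lets the model downweight noisy channels (e.g., motion artifacts in EEG) on a per-sample basis rather than averaging all modalities equally.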