Neural-Driven Image Editing

📅 2025-07-07
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Traditional image editing relies on manual text prompts, posing significant accessibility barriers for individuals with motor or speech impairments. To address this, we propose LoongX, the first hands-free image editing framework driven by multimodal neurophysiological signals (EEG, fNIRS, PPG, and head motion). Methodologically, LoongX introduces (1) a cross-scale state space (CS3) module with dynamic gated fusion (DGF) to efficiently decode heterogeneous physiological signals, and (2) the first neural-signal-driven diffusion transformer (DiT), enabled by contrastive pretraining and cross-modal feature alignment for semantically controllable editing. Experiments show that LoongX performs on par with text-based methods on the CLIP-I and DINO metrics; when augmented with speech input, it surpasses them on CLIP-T. These results demonstrate the feasibility of neurosignal-driven generative models for accessible, creative human–AI interaction.
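
The paper's implementation is not shown on this page; as a rough sketch of what the DGF fusion step could look like, here is a minimal PyTorch example. The module name, feature dimensions, and the softmax gating scheme are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of dynamic gated fusion (DGF) over modality features.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DynamicGatedFusion(nn.Module):
    def __init__(self, dims: dict, d_model: int = 512):
        super().__init__()
        # One projection per modality (EEG, fNIRS, PPG, head motion).
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Scalar gate per modality, predicted from its projected feature.
        self.gate = nn.ModuleDict({m: nn.Linear(d_model, 1) for m in dims})

    def forward(self, feats):
        # feats[m]: (batch, dim_m) features from each modality encoder.
        projected = {m: self.proj[m](x) for m, x in feats.items()}
        logits = torch.cat([self.gate[m](z) for m, z in projected.items()], dim=-1)
        weights = torch.softmax(logits, dim=-1)                  # (batch, n_mod)
        stacked = torch.stack(list(projected.values()), dim=1)  # (batch, n_mod, d_model)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)     # fused latent

# Usage with made-up per-modality feature sizes:
dims = {"eeg": 128, "fnirs": 64, "ppg": 32, "motion": 16}
fusion = DynamicGatedFusion(dims)
fused = fusion({m: torch.randn(4, d) for m, d in dims.items()})  # (4, 512)
```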

📝 Abstract
Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
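
The contrastive pretraining described above can be illustrated with a standard CLIP-style InfoNCE objective that pulls each neural-signal embedding toward the text embedding of its paired edit instruction. The sketch below is a generic formulation under assumed shapes and temperature, not the released code.

```python
# Generic CLIP-style contrastive loss for aligning neural-signal embeddings
# with text embeddings of edit instructions. Shapes/temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_align_loss(neural_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    # neural_emb, text_emb: (batch, d) paired embeddings.
    n = F.normalize(neural_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = n @ t.T / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(n.size(0), device=n.device)
    # Symmetric InfoNCE: each neural sample matches its own text, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```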
Problem

Research questions and friction points this paper is trying to address.

How to enable hands-free image editing for users who cannot rely on manual text prompts
How to decode heterogeneous multimodal neurophysiological signals (EEG, fNIRS, PPG, head motion) into a unified representation of user intent
How to align decoded cognitive states with semantic editing intentions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Drives image editing with multimodal neurophysiological signals instead of manual prompts
Introduces the CS3 module for modality-specific encoding and the DGF module for fusion (see the toy sketch after this list)
Pre-trains signal encoders with contrastive learning to align cognitive states with semantic intentions from natural language
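
The CS3 module's internals are not specified at this level of detail here; as a purely illustrative toy, the sketch below combines features extracted at several temporal scales with a simple linear state-space recurrence over a 1-D physiological signal. All names and hyperparameters are assumptions.

```python
# Toy cross-scale encoder: multi-scale temporal features passed through a
# simple linear state-space scan. Illustrative only, not the paper's CS3 code.
import torch
import torch.nn as nn

class ToyCrossScaleSSM(nn.Module):
    def __init__(self, in_ch: int, d_state: int = 16, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.stems = nn.ModuleList(
            [nn.Conv1d(in_ch, d_state, kernel_size=3, padding=1) for _ in scales]
        )
        self.decay = nn.Parameter(torch.full((d_state,), 0.9))  # per-channel A
        self.head = nn.Linear(d_state * len(scales), d_state)

    def scan(self, x):
        # x: (batch, d_state, time); h_t = decay * h_{t-1} + x_t, return final h.
        h = x.new_zeros(x.size(0), x.size(1))
        for t in range(x.size(-1)):
            h = self.decay * h + x[..., t]
        return h

    def forward(self, x):
        # x: (batch, channels, time), e.g. a window of EEG samples.
        feats = [self.scan(stem(x[..., ::s]))        # downsample per scale
                 for stem, s in zip(self.stems, self.scales)]
        return self.head(torch.cat(feats, dim=-1))   # (batch, d_state)

encoder = ToyCrossScaleSSM(in_ch=8)
emb = encoder(torch.randn(4, 8, 256))  # (4, 16)
```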