🤖 AI Summary
This study investigates whether mainstream vision models can perceive illusory motion in static images, such as the Rotating Snakes illusion, in a manner analogous to human vision. By evaluating several optical flow models under simulated saccadic eye movements, including a biologically inspired Dual-Channel architecture, the work shows that most existing models fail to replicate human perception of static illusory motion. Only the Dual-Channel model, when subjected to eye-movement simulation, produces rotational optical flow consistent with human perceptual reports. Ablations indicate that this behavior depends on both luminance-based and higher-order color-feature-based motion signals, together with a recurrent attention mechanism that integrates local cues. The findings point to a substantial gap between current optical flow models and human motion processing and suggest that biologically inspired design can help close it.
📝 Abstract
Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color-feature-based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.
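The abstract does not specify how saccades are simulated or how rotational flow is scored, so the following is only an illustrative sketch under stated assumptions. It assumes a saccade can be approximated by translating the static image between frames, and that "rotational motion" can be summarized as the mean tangential component of a flow field around a chosen center. The function names `saccade_frames` and `mean_tangential_flow` are hypothetical and do not come from the paper.

```python
import numpy as np

def saccade_frames(image, shifts):
    """Hypothetical saccade simulation: one frame per (dy, dx) shift.

    Assumption: each eye movement is modeled as a circular translation
    of the static image; the paper's actual procedure may differ.
    """
    return [np.roll(image, shift=(dy, dx), axis=(0, 1)) for dy, dx in shifts]

def mean_tangential_flow(flow, center):
    """Average flow component along the tangent direction around `center`.

    flow:   (H, W, 2) array, flow[..., 0] = horizontal, flow[..., 1] = vertical.
    center: (row, col) of the assumed rotation center.
    The tangent at offset (rx, ry) is taken as (-ry, rx) / |r|, so the sign
    of the result indicates the dominant rotation direction.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    ry, rx = ys - center[0], xs - center[1]
    norm = np.hypot(rx, ry)
    valid = norm > 0  # skip the center pixel, where the tangent is undefined
    tx = -ry[valid] / norm[valid]
    ty = rx[valid] / norm[valid]
    return float(np.mean(flow[..., 0][valid] * tx + flow[..., 1][valid] * ty))
```

In such a setup, consecutive frames from `saccade_frames` would be passed to whichever optical flow model is under test, and the resulting flow (after subtracting the known image shift) would be scored with `mean_tangential_flow` around each disk of the Rotating Snakes pattern. A score near zero would indicate no net rotation, while a consistently signed score would indicate rotation in one direction.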