🤖 AI Summary
This work addresses the limitations of current deep vision models, which rely on homogeneous spatial layer stacking and thus fail to emulate the decoupling between low-level perception and high-level cognition observed in human vision, resulting in opaque decision-making. To overcome this, the authors propose a psychophysically inspired deep visual encoding framework that uniquely integrates complex-valued representations with data-driven frequency-domain filtering. This approach learns task-relevant semantic structures within distinct frequency subbands, yielding intermediate abstract representations analogous to those in biological vision. By doing so, the model substantially reduces its dependence on deep architectural stacking while maintaining competitive performance, and simultaneously produces highly interpretable object-part features that enhance both transparency and computational efficiency.
📝 Abstract
Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque. In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands. Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers. Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.