🤖 AI Summary
Problem: Conventional CNNs are far deeper than biological visual systems and fail to accurately model retinal neural responses, particularly those of looming-sensitive neurons. Method: We propose Higher-Order Convolutional Neural Networks (HoCNN), which embed multiplicative higher-order spatiotemporal interactions directly within the convolutional kernels, increasing representational capacity without increasing network depth. Because higher-order computation is built into the convolution operator itself, the network can naturally learn geometric transformations such as scaling. Contribution/Results: On salamander and mouse retinal recordings, HoCNN reaches a neural response prediction correlation coefficient of up to 0.75, approaching the retinal reliability ceiling of 0.80±0.02, and outperforms standard architectures while using only 50% of the training data. For scaling parameters, the correlation is 0.72 (vs. 0.32 for baseline models). HoCNN thus offers a shallower, more biologically plausible architecture with better data efficiency and consistent gains across species.
📝 Abstract
We present a novel approach to neural response prediction that incorporates higher-order operations directly within convolutional neural networks (CNNs). Our model extends traditional 3D CNNs by embedding higher-order operations within the convolutional operator itself, enabling direct modeling of multiplicative interactions between neighboring pixels across space and time. This increases the representational power of CNNs without increasing their depth, thereby addressing the architectural disparity between deep artificial networks and the relatively shallow processing hierarchy of biological visual systems. We evaluate our approach on two distinct datasets: salamander retinal ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC responses to controlled geometric transformations. Our higher-order CNN (HoCNN) achieves superior performance while requiring only half the training data used by standard architectures, reaching correlation coefficients of up to 0.75 with neural responses (against a retinal reliability of 0.80$\pm$0.02). When integrated into state-of-the-art architectures, our approach consistently improves performance across different species and stimulus conditions. Analysis of the learned representations reveals that our network naturally encodes fundamental geometric transformations, particularly the scaling parameters that characterize object expansion and contraction. This capability is especially relevant for specific cell types, such as transient OFF-alpha and transient ON cells, which are known to detect looming objects and object motion, respectively, and for which our model shows marked improvement in response prediction. The correlation coefficients for scaling parameters are more than twice as high in HoCNN (0.72) as in baseline models (0.32).
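To make the idea of multiplicative interactions inside the convolution concrete, the following is a minimal sketch of a second-order convolution on a 1D signal. It is an illustration only, not the HoCNN implementation: the abstract does not specify the exact formulation, so the function name `second_order_conv1d`, the restriction to second order, the 1D input, and the weight layout `w2` are all assumptions. The sketch adds to an ordinary sliding-window weighted sum a second term that weights products of pairs of inputs within the same window.

```python
def second_order_conv1d(x, w1, w2, bias=0.0):
    """Valid-mode 1D convolution with an added quadratic (pairwise product) term.

    Hypothetical illustration, not the paper's operator.
    x  : list of floats, the input signal
    w1 : list of K floats, first-order weights (the ordinary convolution kernel)
    w2 : K x K nested list, second-order weights applied to x[t+i] * x[t+j]
    Uses the cross-correlation convention common in deep learning (no kernel flip).
    """
    k = len(w1)
    out = []
    for t in range(len(x) - k + 1):
        patch = x[t:t + k]
        # First-order term: the usual weighted sum over the window.
        linear = sum(w1[i] * patch[i] for i in range(k))
        # Second-order term: weighted sum over all pairwise products in the window.
        quadratic = sum(
            w2[i][j] * patch[i] * patch[j]
            for i in range(k)
            for j in range(k)
        )
        out.append(bias + linear + quadratic)
    return out


signal = [1.0, 2.0, 3.0, 4.0]
w1 = [1.0, -1.0]         # linear part: x[t] - x[t+1]
w2 = [[0.0, 1.0],
      [0.0, 0.0]]        # quadratic part: x[t] * x[t+1]
response = second_order_conv1d(signal, w1, w2)
print(response)          # [1.0, 5.0, 11.0]
```

Setting every entry of `w2` to zero recovers a standard convolution, which is one way to see how such a layer can add representational power at fixed depth. The 3D spatiotemporal case described in the abstract would apply the same idea over windows spanning two spatial dimensions and time.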