🤖 AI Summary
This paper addresses how receptive field responses to natural images vary under spatio-temporal image transformations (spatial scaling, spatial affine deformations, Galilean motion, and temporal scaling) by establishing a unified theoretical framework for the joint covariance of receptive field parameters under compositions of these transformations. It introduces *affine-normalized derivatives*, which generalize classical scale-normalized derivatives to the full affine group, and derives how the receptive field parameters must be transformed so that responses match under the composed transformations. The framework builds on spatio-temporal derivative operators applied to spatio-temporally smoothed image data, with spatial smoothing by affine Gaussian kernels, and the covariance properties are analyzed for the affine group and important subgroups of it. A concluding geometric analysis, under locally linearized perspective or projective transformations, shows how the derived properties allow receptive field responses to be related or matched for possibly moving local surface patches seen from different views, and for spatio-temporal events that unfold faster or slower. This provides a principled, covariant foundation for invariant low-level representations in both biological and artificial vision systems.
📝 Abstract
The influence of natural image transformations on receptive field responses is crucial for modelling visual operations in computer vision and biological vision. In this regard, covariance properties with respect to geometric image transformations in the earliest layers of the visual hierarchy are essential for expressing robust image operations, and for formulating invariant visual operations at higher levels. This paper defines and proves a set of joint covariance properties for spatio-temporal receptive fields, expressed in terms of spatio-temporal derivative operators applied to spatio-temporally smoothed image data, under compositions of spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations. Specifically, the derived relations show how the parameters of the receptive fields need to be transformed in order to match the output from spatio-temporal receptive fields under composed spatio-temporal image transformations. For this purpose, we also fundamentally extend the notion of scale-normalized derivatives to affine-normalized derivatives, which are computed based on spatial smoothing with affine Gaussian kernels, and analyze the covariance properties of the resulting affine-normalized derivatives for the affine group as well as for important subgroups thereof. We conclude with a geometric analysis showing how the derived joint covariance properties make it possible to relate or match spatio-temporal receptive field responses when observing possibly moving local surface patches from different views, under locally linearized perspective or projective transformations, as well as when observing different instances of similar spatio-temporal events that may occur faster or slower from one view to another.
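To make the idea of transforming receptive field parameters concrete, the following is a minimal numerical sketch and not the paper's own formulation. It assumes a composed transformation of the form x' = S_x (A x + u t), t' = S_t t, and the parameter mapping that follows from it by direct substitution: spatial covariance Σ' = S_x² A Σ Aᵀ, velocity v' = (S_x / S_t)(A v + u), and temporal variance τ' = S_t² τ. The function names `affine_gaussian` and `transform_parameters`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def affine_gaussian(x, cov):
    """Evaluate a 2-D affine Gaussian kernel g(x; cov) at the point x."""
    x = np.asarray(x, dtype=float)
    cov = np.asarray(cov, dtype=float)
    quad = x @ np.linalg.solve(cov, x)            # x^T cov^{-1} x
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def transform_parameters(cov, v, tau, A, u, S_x, S_t):
    """Map receptive field parameters under the ASSUMED composed
    transformation x' = S_x (A x + u t), t' = S_t t."""
    cov_new = S_x**2 * (A @ cov @ A.T)            # spatial covariance matrix
    v_new = (S_x / S_t) * (A @ v + u)             # image velocity
    tau_new = S_t**2 * tau                        # temporal variance
    return cov_new, v_new, tau_new

# Illustrative (hypothetical) parameter values.
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
v = np.array([0.3, -0.2])
tau = 1.5
A = np.array([[1.2, 0.4], [-0.1, 0.9]])
u = np.array([0.5, 0.1])
S_x, S_t = 2.0, 0.5

cov_new, v_new, tau_new = transform_parameters(cov, v, tau, A, u, S_x, S_t)

# Check 1: the spatial kernel is covariant. Evaluating the transformed
# kernel at the transformed point reproduces the original kernel value,
# up to the Jacobian factor |det(S_x A)|.
x = np.array([0.7, -1.1])
B = S_x * A
lhs = affine_gaussian(B @ x, cov_new) * abs(np.linalg.det(B))
rhs = affine_gaussian(x, cov)
kernel_matches = bool(np.isclose(lhs, rhs))

# Check 2: a point moving with velocity v in the original domain moves
# with velocity v_new in the transformed domain.
t = 0.8
x_new = S_x * (A @ (v * t) + u * t)
t_new = S_t * t
velocity_matches = bool(np.allclose(x_new, v_new * t_new))

print(kernel_matches, velocity_matches)
```

Both checks test only the parameter bookkeeping under the stated assumption. They do not compute receptive field responses or the affine-normalized derivatives themselves.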