🤖 AI Summary
Existing machine learning approaches typically model action understanding and embodied execution in isolation, neglecting their intrinsic coupling. To address this, we draw inspiration from the mirror neuron system and propose a unified representation learning framework. Our method employs two linear projection layers to map intermediate representations of observation and execution into a shared latent space, and introduces a contrastive learning objective that maximizes their mutual information, thereby explicitly modeling alignment at the representation level. Crucially, this framework requires no explicit alignment supervision yet enables bidirectional, mutually beneficial task enhancement. Experiments demonstrate substantial improvements in representation discriminability and cross-task generalization, with consistent performance gains across diverse tasks, including action recognition, imitation learning, and embodied control, validating the approach as a unified foundation for perception–action integration.
📝 Abstract
Mirror neurons are a class of neurons that activate both when an individual observes an action and when they perform the same action. This mechanism reveals a fundamental interplay between action understanding and embodied execution, suggesting that the two abilities are inherently connected. Existing machine learning methods, however, largely overlook this interplay and treat the two abilities as separate tasks. In this study, we provide a unified perspective on modeling them through the lens of representation learning. We first observe that their intermediate representations spontaneously align. Inspired by mirror neurons, we then introduce an approach that explicitly aligns the representations of observed and executed actions. Specifically, we employ two linear layers to map the representations into a shared latent space, where contrastive learning enforces the alignment of corresponding pairs, effectively maximizing their mutual information. Experiments demonstrate that this simple approach fosters mutual synergy between the two tasks, improving representation quality and generalization.
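The alignment mechanism described in the abstract, linear projections into a shared latent space followed by a contrastive objective that maximizes mutual information, can be sketched as a symmetric InfoNCE-style loss. The function names, dimensions, and temperature below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def project(features, weights):
    """Map backbone features into the shared latent space with a linear layer."""
    return features @ weights

def info_nce(z_obs, z_exec, temperature=0.1):
    """Contrastive loss pulling matching (observation, execution) pairs together.

    Row i of z_obs and row i of z_exec form a positive pair; all other
    rows in the batch serve as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z_obs = z_obs / np.linalg.norm(z_obs, axis=1, keepdims=True)
    z_exec = z_exec / np.linalg.norm(z_exec, axis=1, keepdims=True)
    logits = (z_obs @ z_exec.T) / temperature      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximizing the diagonal log-probabilities lower-bounds the mutual
    # information between the two views (the InfoNCE bound).
    return -np.mean(np.diag(log_prob))

# Illustrative usage with random backbone features and projection heads
rng = np.random.default_rng(0)
obs_feat, exec_feat = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
w_obs, w_exec = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
loss = info_nce(project(obs_feat, w_obs), project(exec_feat, w_exec))
```

In actual training the loss would be applied symmetrically (observation-to-execution and execution-to-observation) and backpropagated through both encoders; the NumPy version above only illustrates the forward computation.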