OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of modality discrepancy and domain shift in visual–tactile learning, which arise from non-standardized sensors and inconsistent data acquisition. To tackle these issues, we introduce a novel task—single-domain generalization for multimodal visual–tactile learning (SDG-VTL)—and propose the OmniVaT framework. OmniVaT maps visual and tactile embeddings into a unified embedding–frequency space and leverages a multimodal fractional Fourier adapter (MFFA) together with a discrete tree generation (DTG) module to construct hierarchical, diverse fractional-order representations. This approach achieves effective modality alignment and generalization to unseen domains without requiring multi-domain training or complex cross-modal fusion. Experiments demonstrate that OmniVaT significantly outperforms existing methods on the SDG-VTL task, marking the first successful realization of cross-domain multimodal learning under a single-domain training setting.

📝 Abstract
Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
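The core idea behind the MFFA—mapping embeddings into a unified embedding–frequency space via a fractional Fourier transform—can be illustrated with a small numerical sketch. This is not the paper's implementation; the function name `frft_matrix`, the order parameter `alpha`, and the eigendecomposition construction are illustrative assumptions (one common way to define a discrete fractional Fourier transform, whose fractional power is not unique due to the DFT's degenerate eigenvalues):

```python
import numpy as np

def frft_matrix(n: int, alpha: float) -> np.ndarray:
    """Discrete fractional Fourier transform matrix of order `alpha`,
    built by raising the unitary DFT matrix to a fractional power via
    eigendecomposition. alpha=0 is the identity; alpha=1 is the DFT;
    intermediate orders interpolate between the embedding ("time")
    domain and the frequency domain."""
    F = np.fft.fft(np.eye(n), norm="ortho")   # unitary DFT matrix
    w, V = np.linalg.eig(F)                   # F = V diag(w) V^{-1}
    return V @ np.diag(w ** alpha) @ np.linalg.inv(V)

# Map a toy "embedding" vector into fractional embedding-frequency space.
rng = np.random.default_rng(0)
emb = rng.standard_normal(8)
half_freq = frft_matrix(8, 0.5) @ emb   # halfway between the two domains
full_freq = frft_matrix(8, 1.0) @ emb   # order 1 recovers the ordinary DFT
assert np.allclose(full_freq, np.fft.fft(emb, norm="ortho"), atol=1e-6)
```

Sweeping `alpha` over a set of orders (as a hierarchical tree structure would generate in the DTG module) yields a family of such fractional representations of the same embedding; the transforms compose additively in the order parameter, so applying the order-0.5 transform twice matches the order-1 transform.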
Problem

Research questions and friction points this paper is trying to address.

visual-tactile learning
modality discrepancy
domain generalization
single domain
multimodal
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal visual-tactile learning
single domain generalization
fractional Fourier adapter
discrete tree generation
cross-domain generalization
Liuxiang Qiu — Fujian Key Laboratory for Intelligent Processing and Wireless Transmission of Media Information, College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
Hui Da — College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China
Yuzhen Niu — Fuzhou University (Computer Graphics, Computer Vision, Multimedia, and Human-Computer Interaction)
Tiesong Zhao — Dept. Communication Engineering, Fuzhou University (Multimedia Communication, Video Coding, Image Quality Assessment, Haptics)
Yang Cao — University of Science and Technology of China (computer vision, image processing)
Zheng-Jun Zha — MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei 230026, China