Causal-HM: Restoring Physical Generative Logic in Multimodal Anomaly Detection via Hierarchical Modulation

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised multimodal anomaly detection methods neglect the physical causal logic running from welding process to outcome, treating process modalities (e.g., video, audio, sensor signals) and outcome modalities (e.g., post-weld images) as equivalent feature sources, and they struggle to fuse heterogeneous high-dimensional visual data with low-dimensional sensor signals. Method: This paper proposes Causal-HM, a sensor-guided causal hierarchical modulation framework tailored for robotic welding, which modulates audio-visual features via sensor signals to establish a unidirectional generative mapping ("process → outcome") and enforces multimodal physical consistency constraints. Contribution/Results: Causal-HM explicitly encodes physical causality, overcoming both the causal-blindness and modality-heterogeneity bottlenecks. Evaluated on Weld-4M, a newly constructed four-modality welding benchmark, the method achieves an image-level AUROC (I-AUROC) of 90.7%, surpassing state-of-the-art approaches.

📝 Abstract
Multimodal Unsupervised Anomaly Detection (UAD) is critical for quality assurance in smart manufacturing, particularly in complex processes like robotic welding. However, existing methods often suffer from causal blindness, treating process modalities (e.g., real-time video, audio, and sensors) and result modalities (e.g., post-weld images) as equal feature sources, thereby ignoring the inherent physical generative logic. Furthermore, the heterogeneity gap between high-dimensional visual data and low-dimensional sensor signals frequently leads to critical process context being drowned out. In this paper, we propose Causal-HM, a unified multimodal UAD framework that explicitly models the physical "process → result" dependency. Specifically, our framework incorporates two key innovations: a Sensor-Guided CHM Modulation mechanism that uses low-dimensional sensor signals as context to guide high-dimensional audio-visual feature extraction, and a Causal-Hierarchical Architecture that enforces a unidirectional generative mapping to identify anomalies that violate physical consistency. Extensive experiments on our newly constructed Weld-4M benchmark across four modalities demonstrate that Causal-HM achieves a state-of-the-art (SOTA) I-AUROC of 90.7%. Code will be released after the paper is accepted.
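The paper's code is not yet released, so as a rough illustration of what "low-dimensional sensor signals as context to guide high-dimensional audio-visual feature extraction" could look like, here is a minimal FiLM-style conditioning sketch in NumPy. The class name `SensorGuidedModulation`, the layer shapes, and the linear parameterization are all assumptions for illustration, not the paper's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

class SensorGuidedModulation:
    """FiLM-style conditioning sketch: low-dimensional sensor signals
    produce a per-channel scale (gamma) and shift (beta) that modulate
    high-dimensional audio-visual features. Hypothetical illustration;
    sizes and the linear form are not taken from the paper."""

    def __init__(self, sensor_dim, feat_dim):
        # Linear maps from sensor context to modulation parameters.
        self.W_gamma = rng.normal(0.0, 0.1, (sensor_dim, feat_dim))
        self.W_beta = rng.normal(0.0, 0.1, (sensor_dim, feat_dim))

    def __call__(self, av_feats, sensor):
        # av_feats: (batch, feat_dim) fused audio-visual features
        # sensor:   (batch, sensor_dim), e.g. current/voltage/wire-feed
        gamma = 1.0 + sensor @ self.W_gamma  # scale, centred at identity
        beta = sensor @ self.W_beta          # shift
        return gamma * av_feats + beta

mod = SensorGuidedModulation(sensor_dim=4, feat_dim=8)
feats = rng.normal(size=(2, 8))
sensor = rng.normal(size=(2, 4))
out = mod(feats, sensor)
print(out.shape)  # (2, 8)
```

Centring the scale at identity means a zero sensor context leaves the audio-visual features untouched, so the sensor stream steers rather than overwrites the visual representation, which is one plausible way to keep a low-dimensional signal from being drowned out.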
Problem

Research questions and friction points this paper is trying to address.

Addresses causal blindness in multimodal anomaly detection
Bridges heterogeneity gap between visual and sensor data
Models physical process-to-result dependency for anomaly identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensor-guided modulation for cross-modal feature extraction
Causal-hierarchical architecture enforcing unidirectional generative mapping
Explicit modeling of physical process-to-result dependency
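To make the unidirectional "process → result" idea concrete, the sketch below fits a map from process features to result features on normal data only and scores a sample by how far its observed result deviates from the generated one. This is a deliberately simplified least-squares stand-in for the paper's learned hierarchical generator; all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_process_to_result(P, R):
    # Least-squares map M with R ~= P @ M, fitted on normal welds only.
    # (Stand-in for the paper's learned generative mapping.)
    M, *_ = np.linalg.lstsq(P, R, rcond=None)
    return M

def anomaly_score(M, p, r):
    # Physical-consistency violation: distance between the observed
    # result features r and those generated from the process features p.
    return np.linalg.norm(r - p @ M)

# Synthetic normal training data: results produced by a fixed
# "physics" map plus small noise.
M_true = rng.normal(size=(6, 3))
P_train = rng.normal(size=(200, 6))
R_train = P_train @ M_true + 0.01 * rng.normal(size=(200, 3))
M = fit_process_to_result(P_train, R_train)

p = rng.normal(size=(6,))
normal_score = anomaly_score(M, p, p @ M_true)         # consistent result
anomalous_score = anomaly_score(M, p, p @ M_true + 2)  # violates the mapping
print(normal_score < anomalous_score)  # True
```

The point of the one-way mapping is that anomalies are defined relative to what the process should physically produce, rather than by symmetric feature fusion, so a result that no normal process state could generate scores high even if it looks plausible in isolation.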
Xiao Liu
Chongqing University
Junchen Jin
Chongqing University
Yanjie Zhao
Huazhong University of Science and Technology
Software Engineering · Software Security
Zhixuan Xing
Chongqing University