FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

📅 2025-09-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address over-detection in railway defect detection caused by visual similarity between defects and normal structural elements in single-image modalities (e.g., YOLO), this paper proposes a domain-rule-guided vision-audio multimodal fusion method. The approach integrates YOLOv8n for efficient defect localization with a Vision Transformer (ViT) for deep semantic feature extraction, fusing multi-scale visual representations from ViT layers 7, 16, and 19 with synthesised audio signals to enable complementary cross-modal modeling. Evaluated on a real-world railway dataset, the proposed architecture achieves a 0.2-point improvement in precision and overall accuracy over the vision-only baseline; a Student's unpaired t-test confirms the statistical significance of the difference in mean accuracy (p < 0.05). The work offers an interpretable and robust approach to industrial defect detection under low-contrast, high-confusion conditions.
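A minimal PyTorch sketch of such a late-fusion head, assuming mean-pooled token features from three ViT layers and a fixed-size audio embedding; the class name, dimensions, and pooling scheme are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative vision-audio fusion head (not the paper's exact architecture).

    Pooled ViT features from three intermediate layers are concatenated with an
    audio embedding and classified into defect classes (e.g. Rupture, Surface).
    """
    def __init__(self, vit_dim=768, audio_dim=128, num_classes=2):
        super().__init__()
        fused_dim = 3 * vit_dim + audio_dim  # features from layers 7, 16, 19 + audio
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, feats_l7, feats_l16, feats_l19, audio_emb):
        # Mean-pool each token sequence: (B, tokens, dim) -> (B, dim)
        pooled = [f.mean(dim=1) for f in (feats_l7, feats_l16, feats_l19)]
        fused = torch.cat(pooled + [audio_emb], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real features
head = FusionHead()
f7 = torch.randn(4, 197, 768)   # ViT layer-7 tokens
f16 = torch.randn(4, 197, 768)  # ViT layer-16 tokens
f19 = torch.randn(4, 197, 768)  # ViT layer-19 tokens
audio = torch.randn(4, 128)     # synthesised audio embedding
logits = head(f7, f16, f19, audio)
print(logits.shape)  # torch.Size([4, 2])
```

In this reading, YOLOv8n supplies the candidate defect regions and the fusion head refines their classification; how the detections and fused features are combined downstream is not specified here.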

📝 Abstract
Multimodal fusion is a multimedia technique that has become popular in a wide range of tasks where image information is accompanied by a signal or audio stream. The latter need not convey highly semantic information such as speech or music; it may instead consist of measurements, such as audio recorded by microphones with the goal of detecting rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited: they yield over-detection when defects appear visually similar to normal structural elements. This paper proposes a new multimodal fusion architecture built on domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT), combining feature maps extracted from multiple ViT layers (7, 16, and 19) with synthesised audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between the audio and image modalities. Experimental evaluation on a real-world railway dataset demonstrates that the multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach, and Student's unpaired t-test confirms the statistical significance of the difference in mean accuracy.
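The abstract's significance claim rests on a Student's unpaired t-test over accuracy scores. A minimal SciPy sketch of that check, with hypothetical per-run accuracies (the values below are made up for illustration, not the paper's data):

```python
from scipy import stats

# Hypothetical per-run accuracies (illustrative values only)
vision_only = [0.921, 0.918, 0.923, 0.919, 0.920]
multimodal  = [0.923, 0.921, 0.924, 0.922, 0.921]

# Student's unpaired (independent two-sample) t-test, equal variances assumed
t_stat, p_value = stats.ttest_ind(multimodal, vision_only)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference in mean accuracy is statistically significant at α = 0.05")
```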
Problem

Research questions and friction points this paper is trying to address.

Multimodal fusion for railway defect detection
Overcoming single-modality limitations in defect identification
Integrating audio and image data for improved accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion with YOLOv8n and Vision Transformer backbones
Integrates image feature maps and synthesized audio representations
Fusion between the audio and image modalities improves detection accuracy (see the sketch below)
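The innovation list mentions synthesised audio representations for the two defect classes. A minimal sketch of how such a representation might be produced, assuming a simple per-class signal model converted to a log-mel embedding with librosa; the signal shapes and parameters are illustrative assumptions, not the authors' synthesis procedure:

```python
import numpy as np
import librosa

SR = 16000  # sample rate (assumption)

def synth_defect_audio(defect, duration=0.5, sr=SR):
    """Illustrative synthetic audio signature for a defect class."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    if defect == "rupture":
        # Sharp broadband impact: exponentially decaying noise burst
        y = np.random.randn(t.size) * np.exp(-t * 30.0)
    else:  # "surface"
        # Periodic low-frequency rumble with mild noise
        y = 0.6 * np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)
    return y.astype(np.float32)

def log_mel_embedding(y, sr=SR, n_mels=64):
    """Log-mel spectrogram, mean-pooled over time into a fixed-size vector."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_S = librosa.power_to_db(S)
    return log_S.mean(axis=1)  # shape: (n_mels,)

emb = log_mel_embedding(synth_defect_audio("rupture"))
print(emb.shape)  # (64,)
```

An embedding of this kind is what the fusion head sketched earlier would consume as its audio input.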
Alexey Zhukov
University Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI
Jenny Benois-Pineau
Professor of Computer Science, Université Bordeaux
pattern recognition, artificial intelligence, machine learning, motion estimation, multimedia
Amira Youssef
SNCF RESEAU, Directions Techniques Réseau, DGII DTR IP3M, DM Matrice
Akka Zemmari
University Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI
Mohamed Mosbah
Professor of Computer Science, Bordeaux INP
formal methods, distributed algorithms, security, mobility and intelligent transport
Virginie Taillandier
SNCF, DIR TECHNOLOGIES INNOVATION ET PROJETS GROUPE, IR - DPISF TECH4RAIL - TLI