FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

📅 2025-09-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address over-detection in railway defect detection caused by visual similarity between defects and normal structural elements in single-image modalities (e.g., YOLO), this paper proposes a domain-rule-guided vision-audio multimodal fusion method. The approach integrates YOLOv8n for efficient defect localization with a Vision Transformer (ViT) for deep semantic feature extraction, fusing multi-scale visual representations from ViT layers 7, 16, and 19 with synthesised audio signals to enable complementary cross-modal modeling. Evaluated on a real-world railway dataset, the proposed architecture achieves a 0.2-point improvement in precision and overall accuracy over the vision-only baseline; a Student's unpaired t-test confirms the statistical significance of the difference in mean accuracy (p < 0.05). The work offers an interpretable and robust approach to industrial defect detection under low-contrast, high-confusion conditions.
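A minimal PyTorch sketch of such a late-fusion head, assuming mean-pooled token features from three ViT layers and a fixed-size audio embedding; the class name, dimensions, and pooling scheme are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative vision-audio fusion head (not the paper's exact architecture).

    Pooled ViT features from three intermediate layers are concatenated with an
    audio embedding and classified into defect classes (e.g. Rupture, Surface).
    """
    def __init__(self, vit_dim=768, audio_dim=128, num_classes=2):
        super().__init__()
        fused_dim = 3 * vit_dim + audio_dim  # features from layers 7, 16, 19 + audio
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, feats_l7, feats_l16, feats_l19, audio_emb):
        # Mean-pool each token sequence: (B, tokens, dim) -> (B, dim)
        pooled = [f.mean(dim=1) for f in (feats_l7, feats_l16, feats_l19)]
        fused = torch.cat(pooled + [audio_emb], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real features
head = FusionHead()
f7 = torch.randn(4, 197, 768)   # ViT layer-7 tokens
f16 = torch.randn(4, 197, 768)  # ViT layer-16 tokens
f19 = torch.randn(4, 197, 768)  # ViT layer-19 tokens
audio = torch.randn(4, 128)     # synthesised audio embedding
logits = head(f7, f16, f19, audio)
print(logits.shape)  # torch.Size([4, 2])
```

In this reading, YOLOv8n supplies the candidate defect regions and the fusion head refines their classification; how the detections and fused features are combined downstream is not specified here.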

📝 Abstract
Multimodal fusion is a multimedia technique that has become popular in a wide range of tasks where image information is accompanied by a signal or audio stream. The latter need not convey highly semantic information such as speech or music; it may instead consist of measurements, such as audio recorded by microphones with the goal of detecting rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited: they yield over-detection when defects appear visually similar to normal structural elements. This paper proposes a new multimodal fusion architecture built on domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT), combining feature maps extracted from multiple ViT layers (7, 16, and 19) with synthesised audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between the audio and image modalities. Experimental evaluation on a real-world railway dataset demonstrates that the multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach, and Student's unpaired t-test confirms the statistical significance of the difference in mean accuracy.
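The abstract's significance claim rests on a Student's unpaired t-test over accuracy scores. A minimal SciPy sketch of that check, with hypothetical per-run accuracies (the values below are made up for illustration, not the paper's data):

```python
from scipy import stats

# Hypothetical per-run accuracies (illustrative values only)
vision_only = [0.921, 0.918, 0.923, 0.919, 0.920]
multimodal  = [0.923, 0.921, 0.924, 0.922, 0.921]

# Student's unpaired (independent two-sample) t-test, equal variances assumed
t_stat, p_value = stats.ttest_ind(multimodal, vision_only)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference in mean accuracy is statistically significant at α = 0.05")
```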
Problem

Research questions and friction points this paper is trying to address.

Multimodal fusion for railway defect detection
Overcoming single-modality limitations in defect identification
Integrating audio and image data for improved accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion with YOLOv8n and Vision Transformer backbones
Integrates image feature maps and synthesized audio representations
Fusion between the audio and image modalities improves detection accuracy (see the sketch below)
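The innovation list mentions synthesised audio representations for the two defect classes. A minimal sketch of how such a representation might be produced, assuming a simple per-class signal model converted to a log-mel embedding with librosa; the signal shapes and parameters are illustrative assumptions, not the authors' synthesis procedure:

```python
import numpy as np
import librosa

SR = 16000  # sample rate (assumption)

def synth_defect_audio(defect, duration=0.5, sr=SR):
    """Illustrative synthetic audio signature for a defect class."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    if defect == "rupture":
        # Sharp broadband impact: exponentially decaying noise burst
        y = np.random.randn(t.size) * np.exp(-t * 30.0)
    else:  # "surface"
        # Periodic low-frequency rumble with mild noise
        y = 0.6 * np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)
    return y.astype(np.float32)

def log_mel_embedding(y, sr=SR, n_mels=64):
    """Log-mel spectrogram, mean-pooled over time into a fixed-size vector."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_S = librosa.power_to_db(S)
    return log_S.mean(axis=1)  # shape: (n_mels,)

emb = log_mel_embedding(synth_defect_audio("rupture"))
print(emb.shape)  # (64,)
```

An embedding of this kind is what the fusion head sketched earlier would consume as its audio input.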
Alexey Zhukov
University Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI
Jenny Benois-Pineau
Professor of Computer Science, Université Bordeaux
pattern recognition, artificial intelligence, machine learning, motion estimation, multimedia
Amira Youssef
SNCF RESEAU, Directions Techniques Réseau, DGII DTR IP3M, DM Matrice
Akka Zemmari
University Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI
Mohamed Mosbah
Professor of Computer Science, Bordeaux INP
formal methods, distributed algorithms, security, mobility and intelligent transport
Virginie Taillandier
SNCF, DIR TECHNOLOGIES INNOVATION ET PROJETS GROUPE, IR - DPISF TECH4RAIL - TLI