Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection

📅 2026-02-02
🤖 AI Summary
This work addresses the detection of small-scale defects on power transmission lines, where complex backgrounds and illumination variations prevent RGB-only detectors from separating geometrically subtle defects from visually similar background structures. The authors propose CMAFNet, a Cross-Modal Alignment and Fusion Network built on a “purify-then-fuse” paradigm that aligns RGB and depth features before fusing them. A Semantic Recomposition Module purifies features with a learned codebook to suppress modality-specific noise, and position-wise normalization within this stage enforces cross-modal alignment. A Contextual Semantic Integration Framework then captures global spatial dependencies using partial-channel attention. On the TLRGBD benchmark, CMAFNet attains 32.2% mAP@50 and 12.5% APs, exceeding the strongest baseline by 9.8 and 4.0 percentage points, respectively. A separate lightweight variant (4.9M parameters, 228 FPS) reaches 24.8% mAP@50, outperforming all YOLO-based detectors and matching Transformer-based methods at substantially lower computational cost.

📝 Abstract
Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
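
The purify-then-fuse idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`normalize_position`, `purify`, `fuse`), the soft-assignment reconstruction, the codebook shared across modalities, and fusion by element-wise addition are all assumptions made for illustration, and the codebook here is random rather than learned. The sketch only shows how position-wise normalization can put RGB and depth features on a comparable scale before each position is re-expressed as a mixture of codebook atoms.

```python
import math
import random


def normalize_position(vec, eps=1e-5):
    """Normalize one position's channel vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def purify(features, codebook):
    """Reconstruct each position as a soft mixture of codebook atoms.

    features: N positions, each a list of C channel values.
    codebook: K atoms, each a list of C values (random here; learned in practice).
    Anything that the atoms cannot express is discarded, which is the
    assumed "purification" effect in this sketch.
    """
    channels = len(codebook[0])
    scale = math.sqrt(channels)
    purified = []
    for vec in features:
        z = normalize_position(vec)
        scores = [sum(a * b for a, b in zip(z, atom)) / scale for atom in codebook]
        weights = softmax(scores)
        purified.append(
            [sum(w * atom[c] for w, atom in zip(weights, codebook)) for c in range(channels)]
        )
    return purified


def fuse(rgb, depth, codebook):
    """Purify both modalities, then fuse by element-wise addition (an assumption)."""
    rgb_p = purify(rgb, codebook)
    depth_p = purify(depth, codebook)
    return [[r + d for r, d in zip(rv, dv)] for rv, dv in zip(rgb_p, depth_p)]


random.seed(0)
C, K, N = 8, 4, 6  # channels, codebook atoms, spatial positions (toy sizes)
codebook = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
rgb = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
# Depth features deliberately live on a very different scale and offset.
depth = [[random.gauss(5, 20) for _ in range(C)] for _ in range(N)]
fused = fuse(rgb, depth, codebook)
```

Because every purified vector is a convex combination of the same atoms, both modalities end up in the same bounded range regardless of their raw statistics, which is one plausible reading of "statistical compatibility prior to fusion".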
Problem

Research questions and friction points this paper is trying to address.

RGB-D
defect detection
cross-modal alignment
small object detection
UAV inspection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Alignment
Feature Purification
Partial-Channel Attention
RGB-D Fusion
Small Object Detection
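
As a rough illustration of the partial-channel attention idea named above, the sketch below applies self-attention across spatial positions to only a leading slice of the channels and passes the remaining channels through unchanged. The function name `partial_channel_attention`, the split point, and the use of the features themselves as queries, keys, and values (no learned projections) are assumptions for illustration; the paper's actual module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def partial_channel_attention(features, attended_channels):
    """Self-attention over positions, restricted to the first channels.

    features: N positions, each a list of C channel values.
    attended_channels: number of leading channels that take part in attention;
    the remaining C - attended_channels channels are copied through untouched,
    so attention cost depends on the attended slice rather than the full width.
    """
    head = [vec[:attended_channels] for vec in features]
    tail = [vec[attended_channels:] for vec in features]
    scale = math.sqrt(attended_channels)
    output = []
    for query, passthrough in zip(head, tail):
        # Scaled dot-product scores between this position and every position.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in head]
        weights = softmax(scores)
        context = [
            sum(w * value[c] for w, value in zip(weights, head))
            for c in range(attended_channels)
        ]
        output.append(context + passthrough)
    return output


# Toy input: 3 positions, 4 channels; only the first 2 channels are attended.
features = [
    [1.0, 0.0, 5.0, 6.0],
    [0.0, 1.0, 7.0, 8.0],
    [1.0, 1.0, 9.0, 10.0],
]
out = partial_channel_attention(features, attended_channels=2)
```

In `out`, the last two channels of every position are identical to the input, while the first two are context-weighted averages over all positions.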
Authors

Jiaming Cui
Assistant Professor, Virginia Tech
machine learning, AI for healthcare, scientific modeling, public health
Shuai Zhou
School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China
Wenqiang Li
The Ohio State University
5G security, embedded systems, vulnerability discovery, AI
Ruifeng Qin
School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China
Feng Shen
School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China