Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of small-scale defects on power transmission lines, where complex backgrounds and illumination variations prevent RGB-only methods from separating geometrically subtle defects from visually similar background structures. The authors propose CMAFNet, a Cross-Modal Alignment and Fusion Network built on a “purify-then-fuse” paradigm. Its Semantic Recomposition Module purifies features against a learned codebook to suppress modality-specific noise, and position-wise normalization within that stage aligns the RGB and depth modalities before fusion. A Contextual Semantic Integration Framework then applies partial-channel attention to capture global spatial dependencies. On the TLRGBD benchmark, CMAFNet attains 32.2% mAP@50 and 12.5% APs, exceeding the strongest baseline by 9.8 and 4.0 percentage points, respectively. A separate lightweight variant (4.9M parameters, 228 FPS) reaches 24.8% mAP@50, surpassing all YOLO-based detectors and matching Transformer-based methods at substantially lower computational cost.

📝 Abstract

Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
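
The purify-then-fuse idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the layer design, so everything below is an assumption made for illustration, including the function names (`position_wise_normalize`, `purify_with_codebook`, `purify_then_fuse`), the use of one codebook shared by both modalities, softmax similarity weights, and fusion by element-wise addition. Feature maps are flattened to a list of per-position channel vectors, and the codebook is random here, whereas in the paper it is learned.

```python
import math
import random


def position_wise_normalize(feat, eps=1e-6):
    """Give each position's channel vector zero mean and unit variance."""
    out = []
    for vec in feat:
        mean = sum(vec) / len(vec)
        var = sum((v - mean) ** 2 for v in vec) / len(vec)
        out.append([(v - mean) / (math.sqrt(var) + eps) for v in vec])
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def purify_with_codebook(feat, codebook):
    """Rebuild each position as a softmax-weighted mix of codebook atoms.

    Assumption: anything the atoms cannot express is treated as
    modality-specific noise and dropped by the reconstruction.
    """
    purified = []
    for vec in feat:
        scores = [sum(v * a for v, a in zip(vec, atom)) for atom in codebook]
        weights = softmax(scores)
        purified.append([
            sum(w * atom[c] for w, atom in zip(weights, codebook))
            for c in range(len(vec))
        ])
    return purified


def purify_then_fuse(rgb, depth, codebook):
    """Normalize and purify each modality, then fuse by addition (assumed)."""
    rgb_p = purify_with_codebook(position_wise_normalize(rgb), codebook)
    depth_p = purify_with_codebook(position_wise_normalize(depth), codebook)
    return [[r + d for r, d in zip(rv, dv)] for rv, dv in zip(rgb_p, depth_p)]


# Hypothetical toy inputs: N positions, C channels, K codebook atoms.
random.seed(0)
N, C, K = 16, 8, 4
rgb = [[random.gauss(3.0, 5.0) for _ in range(C)] for _ in range(N)]
depth = [[random.gauss(-1.0, 0.2) for _ in range(C)] for _ in range(N)]
codebook = [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(K)]

fused = purify_then_fuse(rgb, depth, codebook)
print(len(fused), len(fused[0]))
```

The toy RGB and depth features are drawn with very different means and scales on purpose. After position-wise normalization both have matching per-position statistics, and after reconstruction both lie in the span of the same atoms, which is one plausible reading of the "statistical compatibility" the abstract requires before fusion.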

Problem

Research questions and friction points this paper is trying to address.

RGB-D

defect detection

cross-modal alignment

small object detection

UAV inspection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Alignment

Feature Purification

Partial-Channel Attention