Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection

📅 2025-11-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of cross-modal semantic misalignment and high computational overhead in small-object detection, this paper proposes a semantic-guided lightweight vision-language fusion framework. Methodologically, we design a novel semantic alignment mechanism that tightly couples a BERT-based text encoder with a PRB-FPN-Net visual backbone, integrating ELAN, MSP, and CSP architectural components; fine-grained cross-modal feature alignment is achieved via lemmatization and end-to-end fine-tuning. Our contribution lies in achieving both high accuracy and parameter efficiency: the framework attains 52.6% AP on COCO2017—substantially outperforming YOLO-World and GLIP—while requiring only half the parameters of typical Transformer-based models. Extensive experiments across multiple backbones demonstrate its efficiency, robustness, and scalability, establishing a new paradigm for open-vocabulary small-object detection under resource-constrained settings.

📝 Abstract
This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation on the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains 52.6% average precision (AP), significantly outperforming YOLO-World while consuming half the parameters of Transformer-based models such as GLIP. Tests across different backbones, including ELAN, MSP, and CSP, further confirm efficient handling of multi-scale objects, ensuring scalability and robustness in resource-constrained environments. This study underscores the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and adaptability to real-world challenges.
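The lemmatization-then-alignment idea in the abstract can be sketched minimally as follows. This is only an illustrative stand-in: the toy suffix rules replace a real lemmatizer, and the random vectors replace the paper's BERT text embeddings and PRB-FPN-Net region features.

```python
import numpy as np

def lemmatize(token: str) -> str:
    """Toy suffix-stripping lemmatizer. The paper normalizes textual class
    prompts via lemmatization before text encoding; these rules are
    illustrative only, not the actual preprocessing pipeline."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, a standard score for text-visual alignment."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: one text cue vector and four visual region features.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=8)          # stand-in for a BERT class embedding
regions = rng.normal(size=(4, 8))       # stand-in for backbone region features

# Score each region against the text cue; the best-matching region "grounds"
# the textual class in the image.
scores = [cosine(r, text_feat) for r in regions]
best = int(np.argmax(scores))
```

The same scoring generalizes to many classes by stacking text embeddings and taking a softmax over similarities per region.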
Problem

Research questions and friction points this paper is trying to address.

Improving tiny object detection through cross-modal fusion
Aligning semantic language cues with visual features
Optimizing detection for small objects in resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates BERT with CNN-based PRB-FPN-Net architecture
Aligns semantic cues from text with visual features
Uses ELAN, MSP, CSP backbones for multi-scale efficiency
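Conceptually, the semantic guidance listed above amounts to letting a text embedding modulate the visual feature channels. The gated fusion below is a common pattern in vision-language detectors and is a hedged sketch only, not the paper's exact mechanism; the shapes and the projection matrix `W` are hypothetical.

```python
import numpy as np

def semantic_gate(visual: np.ndarray, text_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Channel-wise semantic gating (illustrative fusion pattern, not the
    paper's exact design).

    visual:   (C, H, W) feature map from a visual backbone
    text_emb: (D,) text embedding, e.g. from a BERT-style encoder
    W:        (C, D) learned projection from text space to channel gates
    """
    gate = 1.0 / (1.0 + np.exp(-(W @ text_emb)))   # sigmoid gate in (0, 1), shape (C,)
    return visual * gate[:, None, None]             # scale each channel by its gate

# Hypothetical shapes for demonstration.
rng = np.random.default_rng(1)
vis = rng.normal(size=(16, 8, 8))
txt = rng.normal(size=(32,))
W = rng.normal(size=(16, 32)) * 0.1
fused = semantic_gate(vis, txt, W)
```

Because the gate lies in (0, 1), channels poorly matched to the text cue are attenuated while relevant ones pass through, which is one way text can steer multi-scale visual features toward small target objects.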
Xian-Hong Huang
Department of Electrical Engineering, National Formosa University, Taiwan
Hui-Kai Su
Department of Electrical Engineering, National Formosa University, Taiwan
Chi-Chia Sun
Professor, National Taipei University
FPGA · VLSI · Image Processing · Machine Learning
Jun-Wei Hsieh
National Yang Ming Chiao Tung University
Computer Vision · AI · Image Processing