YOLOA: Real-Time Affordance Detection via LLM Adapter

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing affordance detection methods typically model "how to use" (how) in isolation, neglecting joint reasoning about "what the object is" (what) and "where it is located" (where), and lack synergy between detection and affordance learning. To address this, we propose YOLOA, a real-time perception framework that, for the first time, integrates a large language model (LLM) as a lightweight semantic adapter. Its dual-branch architecture enables end-to-end joint optimization of object detection and affordance prediction: the LLM dynamically generates class priors, bounding-box offsets, and affordance gating signals to provide cross-task semantic guidance and iterative refinement. On ADG-Det and IIT-Heat, YOLOA achieves 52.8 / 73.1 mAP with inference speeds of up to 89.77 FPS (up to 846.24 FPS for the lightweight variant), substantially outperforming state-of-the-art methods in both accuracy and efficiency.

📝 Abstract
Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with preliminary object and affordance predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.
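
The abstract describes the adapter's role only at a high level. Below is a minimal PyTorch sketch of one plausible reading of that refinement loop: the adapter consumes preliminary class/box/affordance predictions and returns the three signals named above. Every name, shape, and the transformer stand-in for the LLM are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class LLMAdapter(nn.Module):
    """Hypothetical sketch of the LLM Adapter: it consumes preliminary
    class/box/affordance predictions and emits three refinement signals.
    The TransformerEncoderLayer stands in for the (frozen) LLM."""

    def __init__(self, num_classes: int, num_affordances: int, llm_dim: int = 512):
        super().__init__()
        # Project concatenated preliminary predictions into the LLM token space.
        self.encode = nn.Linear(num_classes + 4 + num_affordances, llm_dim)
        # Stand-in for an LLM block (assumption; the paper uses a real LLM).
        self.llm_block = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True
        )
        # Heads that decode the three cross-task refinement signals.
        self.to_cls_prior = nn.Linear(llm_dim, num_classes)
        self.to_box_offset = nn.Linear(llm_dim, 4)
        self.to_aff_gate = nn.Linear(llm_dim, num_affordances)

    def forward(self, cls_logits, boxes, aff_logits):
        # Inputs: (batch, queries, C), (batch, queries, 4), (batch, queries, A).
        x = torch.cat([cls_logits, boxes, aff_logits], dim=-1)
        h = self.llm_block(self.encode(x))
        cls_prior = self.to_cls_prior(h)                # additive class prior
        box_offset = self.to_box_offset(h)              # residual box correction
        aff_gate = torch.sigmoid(self.to_aff_gate(h))   # multiplicative gate in [0, 1]
        return cls_prior, box_offset, aff_gate


def refine(cls_logits, boxes, aff_logits, adapter):
    """One refinement step: priors and offsets act residually, gates multiplicatively."""
    cls_prior, box_offset, aff_gate = adapter(cls_logits, boxes, aff_logits)
    return cls_logits + cls_prior, boxes + box_offset, aff_logits * aff_gate
```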
Problem

Research questions and friction points this paper is trying to address.

Jointly addressing object detection and affordance learning in real time
Overcoming the neglect of the 'what' and 'where' aspects in affordance learning
Integrating an LLM adapter to refine predictions while preserving accuracy and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

YOLOA integrates object detection and affordance learning via an LLM adapter
The LLM adapter refines both branches by generating class priors, box offsets, and affordance gates (see the sketch below)
The model achieves real-time performance with state-of-the-art accuracy on both benchmarks
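
To make the refinement step concrete, here is a toy invocation of the hypothetical sketch given after the abstract; shapes and values are arbitrary and purely illustrative.

```python
# Toy invocation of the sketch above: 2 box queries, 3 classes, 5 affordances.
adapter = LLMAdapter(num_classes=3, num_affordances=5)
cls_logits = torch.randn(1, 2, 3)
boxes = torch.rand(1, 2, 4)        # normalized (cx, cy, w, h), arbitrary values
aff_logits = torch.randn(1, 2, 5)
cls_ref, boxes_ref, aff_ref = refine(cls_logits, boxes, aff_logits, adapter)
print(cls_ref.shape, boxes_ref.shape, aff_ref.shape)
# torch.Size([1, 2, 3]) torch.Size([1, 2, 4]) torch.Size([1, 2, 5])
```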
👥 Authors
Yuqi Ji
School of Electronic Engineering, Xidian University
Junjie Ke
School of Software, Tsinghua University
Lihuo He
Professor, Xidian University
Image/Video Quality Assessment, Visual Perception
Jun Liu
School of Electronic Engineering, Xidian University
Kaifan Zhang
School of Electronic Engineering, Xidian University
Yu-Kun Lai
Professor, Cardiff University
Geometric Modeling, Geometry Processing, Computer Graphics, Image Processing, Computer Vision
Guiguang Ding
Tsinghua University
Computer Vision, Multimedia Retrieval
Xinbo Gao
School of Electronic Engineering, Xidian University