An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

📅 2025-05-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Human behavior understanding (HBU) in low-resolution visual modalities—such as depth, thermal, and infrared imaging—lacks effective large vision-language model (LVLM) adaptation methods and relies heavily on costly manual annotations. Method: This paper proposes an edge-deployable, lightweight HBU system that jointly leverages contrastive-driven pseudo-label generation and physics-informed temporal consistency verification to enable high-quality video captioning under few-shot supervision. Technically, it integrates contrastive learning, physics-constrained modeling, LLM prompt engineering, and LoRA-based efficient fine-tuning. Contribution/Results: Evaluated on a real-world regional platform and three low-resolution benchmark datasets, the system outperforms state-of-the-art LVLMs by up to 40.03% in average BERT-Score, significantly alleviating the performance degradation of LVLMs on degraded visual inputs and advancing practical edge-aware HBU.

📝 Abstract
The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions for on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing LVLM approaches struggle to understand low-resolution data because they are primarily designed for high-resolution data, such as RGB images. A quick fix is to caption a large amount of low-resolution data, but doing so requires significant labor-intensive annotation effort. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLMs for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels, improving LLMs' understanding of sequential data and yielding high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% in average BERT-Score.
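The abstract describes the Contrastive-Oriented Data Labeler only at a high level; its core mechanism — matching unlabeled clips against labeled class representatives in an embedding space and keeping only high-confidence matches as pseudo labels — can be illustrated in a few lines. The sketch below is a hypothetical minimal NumPy version, not the paper's implementation: the function names, the threshold value, and the toy embeddings are all invented for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pseudo_label(unlabeled_emb, prototype_emb, threshold=0.8):
    """Assign each unlabeled clip the label of its nearest class prototype,
    but only when the cosine similarity clears a confidence threshold;
    low-confidence clips stay unlabeled (-1)."""
    u = l2_normalize(unlabeled_emb)   # (N, d) unlabeled clip embeddings
    p = l2_normalize(prototype_emb)   # (C, d) one prototype per behavior class
    sims = u @ p.T                    # (N, C) cosine similarities
    best = sims.argmax(axis=1)
    conf = sims.max(axis=1)
    return np.where(conf >= threshold, best, -1)

# Toy example: 2 behavior prototypes in a 4-d embedding space.
protos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
clips = np.array([[0.9, 0.1, 0.0, 0.0],    # close to class 0
                  [0.1, 0.95, 0.0, 0.0],   # close to class 1
                  [0.5, 0.5, 0.5, 0.5]])   # ambiguous -> left unlabeled
labels = pseudo_label(clips, protos, threshold=0.9)
```

Here a prototype stands for one embedding per behavior class (e.g., a mean over the few labeled examples); clips whose best similarity falls below the threshold are left unlabeled rather than risk propagating a noisy caption downstream.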
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLM understanding of low-resolution human behavior data
Reducing labor-intensive annotation for low-resolution video captioning
Optimizing LVLM deployability on devices for HBU tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive-Oriented Data Labeler for pseudo labels
Physical-Knowledge Guided Captioner for error mitigation
LoRA-based fine-tuning for on-device deployability
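The third contribution rests on a standard technique: LoRA freezes the pretrained weights and trains only a low-rank additive update, which is what keeps on-device adaptation cheap. A minimal NumPy sketch of the idea follows; the dimensions, rank, and scaling factor are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8   # hypothetical layer sizes, rank, and scale

# Frozen pretrained weight (never updated during adaptation).
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapter: B starts at zero so adaptation begins
# exactly at the pretrained behavior.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, r=r, alpha=alpha):
    """y = W x + (alpha / r) * B A x — only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x, W, A, B)  # with B == 0, identical to the frozen layer W @ x
```

The adapter holds r * (d_in + d_out) parameters versus d_out * d_in for the full weight (512 vs. 4096 in this toy setting), which is why only the adapter needs to fit in device training memory.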
Siyang Jiang
The Chinese University of Hong Kong
Foundation Models, Federated Learning, Few-Shot Learning, AIoT
Bufang Yang
The Chinese University of Hong Kong, Hong Kong SAR
Lilin Xu
Columbia University, United States
Mu Yuan
The Chinese University of Hong Kong, Hong Kong SAR
Yeerzhati Abudunuer
The Chinese University of Hong Kong, Hong Kong SAR
Kaiwei Liu
The Chinese University of Hong Kong, Hong Kong SAR
Liekang Zeng
The Chinese University of Hong Kong, Hong Kong SAR
Hongkai Chen
The Chinese University of Hong Kong, Hong Kong SAR
Zhenyu Yan
The Chinese University of Hong Kong, Hong Kong SAR
Xiaofan Jiang
Associate Professor of Electrical Engineering, Columbia University
Mobile and Embedded Systems, Artificial Intelligence of Things, Smart Health and Fitness, CPHS
Guoliang Xing
The Chinese University of Hong Kong
Embedded AI, AI for Health, Autonomous Driving, Cyber-Physical Systems, Wireless Networks