An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

📅 2025-05-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Human behavior understanding (HBU) in low-resolution visual modalities—such as depth, thermal, and infrared imaging—lacks effective large vision-language model (LVLM) adaptation methods and relies heavily on costly manual annotations. Method: This paper proposes an edge-deployable, lightweight HBU system that jointly leverages contrastive-driven pseudo-label generation and physics-informed temporal consistency verification to enable high-quality video captioning under few-shot supervision. Technically, it integrates contrastive learning, physics-constrained modeling, LLM prompt engineering, and LoRA-based efficient fine-tuning. Contribution/Results: Evaluated on a real-world regional platform and three low-resolution benchmark datasets, the system outperforms state-of-the-art LVLMs by up to 40.03% in average BERT-Score, significantly alleviating the performance degradation of LVLMs on degraded visual inputs and advancing practical edge-aware HBU.

📝 Abstract
The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions for on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing LVLM approaches struggle to understand low-resolution data because they are primarily designed for high-resolution data, such as RGB images. A quick fix is to caption a large amount of low-resolution data, but doing so requires significant labor-intensive annotation effort. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLMs for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels, improving LLMs' understanding of sequential data and yielding high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% in average BERT-Score.
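The abstract describes the Contrastive-Oriented Data Labeler only at a high level; its core mechanism — matching unlabeled clips against labeled class representatives in an embedding space and keeping only high-confidence matches as pseudo labels — can be illustrated in a few lines. The sketch below is a hypothetical minimal NumPy version, not the paper's implementation: the function names, the threshold value, and the toy embeddings are all invented for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pseudo_label(unlabeled_emb, prototype_emb, threshold=0.8):
    """Assign each unlabeled clip the label of its nearest class prototype,
    but only when the cosine similarity clears a confidence threshold;
    low-confidence clips stay unlabeled (-1)."""
    u = l2_normalize(unlabeled_emb)   # (N, d) unlabeled clip embeddings
    p = l2_normalize(prototype_emb)   # (C, d) one prototype per behavior class
    sims = u @ p.T                    # (N, C) cosine similarities
    best = sims.argmax(axis=1)
    conf = sims.max(axis=1)
    return np.where(conf >= threshold, best, -1)

# Toy example: 2 behavior prototypes in a 4-d embedding space.
protos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
clips = np.array([[0.9, 0.1, 0.0, 0.0],    # close to class 0
                  [0.1, 0.95, 0.0, 0.0],   # close to class 1
                  [0.5, 0.5, 0.5, 0.5]])   # ambiguous -> left unlabeled
labels = pseudo_label(clips, protos, threshold=0.9)
```

Here a prototype stands for one embedding per behavior class (e.g., a mean over the few labeled examples); clips whose best similarity falls below the threshold are left unlabeled rather than risk propagating a noisy caption downstream.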
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLM understanding of low-resolution human behavior data
Reducing labor-intensive annotation for low-resolution video captioning
Optimizing LVLM deployability on devices for HBU tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive-Oriented Data Labeler for pseudo labels
Physical-Knowledge Guided Captioner for error mitigation
LoRA-based fine-tuning for on-device deployability
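The third contribution rests on a standard technique: LoRA freezes the pretrained weights and trains only a low-rank additive update, which is what keeps on-device adaptation cheap. A minimal NumPy sketch of the idea follows; the dimensions, rank, and scaling factor are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8   # hypothetical layer sizes, rank, and scale

# Frozen pretrained weight (never updated during adaptation).
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapter: B starts at zero so adaptation begins
# exactly at the pretrained behavior.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, r=r, alpha=alpha):
    """y = W x + (alpha / r) * B A x — only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x, W, A, B)  # with B == 0, identical to the frozen layer W @ x
```

The adapter holds r * (d_in + d_out) parameters versus d_out * d_in for the full weight (512 vs. 4096 in this toy setting), which is why only the adapter needs to fit in device training memory.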
Siyang Jiang
The Chinese University of Hong Kong
Foundation Models, Federated Learning, Few-Shot Learning, AIoT
Bufang Yang
The Chinese University of Hong Kong, Hong Kong SAR
Lilin Xu
Columbia University, United States
Mu Yuan
The Chinese University of Hong Kong, Hong Kong SAR
Yeerzhati Abudunuer
The Chinese University of Hong Kong, Hong Kong SAR
Kaiwei Liu
The Chinese University of Hong Kong, Hong Kong SAR
Liekang Zeng
The Chinese University of Hong Kong, Hong Kong SAR
Hongkai Chen
The Chinese University of Hong Kong, Hong Kong SAR
Zhenyu Yan
The Chinese University of Hong Kong, Hong Kong SAR
Xiaofan Jiang
Associate Professor of Electrical Engineering, Columbia University
Mobile and Embedded Systems, Artificial Intelligence of Things, Smart Health and Fitness, CPHS
Guoliang Xing
The Chinese University of Hong Kong
Embedded AI, AI for Health, Autonomous Driving, Cyber-Physical Systems, Wireless Networks