Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Current screen exposure monitoring for young children relies on subjective self-reports or bulky wearable devices, hindering accurate, unobtrusive, long-term quantification. This paper introduces the first lightweight wearable system for child-centric screen time tracking, integrating egocentric video with a multi-view vision-language model (VLM). Our approach employs temporal egocentric modeling, multimodal sensor fusion, and self-supervised temporal representation learning to dynamically detect real-world screen use behaviors. Key innovations include: (i) the first multi-view VLM architecture explicitly modeling contextual cues of screen interaction; and (ii) a hardware-algorithm co-designed Screen Time Tracker framework. Evaluated on a naturalistic child free-play dataset, our method achieves 23.6% higher screen time detection accuracy than single-view VLM and object-detection baselines. It enables fine-grained, long-term, passive monitoring in ecologically valid settings.
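The summary describes the multi-view VLM only at a high level. As a concrete illustration of the fusion idea, below is a minimal PyTorch sketch of a cross-attention head that pools per-view features into a single screen/no-screen prediction. This is an assumption for illustration, not the paper's architecture; the frozen image encoder, feature dimension, and module names are all hypothetical.

```python
# Hypothetical multi-view fusion head, NOT the paper's released code.
# Assumes per-view features come from a frozen CLIP-style image encoder;
# all names and dimensions here are illustrative.
import torch
import torch.nn as nn

class MultiViewScreenClassifier(nn.Module):
    """Fuses features from several egocentric views (e.g., consecutive
    frames or crops of one window) and predicts screen exposure."""

    def __init__(self, feat_dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Learnable query token that pools information across views.
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 2)  # screen vs. no-screen logits

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, n_views, feat_dim) from the frozen encoder.
        q = self.query.expand(view_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, view_feats, view_feats)
        return self.head(fused.squeeze(1))  # (batch, 2)

# Example: 4 views per window, ViT-B/32-sized (512-d) features.
feats = torch.randn(8, 4, 512)
print(MultiViewScreenClassifier()(feats).shape)  # torch.Size([8, 2])
```

A learned query token is a cheap way to fuse a variable number of views without committing to a fixed view ordering.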

📝 Abstract
Accurately monitoring the screen exposure of young children is important for research on phenomena linked to screen use, such as childhood obesity, physical activity, and social interaction. Most existing studies rely on self-report or manual measures from bulky wearable sensors, and therefore lack efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that combines egocentric images from a wearable sensor, termed the screen time tracker (STT), with a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach on a dataset of children's free-living activities, demonstrating significant improvement over existing methods based on plain vision language models and object detection models. The results support the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children's naturalistic settings.
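One natural baseline for the per-window decision the abstract describes is to prompt an off-the-shelf VLM on each view and take a majority vote. The sketch below assumes a generic chat-style vision API; `query_vlm` and the prompt wording are hypothetical stand-ins, not the STT system's actual interface.

```python
# Illustrative per-view prompting with majority voting; `query_vlm` is a
# placeholder for any image+text -> text VLM call, not the paper's API.
from typing import Callable, Sequence

PROMPT = (
    "Is the child wearing this camera looking at an electronic screen "
    "(TV, tablet, phone, monitor)? Answer yes or no."
)

def window_screen_vote(
    frames: Sequence[bytes],
    query_vlm: Callable[[bytes, str], str],
    min_yes_ratio: float = 0.5,
) -> bool:
    """Majority vote over the views in one time window."""
    answers = [query_vlm(frame, PROMPT).strip().lower() for frame in frames]
    yes = sum(answer.startswith("yes") for answer in answers)
    return yes / max(len(answers), 1) >= min_yes_ratio
```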
Problem

Research questions and friction points this paper is trying to address.

Accurately monitor children's screen exposure for health research
Overcome inefficiency of self-report and bulky wearable sensors
Improve dynamic screen exposure detection using multi-view VLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes egocentric image sequences from the wearable screen time tracker (STT); a sketch after this list shows how per-window detections could roll up into screen-time totals
Employs a multi-view vision language model to interpret screen exposure dynamically
Validated on a dataset of children's free-living activities
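As noted in the first bullet above, the tracker's value comes from accumulating per-window detections into exposure totals. A minimal sketch of that step, under the assumed simplification of fixed-length, non-overlapping windows (the paper's actual windowing scheme is not specified here):

```python
# Hypothetical accumulation of window-level screen/no-screen flags into a
# cumulative screen-time estimate; window length is an assumption.
from datetime import timedelta

def total_screen_time(window_flags: list[bool], window_seconds: float = 1.0) -> timedelta:
    """Sum the durations of windows flagged as screen exposure."""
    return timedelta(seconds=window_seconds * sum(window_flags))

# e.g., six of ten one-second windows flagged -> 0:00:06
print(total_screen_time([True] * 6 + [False] * 4))
```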
👥 Authors
Xinlong Hou
Department of Biomedical Engineering, Stevens Institute of Technology
Sen Shen
Department of Computer Science, Iowa State University
Xueshen Li
Department of Biomedical Engineering, Stevens Institute of Technology
Xinran Gao
Department of Electrical Engineering, Columbia University in the City of New York
Ziyi Huang
Assistant Professor @ Arizona State University (Trustworthy AI for Health)
Steven J. Holiday
Department of Communication & Information Science, University of Alabama
Matthew R. Cribbet
Department of Psychology, University of Alabama
Susan W. White
Department of Psychology, University of Alabama
Edward Sazonov
Professor of Electrical and Computer Engineering, The University of Alabama (wearable sensors; biomedical signal processing and pattern recognition)
Yu Gan
Department of Biomedical Engineering, Stevens Institute of Technology