Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the difficulty non-expert users face in defining and debugging personalized visual sensors for complex scenarios (e.g., “alert when a toddler engages in destructive behavior”) using natural language alone. To tackle this, the authors propose Gensors, a system for end-to-end visual sensor specification and refinement. Its core contributions are threefold: (1) a sensor-construction paradigm grounded in decomposable, individually testable criteria; (2) interaction support for requirement elicitation, parallel per-criterion debugging, image-based criterion suggestion, and automatic stress-test generation; and (3) tight integration of multimodal large language models (MLLMs) for joint visual perception and reasoning. A user study shows that Gensors gives users a significantly greater sense of control, understanding, and ease of communication when defining sensors, while helping uncover implicit decision criteria and previously unknown failure modes.
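
The criteria-based paradigm is easy to picture in code. The sketch below is a hypothetical illustration rather than the paper's implementation: `call_mllm` is a placeholder for whatever multimodal LLM API is available, and the yes/no criterion phrasing and the all-criteria-must-hold trigger policy are simplifying assumptions.

```python
"""Minimal sketch of criteria-based sensor evaluation (hypothetical, not the authors' code).

Assumption: `call_mllm` stands in for any multimodal LLM API that accepts a
text prompt plus an image and returns a text reply.
"""

from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor


def call_mllm(prompt: str, image_path: str) -> str:
    """Placeholder: send a prompt and an image to a multimodal LLM, return its reply."""
    raise NotImplementedError("wire up your MLLM provider here")


@dataclass
class Criterion:
    name: str
    question: str  # phrased so the model can answer YES or NO


def evaluate_criterion(criterion: Criterion, image_path: str) -> bool:
    """Test a single criterion in isolation so it can be debugged independently."""
    prompt = "Answer strictly YES or NO.\n" f"Criterion: {criterion.question}"
    reply = call_mllm(prompt, image_path)
    return reply.strip().upper().startswith("YES")


def sensor_fires(criteria: list[Criterion], image_path: str) -> bool:
    """Evaluate all criteria in parallel; trigger only if every one is satisfied."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: evaluate_criterion(c, image_path), criteria))
    return all(results)


# Example: a "toddler mischief" sensor decomposed into testable criteria.
toddler_sensor = [
    Criterion("toddler_present", "Is a toddler visible in the frame?"),
    Criterion("risky_object", "Is the toddler touching something fragile or dangerous?"),
    Criterion("unsupervised", "Is no adult visible near the toddler?"),
]
```

Evaluating each criterion as its own MLLM call is what makes per-criterion debugging possible: a user can see exactly which check passed or failed on a given frame.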

📝 Abstract
Multimodal large language models (MLLMs), with their expansive world knowledge and reasoning capabilities, present a unique opportunity for end-users to create personalized AI sensors capable of reasoning about complex situations. A user could describe a desired sensing task in natural language (e.g., "alert if my toddler is getting into mischief"), with the MLLM analyzing the camera feed and responding within seconds. In a formative study, we found that users saw substantial value in defining their own sensors, yet struggled to articulate their unique personal requirements and debug the sensors through prompting alone. To address these challenges, we developed Gensors, a system that empowers users to define customized sensors supported by the reasoning capabilities of MLLMs. Gensors 1) assists users in eliciting requirements through both automatically-generated and manually created sensor criteria, 2) facilitates debugging by allowing users to isolate and test individual criteria in parallel, 3) suggests additional criteria based on user-provided images, and 4) proposes test cases to help users "stress test" sensors on potentially unforeseen scenarios. In a user study, participants reported significantly greater sense of control, understanding, and ease of communication when defining sensors using Gensors. Beyond addressing model limitations, Gensors supported users in debugging, eliciting requirements, and expressing unique personal requirements to the sensor through criteria-based reasoning; it also helped uncover users' "blind spots" by exposing overlooked criteria and revealing unanticipated failure modes. Finally, we discuss how unique characteristics of MLLMs, such as hallucinations and inconsistent responses, can impact the sensor-creation process. These findings contribute to the design of future intelligent sensing systems that are intuitive and customizable by everyday users.
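
The abstract's third and fourth capabilities, image-based criterion suggestion and stress testing, can likewise be pictured as simple prompt patterns. The sketch below is again hypothetical, reusing the `call_mllm` placeholder from the sketch above; the exact prompts and the one-item-per-line output format are assumptions, not the authors' design.

```python
"""Sketch of two support features described in the abstract (hypothetical code):
suggesting extra criteria from a user-provided image, and proposing stress-test
scenarios that might expose blind spots."""


def call_mllm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder multimodal LLM call; the image is optional for text-only prompts."""
    raise NotImplementedError("wire up your MLLM provider here")


def suggest_criteria(sensor_goal: str, image_path: str) -> list[str]:
    """Ask the model which additional yes/no criteria the example image suggests."""
    prompt = (
        f"A user is building a visual sensor for: {sensor_goal}\n"
        "Based on this example image, list additional yes/no criteria the sensor "
        "should check, one per line."
    )
    reply = call_mllm(prompt, image_path)
    return [line.lstrip("-* ").strip() for line in reply.splitlines() if line.strip()]


def propose_stress_tests(sensor_goal: str, criteria: list[str]) -> list[str]:
    """Ask the model for edge-case scenarios where the sensor might misfire or miss."""
    prompt = (
        f"Sensor goal: {sensor_goal}\n"
        f"Current criteria: {'; '.join(criteria)}\n"
        "Describe challenging scenarios, one per line, in which this sensor might "
        "fire incorrectly or fail to fire."
    )
    reply = call_mllm(prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]
```

Surfacing suggested criteria and adversarial test scenarios as plain text keeps the user in the loop: they accept, reject, or edit each item rather than trusting the model's output wholesale.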
Problem

Research questions and friction points this paper is trying to address.

Personalized AI Sensors
Complex Situation Reasoning
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Customizable AI Sensors
Enhanced User Control