The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI assistants rely on explicit user queries and thus fail to proactively detect and correct inefficient operations, such as redundant edits, in spreadsheet applications like Excel; the resulting guidance arrives late, stays generic, and is hard to act on. Method: We propose the first end-to-end vision-grounded reflection system built on screen recordings, requiring no APIs, instrumentation logs, or explicit user input. A vision-language model parses interface states and reconstructs fine-grained user action sequences; a large language model then generates structured, context-aware optimization suggestions. Contribution: We introduce a two-stage behavioral inference pipeline that enables high-fidelity action reconstruction and context-sensitive recommendation generation. Empirical evaluation shows that the system accurately identifies inefficiency patterns and yields more personalized, executable workflow suggestions, significantly outperforming conventional prompt-based assistants in both user learning efficiency and task completion quality.

📝 Abstract
Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, better tailored, and more helpful for learning and improvement than those of a prompt-based spreadsheet assistant.
Problem

Research questions and friction points this paper is trying to address.

Users often fail to notice when a more efficient workflow exists in feature-rich tools like Excel
Existing assistants help only after users describe their goals or problems, which is effortful and imprecise
Prior systems depend on logs, APIs, or explicit prompts rather than observed on-screen behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses screen recordings to infer user actions
Applies vision-language model for action reconstruction
Generates structured suggestions via language model
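The two-stage pipeline described above (vision-language model reconstructs actions from frames, then a language model turns the action log into suggestions) can be sketched as follows. This is a minimal illustration, not the paper's implementation: every function and data structure here is hypothetical, with the model calls replaced by stubs and the inefficiency detector reduced to a simple repeated-edit heuristic.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """One reconstructed user action (hypothetical schema)."""
    timestamp: float
    description: str  # e.g. "edited cell B2 manually"

def reconstruct_actions(frames: list[str]) -> list[Action]:
    """Stage 1 stub: a real system would have a vision-language model
    diff consecutive interface states to recover fine-grained actions."""
    return [Action(float(t), f"edited cell B{t + 2} manually")
            for t, _ in enumerate(frames)]

def detect_repetition(actions: list[Action]) -> bool:
    """Toy inefficiency heuristic: flag runs of similar manual edits."""
    return sum("edited cell" in a.description for a in actions) >= 3

def suggest(actions: list[Action]) -> str:
    """Stage 2 stub: a real system would prompt a language model with
    the reconstructed action log to get a structured suggestion."""
    if detect_repetition(actions):
        return "Repeated manual edits detected: consider Flash Fill or a formula."
    return "No inefficiency detected."

log = reconstruct_actions(["frame0.png", "frame1.png", "frame2.png"])
print(suggest(log))
```

The key design point the sketch mirrors is the separation of concerns: the vision stage only produces a structured action log, so the recommendation stage can reason over behavior without ever touching pixels, APIs, or instrumentation.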