An Egocentric Vision-Language Model based Portable Real-time Smart Assistant

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of real-time, long-horizon first-person vision-language understanding on resource-constrained portable devices, this paper introduces Vinci—a holistic system for egocentric video-language reasoning. Methodologically, Vinci proposes a lightweight EgoVideo-VL model, the first end-to-end architecture unifying first-person visual foundation models with large language models. It features a hardware-agnostic deployment framework, a streaming long-video memory module enabling persistent contextual modeling, a cross-perspective (egocentric ↔ third-person) semantic retrieval mechanism, and a visualization-aware action generation module. Empirically, Vinci achieves state-of-the-art performance across multiple public benchmarks on scene understanding, temporal grounding, video summarization, and future planning tasks. A user study validates its practical utility in real-world scenarios. The entire stack—including models, frameworks, and tools—is open-sourced.

📝 Abstract
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.
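The abstract describes a memory module that processes a long video stream in real time while retaining contextual history. The paper does not spell out its internals, but the idea can be illustrated with a minimal sketch: keep a fixed-size window of recent frame features verbatim, and fold evicted frames into a compressed running summary. All names here (`StreamingMemory`, the mean-pooling compressor) are illustrative assumptions, not the actual EgoVideo-VL design, which presumably uses a learned compressor.

```python
from collections import deque

class StreamingMemory:
    """Illustrative fixed-capacity rolling memory over a video stream.

    Recent per-frame features are kept verbatim; frames evicted from the
    window are folded into a running summary (a simple mean here), so
    context from the distant past survives in compressed form.
    """

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.recent = deque(maxlen=capacity)  # most recent features, verbatim
        self.summary = None                   # compressed history of evicted frames
        self.n_summarized = 0                 # number of frames the summary covers

    def add(self, feature):
        # When the window is full, compress the frame about to be evicted.
        if len(self.recent) == self.capacity:
            self._fold(self.recent[0])
        self.recent.append(feature)           # deque drops the oldest automatically

    def _fold(self, feature):
        # Incremental mean stands in for a learned memory compressor.
        if self.summary is None:
            self.summary = list(feature)
        else:
            n = self.n_summarized
            self.summary = [(s * n + f) / (n + 1)
                            for s, f in zip(self.summary, feature)]
        self.n_summarized += 1

    def context(self):
        """Summary vector (if any) followed by the recent window."""
        history = [self.summary] if self.summary is not None else []
        return history + list(self.recent)

# Feed 12 two-dimensional frame features through a capacity-4 memory.
mem = StreamingMemory(capacity=4)
for i in range(12):
    mem.add([float(i), float(i)])
# The 8 oldest frames are compressed; 4 recent frames remain verbatim,
# so the context handed to the language model stays bounded in size.
```

The point of the design is that the context passed downstream has constant size regardless of stream length, which is what makes real-time long-horizon operation on a portable device plausible.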
Problem

Research questions and friction points this paper is trying to address.

Real-time AI assistance on portable devices
Integration of egocentric vision and language models
Hardware-agnostic deployment for diverse scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates an egocentric vision foundation model with a large language model
Real-time memory module for contextual video processing
Hardware-agnostic deployment on various portable devices
👥 Authors

Yifei Huang — The University of Tokyo, Japan
Jilan Xu — Fudan University
Baoqi Pei — Zhejiang University
Yuping He — Nanjing University, China
Guo Chen — Nanjing University, China
Mingfang Zhang — The University of Tokyo, Japan
Lijin Yang — The University of Tokyo, Japan
Zheng Nie — Shanghai AI Laboratory, China
Jinyao Liu — Shanghai AI Laboratory, China
Guoshun Fan — Shanghai AI Laboratory, China
Dechen Lin — Shanghai AI Laboratory, China
Fang Fang — Shanghai AI Laboratory, China
Kunpeng Li — Meta Superintelligence Labs
Chang Yuan — Shanghai AI Laboratory, China
Yaohui Wang — Shanghai AI Laboratory | Inria
Xinyuan Chen — Shanghai AI Laboratory, China
Yali Wang — Shanghai AI Laboratory, China
Yu Qiao — Shanghai AI Laboratory, China
Limin Wang — Shanghai AI Laboratory, China