🤖 AI Summary
To address the challenges of real-time, long-horizon first-person vision-language understanding on resource-constrained portable devices, this paper introduces Vinci—a holistic system for egocentric video-language reasoning. Methodologically, Vinci proposes a lightweight EgoVideo-VL model, the first end-to-end architecture unifying first-person visual foundation models with large language models. It features a hardware-agnostic deployment framework, a streaming long-video memory module enabling persistent contextual modeling, a cross-perspective (egocentric ↔ third-person) semantic retrieval mechanism, and a visualization-aware action generation module. Empirically, Vinci achieves state-of-the-art performance on multiple public benchmarks spanning scene understanding, temporal grounding, video summarization, and future planning. User studies validate its practical utility in real-world scenarios. The entire stack—including models, frameworks, and tools—is open-sourced.
📝 Abstract
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.