The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

📅 2025-12-12

🤖 AI Summary

This paper introduces and formalizes the “N-body problem”: given a single-person egocentric video, generate physically feasible, causally consistent, and coordinated plans in which N hypothetical agents perform the same set of observed tasks in parallel, with the goal of maximizing speed-up. Naively assigning video segments to agents tends to violate real-world constraints such as spatial overlap, object contention, and causal inconsistency, so the paper proposes a suite of metrics covering performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts, causal constraints). It further designs a structured prompting strategy that guides a vision-language model, Gemini 2.5 Pro, to reason about the 3D environment, object usage, and temporal dependencies. Evaluated on 100 videos from EPIC-Kitchens and HD-EPIC with N = 2, the method improves action coverage by 45% over a baseline prompt and reduces spatial collisions, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.

📝 Abstract

Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing spatial collisions, object conflicts and causal conflicts by 55%, 45% and 55%, respectively.
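To make the two families of metrics named in the abstract concrete, here is a minimal Python sketch of how a parallel plan could be scored. It is an illustration under stated assumptions, not the paper's definitions: the `Segment` fields, the zone-based notion of a spatial collision, the pairwise counting, and the function names are all hypothetical, and the paper reports rates whose exact formulas are not given on this page.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One action from the source video, rescheduled for one agent (hypothetical schema)."""
    source_index: int                 # position of the action in the original video
    agent: int                        # which of the N individuals performs it
    start: float                      # scheduled start in the parallel plan (seconds)
    end: float                        # scheduled end in the parallel plan (seconds)
    objects: frozenset = frozenset()  # objects the action uses
    zone: str = ""                    # coarse location label (assumed stand-in for 3D space)


def overlaps(a, b):
    """True if the two segments share a non-empty time interval."""
    return a.start < b.end and b.start < a.end


def cross_agent_pairs(plan):
    """Pairs of time-overlapping segments performed by different agents."""
    return [(a, b) for a, b in combinations(plan, 2)
            if a.agent != b.agent and overlaps(a, b)]


def speed_up(original_duration, plan):
    """Original video duration divided by the makespan of the parallel plan."""
    return original_duration / max(s.end for s in plan)


def coverage(plan, total_actions):
    """Fraction of the original actions that appear in the plan."""
    return len({s.source_index for s in plan}) / total_actions


def object_conflicts(plan):
    """Count concurrent cross-agent pairs that use at least one common object."""
    return sum(1 for a, b in cross_agent_pairs(plan) if a.objects & b.objects)


def spatial_collisions(plan):
    """Count concurrent cross-agent pairs placed in the same zone."""
    return sum(1 for a, b in cross_agent_pairs(plan) if a.zone and a.zone == b.zone)


def causal_violations(plan, deps):
    """Count (before, after) dependencies broken by the plan.

    A dependency is broken if `after` is scheduled but `before` is missing,
    or if `after` starts before `before` has ended.
    """
    by_index = {s.source_index: s for s in plan}
    broken = 0
    for before, after in deps:
        if after not in by_index:
            continue
        if before not in by_index or by_index[after].start < by_index[before].end:
            broken += 1
    return broken


# Toy 60-second video: fetch pan, chop onion, fry onion, wash knife.
plan = [
    Segment(0, agent=0, start=0, end=10, objects=frozenset({"pan"}), zone="hob"),
    Segment(1, agent=1, start=0, end=20, objects=frozenset({"knife", "onion"}), zone="counter"),
    Segment(2, agent=0, start=20, end=40, objects=frozenset({"pan", "onion"}), zone="hob"),
    Segment(3, agent=1, start=20, end=30, objects=frozenset({"knife"}), zone="sink"),
]
deps = [(1, 2)]  # the onion must be chopped before it is fried

print(speed_up(60.0, plan), coverage(plan, 4))
print(object_conflicts(plan), spatial_collisions(plan), causal_violations(plan, deps))
```

In this toy plan the frying step waits for the chopping step to finish, so the plan stays feasible while finishing in 40 seconds instead of 60. Starting the frying step earlier would shorten the plan further but would put the onion in two agents' hands at once and break the dependency, which is the trade-off between speed-up and feasibility that the abstract describes.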

Problem

Research questions and friction points this paper is trying to address.

Parallelizing tasks from single-person video

Ensuring physical feasibility in multi-person execution

Evaluating performance and constraint violations in parallelization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured prompting guides VLM reasoning

Metrics evaluate performance and feasibility constraints

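For N = 2, the method boosts action coverage by 45% and cuts spatial collisions, object conflicts and causal conflicts by 55%, 45% and 55% over a baseline Gemini 2.5 Pro prompt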
