Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

πŸ“… 2025-09-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Pixel-level annotation for in-hand object segmentation is prohibitively expensive, leaving the task data-scarce. Method: We introduce NS-iHOS, a novel task that leverages human instructional narrations as weak supervision to implicitly model hand–object pixel-level correspondences, enabling end-to-end segmentation without dense annotations. Our approach integrates vision-language model distillation, open-vocabulary object detection, and weakly supervised learning to extract semantic cues from the narrated actions and align them with visual features; at inference, it requires only an input image and performs purely visual reasoning. Contribution/Results: Evaluated on EPIC-Kitchens and Ego4D, our method recovers over 50% of fully supervised performance while drastically reducing reliance on manual annotations, demonstrating the effectiveness and scalability of narration-driven weak supervision for in-hand object segmentation.
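The summary above describes aligning narration semantics with visual features without pixel labels. A common way to realize such weak supervision is multiple-instance-style selection: treat the narration embedding as a bag-level label and softly pick the candidate image region that best matches it. The sketch below illustrates that idea with plain NumPy; the function name, the attention-pooled cosine loss, and all inputs are illustrative assumptions, not the paper's actual WISH objective.

```python
import numpy as np

def mil_narration_loss(region_feats, narration_emb, temperature=0.07):
    """Hypothetical weak-supervision loss: softly attend over candidate
    regions (e.g. hand-object proposals) using similarity to a narration
    embedding, then push the attended similarity toward 1.
    Illustrative sketch only -- not the paper's WISH loss."""
    # L2-normalize so dot products are cosine similarities
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = narration_emb / np.linalg.norm(narration_emb)
    sims = r @ t                       # per-region similarity to the narration
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()               # soft attention over candidate regions
    bag_sim = probs @ sims             # attention-pooled bag-level similarity
    loss = -np.log(1e-8 + (1.0 + bag_sim) / 2.0)  # maps sim in [-1,1] to a positive loss
    return loss, probs
```

At training time, the region receiving the highest attention weight serves as a pseudo-label for the in-hand object; a segmentation head distilled from these associations can then run at test time on the image alone, matching the narration-free inference described above.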

πŸ“ Abstract
Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations -- natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
Problem

Research questions and friction points this paper is trying to address.

Learning in-hand object segmentation using weak supervision from narrations
Reducing reliance on costly manual pixel-level annotations
Enabling egocentric vision applications without narration at inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using human narrations as weak supervision
Learning hand-object associations without pixel labels
Distilling narration knowledge for inference without narrations
πŸ”Ž Similar Papers
No similar papers found.