🤖 AI Summary
This work addresses the challenge of task-agnostic semantic segmentation of everyday objects in highly cluttered first-person (egocentric) visual scenes for vision-guided upper-limb neuroprostheses. We propose a prompting paradigm that automatically generates prompts from gaze fixations, integrating human visual attention with the Segment Anything Model (SAM) without requiring image-specific fine-tuning. To improve real-world applicability, we further perform lightweight adaptation of SAM on egocentric data. Evaluated on the Grasping-in-the-Wild dataset, our method improves mean IoU by up to 0.51 points, increasing segmentation accuracy and robustness in complex environments and supporting clinical use. The core contribution is the first use of physiological gaze signals as a universal, biologically grounded prompt source, bridging human attention modeling and foundation-model prompting for egocentric semantic segmentation.
📝 Abstract
In this work, we address the problem of semantic object segmentation using foundation models. We investigate whether foundation models, trained on a large number and variety of objects, can segment everyday objects in highly cluttered visual scenes without image-specific fine-tuning. The "in the wild" context is driven by the target application of vision-guided upper-limb neuroprostheses. We propose a method for generating prompts from gaze fixations to guide the Segment Anything Model (SAM) in our segmentation scenario, and we fine-tune it on egocentric visual data. Evaluation of our approach shows an improvement of the IoU segmentation quality metric by up to 0.51 points on the challenging real-world Grasping-in-the-Wild corpus, which is made available on the RoboFlow platform (https://universe.roboflow.com/iwrist/grasping-in-the-wild).
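To make the prompting idea concrete, below is a minimal sketch of how a gaze fixation can be turned into a point prompt for SAM using Meta's publicly released `segment_anything` package. The checkpoint path, input frame, gaze samples, and the median-based fixation aggregation are illustrative assumptions, not the exact pipeline or fine-tuned model described in the paper.

```python
# Sketch: gaze fixation -> SAM point prompt (assumptions: checkpoint path,
# input frame, gaze coordinates, and fixation aggregation are hypothetical).
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

def fixation_prompt(gaze_samples_xy: np.ndarray) -> np.ndarray:
    """Collapse a burst of gaze samples (N x 2, pixel coords) into a single
    foreground point prompt, here simply the median fixation location."""
    return np.median(gaze_samples_xy, axis=0, keepdims=True)

# Load a pretrained SAM backbone (checkpoint filename is an assumption).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Egocentric frame and synchronized gaze samples (hypothetical inputs).
frame_bgr = cv2.imread("egocentric_frame.jpg")
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
gaze_samples = np.array([[412.0, 288.0], [418.0, 291.0], [415.0, 286.0]])

predictor.set_image(frame_rgb)
point = fixation_prompt(gaze_samples)        # shape (1, 2), pixel coordinates
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),              # 1 = foreground point
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]         # keep the highest-scoring proposal
print("Mask covers", int(best_mask.sum()), "pixels")
```

In this reading, the gaze fixation plays the role that a manual click plays in interactive SAM usage: it supplies a task-agnostic foreground point, so no object classes or image-specific training are needed at prompting time.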