Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of reliable uncertainty estimation in existing affordance instance segmentation methods, which limits their use in applications such as robotic interaction. The authors propose a Bayesian Vision Transformer–based framework that combines sampling and ensemble strategies to jointly model pixel-wise epistemic and aleatoric uncertainties at both the semantic and spatial levels. A probabilistic mask quality metric is introduced to support the analysis of calibration and interpretability. The approach is presented as the first unified modeling of semantic and spatial uncertainties in affordance segmentation, yielding more precise and generalizable masks with well-calibrated output probabilities. On the IIT-Aff dataset, it improves the weighted Fβ score by 7.4 percentage points, and the resulting uncertainty maps are semantically interpretable.
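To make "well-calibrated output probabilities" concrete, the sketch below computes the Expected Calibration Error (ECE), a commonly used calibration measure. The source does not say which calibration metric the paper reports, so the choice of ECE, the function name `expected_calibration_error`, and the 10-bin default are assumptions made only for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted mean gap between accuracy and confidence per bin.

    confidences: predicted probabilities in [0, 1], any shape (e.g. per pixel).
    correct:     1 where the prediction was right, 0 otherwise, same shape.
    """
    confidences = np.asarray(confidences, dtype=np.float64).ravel()
    correct = np.asarray(correct, dtype=np.float64).ravel()

    # Equal-width bins over [0, 1]; interior edges map each value to a bin index.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


# Hypothetical toy inputs: an overconfident and a well-matched predictor.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
matched = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
```

A lower value means predicted probabilities track observed accuracy more closely; an overconfident model shows a larger gap in its high-confidence bins.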

📝 Abstract

Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally and could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances that adopts sample-based and ensemble approaches for uncertainty estimation. We extend an attention-based architecture to this novel task and show the effect of each component through detailed ablation experiments. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks in Bayesian models improves on deterministic networks thanks to better mask refinement and generalization. This, combined with the more powerful features extracted by attention-based mechanisms, yields an improvement of +7.4 p.p. in the $F_\beta^w$ score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears along object contours, while epistemic variance is observed in visually challenging pixels, adding interpretability to the neural network.
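As a rough illustration of how pixel-wise epistemic and aleatoric variances can be separated once several stochastic predictions are available, the sketch below applies a widely used decomposition for Bernoulli outputs. It is not taken from the paper: the function name `decompose_uncertainty`, the `(T, H, W)` input layout, and the random toy data are assumptions, and the authors' exact formulation at the semantic and spatial levels may differ.

```python
import numpy as np


def decompose_uncertainty(probs):
    """Split per-pixel predictive variance into aleatoric and epistemic parts.

    probs: array of shape (T, H, W) holding the foreground probability of each
           pixel under T stochastic forward passes or ensemble members
           (assumed layout, for illustration only).
    """
    probs = np.asarray(probs, dtype=np.float64)

    # Consensus prediction: average probability over the T samples.
    mean_prob = probs.mean(axis=0)

    # Aleatoric part: expected Bernoulli variance p(1 - p) of each sample.
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)

    # Epistemic part: disagreement between samples (variance of the means).
    epistemic = probs.var(axis=0)

    return mean_prob, aleatoric, epistemic


# Hypothetical toy input: 8 sampled 4x4 probability maps.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(8, 4, 4))
mean_prob, aleatoric, epistemic = decompose_uncertainty(samples)

# The two parts sum to the variance of the averaged Bernoulli prediction.
total = mean_prob * (1.0 - mean_prob)
```

Under this decomposition, pixels where all samples agree on an intermediate probability carry mostly aleatoric variance, while pixels where samples disagree carry epistemic variance, which is consistent with the qualitative pattern the abstract describes.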

Problem

Research questions and friction points this paper is trying to address.

Uncertainty Estimation

Instance Segmentation

Affordances

Bayesian Visual Transformers

Visual Affordances

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Visual Transformers

Uncertainty Estimation

Instance Segmentation of Affordances