🤖 AI Summary
This work addresses the challenges of inaccurate hand–object interaction parsing in egocentric videos, particularly issues such as rigid query initialization, interaction-irrelevant distractions, and physically implausible “interaction illusions.” To this end, the authors propose InterFormer, an end-to-end interaction-aware Transformer. InterFormer incorporates spatial dynamics of hand–object contact into query initialization via a dynamic query generator, employs a dual-context feature selector to fuse interaction cues with semantic features for noise suppression, and introduces a conditional co-occurrence loss to enforce physical plausibility by constraining hand–object co-occurrence consistency. Evaluated on EgoHOS and the out-of-distribution mini-HOI4D dataset, InterFormer achieves state-of-the-art performance, significantly improving both segmentation accuracy and generalization capability.
📝 Abstract
A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in predictions. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at https://github.com/yuggiehk/InterFormer.
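To make the co-occurrence idea concrete, the sketch below illustrates one plausible form of a conditional hand–object constraint: object-mask confidence is penalized in regions where no hand is predicted, discouraging "interaction illusions" in which an active object appears without a grasping hand. This is a minimal illustration under stated assumptions (per-pixel probability maps as NumPy arrays); the function name `coco_penalty_sketch` and the exact formulation are hypothetical and are not taken from the paper's actual CoCo loss.

```python
import numpy as np

def coco_penalty_sketch(hand_prob: np.ndarray, obj_prob: np.ndarray,
                        margin: float = 0.0) -> float:
    """Illustrative conditional co-occurrence penalty (not the paper's exact loss).

    An active-object prediction is physically implausible unless a hand is
    also predicted nearby, so object probability mass is penalized wherever
    hand evidence is absent.
    """
    # Hand absence acts as the condition: close to 1 where no hand is predicted.
    hand_absence = 1.0 - hand_prob
    # Penalize object confidence in hand-free regions (hallucinated interactions),
    # with an optional margin of tolerated co-occurrence violation.
    violation = np.clip(obj_prob * hand_absence - margin, 0.0, None)
    return float(violation.mean())

# Toy 2x2 maps: hand predicted in the top row only.
hand = np.array([[0.9, 0.9], [0.0, 0.0]])
obj_consistent = np.array([[0.8, 0.8], [0.0, 0.0]])  # object overlaps hand region
obj_illusory = np.array([[0.0, 0.0], [0.8, 0.8]])    # object with no hand nearby
print(coco_penalty_sketch(hand, obj_consistent))  # small penalty
print(coco_penalty_sketch(hand, obj_illusory))    # large penalty
```

In practice such a term would be added to the segmentation loss with a weighting coefficient, so that the model trades off mask fidelity against hand–object consistency.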