🤖 AI Summary
This study addresses video-based empathy recognition in privacy-sensitive settings, where only pre-extracted visual features (in tabular format) are available and raw videos are inaccessible. We pioneer the application of tabular foundation models, TabPFN v2 and TabICL, to cross-subject empathy detection under such constraints. To ensure ecological validity, we propose an individual-generalisation evaluation framework grounded in strict cross-subject validation protocols, leveraging both in-context learning and fine-tuning strategies. On a human-robot interaction benchmark, our approach achieves a cross-subject accuracy of 0.730 (+14.0 percentage points) and AUC of 0.669 (+10.5 percentage points), significantly outperforming strong baselines including conventional tree-based models. Our key contributions are threefold: (1) the first adaptation of tabular foundation models to computational empathy; (2) the design of a subject-generalisable evaluation paradigm aligned with real-world deployment requirements; and (3) a practical, privacy-compliant pathway for multimodal affective understanding under data-restriction constraints.
📝 Abstract
Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models, TabPFN v2 and TabICL, through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). In addition to the performance improvement, we contribute novel insights and an evaluation setup that ensures generalisation to unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.