🤖 AI Summary
Existing methods for recognizing public transit payment behaviors lack robustness in noisy in-vehicle environments and rely on handcrafted features that generalize poorly. To address these limitations, this work proposes iPay, a framework that fuses RGB and skeleton modalities through a multimodal mixture-of-experts network. The architecture combines a graph convolutional network that models spatiotemporal skeletal dynamics with region-focused RGB features, and a dual-attention fusion mechanism enables complementary cross-modal interactions. A key component is the Spatial Difference Discriminator (SDD), which explicitly models the relative motion between hand regions and payment anchor points to strengthen task-specific discriminability. Evaluated on a newly collected real-world in-vehicle dataset drawn from over 55 hours of footage, iPay achieves an accuracy of 83.45%, outperforming prior methods while remaining efficient enough for edge deployment.
📝 Abstract
Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To address both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.
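To make the idea of hand-to-anchor relative motion concrete, the following is a minimal sketch of the kind of per-frame descriptor such a discriminator might consume. It is not taken from the iPay code: the function name `hand_anchor_relative_motion`, the choice of a single 2D hand keypoint per frame, the fixed anchor position, and the four-column feature layout are all assumptions made for illustration.

```python
import numpy as np


def hand_anchor_relative_motion(hand_xy, anchor_xy):
    """Hypothetical helper: per-frame hand-to-anchor motion descriptors.

    hand_xy:   (T, 2) hand keypoint coordinates, one row per frame (assumed input).
    anchor_xy: (2,) position of the payment anchor, assumed fixed in the image.
    Returns a (T, 4) array with columns [dx, dy, distance, change in distance].
    """
    hand_xy = np.asarray(hand_xy, dtype=float)
    anchor_xy = np.asarray(anchor_xy, dtype=float)

    # Offset of the hand from the anchor in each frame, shape (T, 2).
    offset = hand_xy - anchor_xy

    # Euclidean hand-to-anchor distance in each frame, shape (T,).
    dist = np.linalg.norm(offset, axis=1)

    # Frame-to-frame change in distance; the first frame is defined as 0.
    delta = np.diff(dist, prepend=dist[0])

    return np.column_stack([offset, dist, delta])


# Illustrative trajectory: the hand moves straight toward an anchor at the origin.
features = hand_anchor_relative_motion([[6, 8], [3, 4], [0, 0]], [0, 0])
print(features.shape)
```

In this sketch, a run of negative values in the last column indicates a hand approaching the anchor, which is the sort of local cue the abstract says skeleton features alone tend to underemphasize.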