EgoFormer: Ego-Gesture Classification in Context of Autonomous Driving
Abstract
Decoding the intentions of passengers and other road users remains a critical challenge for autonomous vehicles and intelligent transportation systems. Hand gestures are central to these interactions, offering a direct communication channel. Moreover, egocentric video captures a first-person perspective that aligns closely with human visual perception. Yet the development of deep learning algorithms for recognizing egocentric hand gestures in autonomous driving is hindered by the lack of suitable datasets. Furthermore, there is a pressing need for gesture recognition methods to evolve from CNN-based architectures to transformer models. To address these challenges, we present EgoDriving, a novel dataset of egocentric, driving-related hand gestures. We further introduce EgoFormer, an efficient video transformer for egocentric hand gesture classification that is optimized for edge computing deployments. EgoFormer incorporates a Video Dynamic Position Bias (VDPB) module that enhances long-range positional awareness and leverages absolute positions from convolutional sub-layers within its transformer blocks. Designed for low-resource settings, EgoFormer substantially reduces inference latency and GPU utilization while maintaining competitive accuracy against state-of-the-art hand gesture recognition frameworks.
EgoDriving Dataset
Overall pipeline
VDPB
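This excerpt does not include the VDPB implementation. As a rough illustration of the idea described in the abstract, below is a minimal PyTorch sketch of a video dynamic position bias: a small MLP maps the 3D relative offset (time, height, width) between every pair of tokens in an attention window to one bias value per head, which is then added to the attention logits. The module name `VideoDynamicPositionBias`, the parameters `hidden_dim` and `window_size`, and all layer sizes are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class VideoDynamicPositionBias(nn.Module):
    """Hypothetical sketch of a dynamic position bias for video attention.

    A small MLP maps the 3D relative offset (time, height, width) between
    every pair of tokens in an attention window to one bias per head; the
    result is added to the attention logits. Names and sizes are
    illustrative, not the authors' implementation.
    """

    def __init__(self, num_heads: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_heads),
        )

    def forward(self, window_size):
        t, h, w = window_size
        # Enumerate token coordinates inside a (t, h, w) attention window.
        coords = torch.stack(
            torch.meshgrid(
                torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
            ),
            dim=-1,
        ).reshape(-1, 3).float()                        # (N, 3), N = t*h*w
        # Pairwise relative offsets, normalized to roughly [-1, 1].
        rel = coords[:, None, :] - coords[None, :, :]   # (N, N, 3)
        rel = rel / rel.abs().max().clamp(min=1.0)
        bias = self.mlp(rel)                            # (N, N, num_heads)
        return bias.permute(2, 0, 1).contiguous()       # (num_heads, N, N)


# Usage: add the learned bias to the attention logits of a windowed block.
dpb = VideoDynamicPositionBias(num_heads=8)
attn_logits = torch.randn(2, 8, 4 * 7 * 7, 4 * 7 * 7)  # (batch, heads, N, N)
attn_logits = attn_logits + dpb((4, 7, 7)).unsqueeze(0)
```

Because the bias is produced by an MLP rather than a fixed lookup table, it can be evaluated for window shapes unseen at training time, which is the usual motivation for dynamic position bias designs.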
Performance comparison on the EgoDriving dataset
| Method | Top-1 | F1 Score | Precision | Recall |
| --- | --- | --- | --- | --- |
| *Video Transformer Models* | | | | |
| TimeSformer [20] | 57.81 | - | - | - |
| MotionFormer [19] | - | - | - | - |
| VideoSwin [36] | - | - | - | - |
| *Hand Gesture Classification Models* | | | | |
| RGDC [30] | 10.00 | 9.195 | 10.057 | 9.195 |
| TBNDR [23] | 36.67 | 34.02 | 36.72 | 36.67 |
| EgoFormer (Ours) | | | | |