
SOULAYMA GAZZEH

Context-Ped: multi-modal context fusion for pedestrian crossing intention prediction

Abstract

Predicting pedestrian crossing intentions is crucial for enhancing the safety and decision-making capabilities of autonomous vehicles. Timely and accurate predictions, made well before a potential crossing event, are essential for enabling vehicles to respond appropriately and prevent accidents. Unlike state-of-the-art models that rely on complex architectures and multiple input modalities, this article proposes Context-Ped, a minimalist yet effective model for pedestrian intention prediction. To capture critical contextual information, Context-Ped processes two key visual inputs: cropped pedestrian images and the environmental road structure. By integrating residual networks with recurrent architectures, the model extracts robust spatiotemporal features by aggregating the hidden states of ConvLSTM layers. This aggregation acts as a temporal integration strategy that accumulates spatiotemporal patterns over the sequence, capturing dynamic motion cues and static contextual information, and thereby encoding both pedestrian behavior and environmental context. As a result, Context-Ped achieves accurate intention prediction while remaining computationally efficient. The proposed approach also addresses the often-overlooked issue of class imbalance by employing the focal loss function, which significantly improves performance on the minority class. For evaluation, AUC-ROC and recall are prioritized over traditional accuracy, providing clearer insight into the model's performance. Experimental results on the JAAD and PIE datasets demonstrate competitive performance, with an AUC of 79% and a crossing recall of 82% on the JAAD dataset, outperforming more complex state-of-the-art models while using only visual input data.
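
As a rough illustration of the hidden-state aggregation described above, the following PyTorch sketch runs a minimal ConvLSTM cell over a sequence of per-frame feature maps and sums its hidden states over time. The cell definition, tensor shapes, and function names here are assumptions made for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: convolutional gates over spatial feature maps."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        # One convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def aggregate_hidden_states(frames, cell):
    """Sum ConvLSTM hidden states over a frame sequence (temporal integration).

    frames: (batch, time, channels, H, W) feature maps, e.g. ResNet outputs.
    Returns an aggregated spatiotemporal feature map of shape (batch, hid_ch, H, W).
    """
    b, t, _, hgt, wid = frames.shape
    h = frames.new_zeros(b, cell.hid_ch, hgt, wid)
    c = frames.new_zeros(b, cell.hid_ch, hgt, wid)
    agg = frames.new_zeros(b, cell.hid_ch, hgt, wid)
    for step in range(t):
        h, c = cell(frames[:, step], (h, c))
        agg = agg + h  # accumulate hidden states across the sequence
    return agg
```

The accumulated map can then be pooled and passed to a classifier head; in Context-Ped, one such stream would process the cropped pedestrian frames and another the road-structure frames before fusion.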
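The abstract also credits the focal loss with improving minority-class (crossing) performance. A common binary formulation is sketched below; the hyperparameters alpha = 0.25 and gamma = 2.0 are illustrative defaults, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy down-weighted for well-classified examples.

    logits:  raw model outputs, shape (batch,)
    targets: ground-truth labels in {0, 1}, shape (batch,)
    alpha/gamma are assumed defaults, not the paper's settings.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

By scaling the cross-entropy term by (1 - p_t)^gamma, confidently classified (typically majority-class) samples contribute little to the gradient, which concentrates training on the under-represented crossing cases.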
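Finally, the evaluation metrics the abstract prioritizes can be computed with standard scikit-learn calls; the variable names below are placeholders for the model's predictions and the ground-truth labels.

```python
from sklearn.metrics import roc_auc_score, recall_score

# y_true: ground-truth crossing labels, y_score: predicted probabilities,
# y_pred: thresholded predictions (all placeholders for illustration).
auc = roc_auc_score(y_true, y_score)
crossing_recall = recall_score(y_true, y_pred, pos_label=1)
```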