REZA SHAHBAZIAN

ResViT: A Hybrid Model for Robust Deepfake Video Detection

  • Authors: Aria, A.; Mirtaheri, S.L.; Asghari, S.A.; Shahbazian, R.; Pugliese, A.
  • Year of publication: 2025
  • Type: Conference proceedings contribution published in a volume
  • OA Link: http://hdl.handle.net/10447/696385

Abstract

This paper presents a method for detecting Deepfake videos. The proposed model, ResNet Vision Transformer (ResViT), combines two complementary components: a Convolutional Neural Network (CNN) based on the ResNet50 architecture for feature extraction and a Vision Transformer (ViT) for classification. The CNN extracts spatial features from video frames, which the ViT then analyzes with attention mechanisms to distinguish authentic from manipulated videos. We evaluated ResViT on two benchmark datasets, the Deepfake Detection Challenge (DFDC) dataset and FaceForensics++, achieving strong results. The model reached an accuracy of 97.1% on DFDC, demonstrating its efficacy in Deepfake detection. On the FaceForensics++ subsets (Face2Face, FaceSwap, NeuralTextures, and DeepFakes), ResViT attained accuracies of 86.8%, 75.1%, 75.5%, and 94.9%, respectively, underscoring its robustness and adaptability across manipulation methods. These findings highlight the promise of ResViT as a reliable approach to Deepfake video detection.