<strong>Paper Title</strong><br>

MULTIMODAL DEEPFAKE DETECTION USING EFFICIENTNET-B2 AND RAWNET2 WITH ATTENTION-BASED FUSION<br>

<br>


<strong>Abstract</strong><br>

Deepfake technology undermines the integrity of digital media by enabling the creation of highly convincing fake audio-visual content. This paper presents a robust multimodal deepfake detection system that exploits complementary audio and visual cues to improve detection accuracy. EfficientNet-B2 captures spatial facial artifacts in video frames, while RawNet2 identifies acoustic inconsistencies in the raw audio waveform. A novel attention-based fusion mechanism combines the two modalities, allowing the system to detect subtle traces of manipulation that unimodal methods miss. Extensive experiments on benchmark datasets such as DFDC and FaceForensics++ show that the multimodal method achieves an accuracy of 96.3% and an AUC-ROC of 0.982, outperforming state-of-the-art unimodal methods by 7.2-12.4%. The system is particularly effective on high-quality manipulations, where neither visual nor audio artifacts alone provide reliable detection. These results indicate that multimodal integration is essential for future deepfake detection systems.

Keywords - Deepfake Detection, Multimodal Learning, Audio-Visual Analysis, EfficientNet, RawNet, Digital Media Forensics
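
To make the idea of attention-based fusion in the abstract concrete, the following is a minimal, framework-free sketch and not the paper's actual implementation. It assumes that the visual branch (EfficientNet-B2) and the audio branch (RawNet2) have each already produced an embedding projected to the same dimension. The names `attention_fuse`, `softmax`, and `query`, the use of scaled dot-product scoring, and the toy vectors are all illustrative assumptions; in a trained model the query would be a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(visual_emb, audio_emb, query):
    """Fuse two same-length embeddings using attention weights.

    Each modality is scored by a scaled dot product with `query`
    (a hypothetical learned vector, supplied here by the caller).
    The softmax of the two scores gives the modality weights, and the
    fused embedding is the weighted sum of the two inputs.
    """
    if not (len(visual_emb) == len(audio_emb) == len(query)):
        raise ValueError("embeddings and query must share one dimension")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, emb)) / scale
        for emb in (visual_emb, audio_emb)
    ]
    w_visual, w_audio = softmax(scores)
    fused = [w_visual * v + w_audio * a for v, a in zip(visual_emb, audio_emb)]
    return fused, (w_visual, w_audio)


# Toy 4-dimensional embeddings standing in for the two branch outputs.
visual = [0.2, -0.1, 0.7, 0.4]
audio = [0.9, 0.3, -0.5, 0.1]
query = [0.5, 0.5, 0.5, 0.5]

fused, weights = attention_fuse(visual, audio, query)
print("modality weights (visual, audio):", weights)
print("fused embedding:", fused)
```

The weights always sum to one, so the fused vector is a convex combination of the two modality embeddings. A classifier head operating on `fused` would then produce the real/fake decision; that head is omitted here because the abstract does not describe it.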