CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Junwen Xiong1, Ganglai Wang1, Peng Zhang1,2, Wei Huang3, Yufei Zha1,2, Guangtao Zhai4
Northwestern Polytechnical University1
Ningbo Institute of Northwestern Polytechnical University 2
Nanchang University3
Shanghai Jiao Tong University4
CVPR, 2023
Overview
In this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which comprehensively considers both audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder that elegantly associates video frames with the corresponding sound source, a novel consistency-aware predictive coding module is designed to iteratively improve the consistency between the audio and visual representations. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for final saliency map generation. Extensive experiments demonstrate that the proposed CASP-Net outperforms other state-of-the-art methods on six challenging audio-visual eye-tracking datasets.
Method
The example figure shows the saliency results of our model compared to STAViS on audio-visual temporal sequences. In the last time segment, the audio information occurring in the event is inconsistent with the visual information. Our method copes with this challenge by automatically learning to align the audio-visual features; the results of STAViS, however, show that it is incapable of addressing the problem of audio-visual inconsistency. GT denotes the ground truth.
The proposed CASP-Net is composed of: a two-stream network to extract visual and auditory saliency features, an audio-visual interaction module to integrate the visual and auditory conspicuity maps, a consistency-aware predictive coding module to reason about spatio-temporal visual features that are coherent with the audio features, and a saliency decoder to estimate the saliency map from multi-scale audio-visual features.
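The pipeline above can be sketched at a high level. The snippet below is a minimal illustrative mock-up, not the paper's implementation: the feature dimensions, the randomly initialised stand-in layers, and the simple gradient-style consistency update are all assumptions made only to show how the four stages (two-stream encoding, interaction, consistency-aware predictive coding, decoding) could fit together.

```python
import numpy as np

def linear(x, out_dim, seed):
    """Stand-in for a learned layer (hypothetical, randomly initialised)."""
    W = np.random.default_rng(seed).standard_normal((out_dim, x.size)) * 0.1
    return W @ x

def casp_forward(frames_feat, audio_feat, n_iters=4, lr=0.5):
    """Hedged sketch of the CASP-Net pipeline; shapes and updates are
    illustrative assumptions, not the paper's exact formulation."""
    # Two-stream encoding: project each modality to a shared dimension.
    v = linear(frames_feat, 16, seed=1)      # visual stream
    a = linear(audio_feat, 16, seed=2)       # auditory stream

    # Audio-visual interaction: simple additive fusion of the two streams.
    fused = v + a

    # Consistency-aware predictive coding: iteratively refine the visual
    # representation to reduce its prediction error w.r.t. the audio one.
    W = np.random.default_rng(3).standard_normal((16, 16)) * 0.1
    for _ in range(n_iters):
        err = W @ v - a           # audio predicted from vision vs. actual audio
        v = v - lr * (W.T @ err)  # nudge v toward an audio-consistent state

    # Saliency decoder: map the aggregated features to a saliency map.
    logits = linear(np.concatenate([v, fused]), 64, seed=4)
    sal = 1.0 / (1.0 + np.exp(-logits))      # sigmoid, values in (0, 1)
    return sal.reshape(8, 8)
```

The iterative loop is the key idea: rather than fusing possibly inconsistent modalities in one shot, the visual representation is repeatedly corrected by its prediction error against the audio feature, so misaligned segments (as in the STAViS comparison above) contribute less to the final map.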
Results
Qualitative results of audio-visual saliency maps.
Demo video.

BibTeX
  @inproceedings{xiong2023casp,
    title={CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective},
    author={Xiong, Junwen and Wang, Ganglai and Zhang, Peng and Huang, Wei and Zha, Yufei and Zhai, Guangtao},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2023}
  }