DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong¹,², Peng Zhang¹,², Tao You¹, Chuanyue Li¹, Wei Huang³, Yufei Zha¹,²

Northwestern Polytechnical University¹

Ningbo Institute of Northwestern Polytechnical University²

Nanchang University³

CVPR, 2024

Overview

Audio-visual saliency prediction can draw on complementary information from multiple modalities, but further performance gains are still hindered by the reliance on customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown greater promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation and progressively recover the ground-truth saliency map from a noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results by six metrics.

Method

Both the localization-based and 3D convolution-based methods use tailored network structures and sophisticated loss functions to predict saliency areas. In contrast, our diffusion-based approach is a generalized audio-visual saliency prediction framework trained with a simple MSE objective function.
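To make the "simple MSE objective" concrete, the following is a minimal sketch of one diffusion training step on a flattened saliency map. It is an illustration under stated assumptions, not the authors' implementation: the cosine noise schedule, the choice to regress the clean map rather than the noise, and all names (`cosine_alpha_bar`, `add_noise`, `training_loss`, `denoiser`) are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal fraction at step t (cosine schedule, an assumption)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(s0, t, num_steps, rng):
    """Forward process: s_t = sqrt(a) * s0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0, 1) for v in s0]


def mse(pred, target):
    """Mean squared error between two equally sized flat maps."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def training_loss(denoiser, s0, cond, num_steps, rng):
    """One training step: noise the ground-truth map, denoise, compare.

    `denoiser(s_t, t, cond)` is a hypothetical stand-in for a network that
    receives the noisy map, the timestep, and the audio-visual condition.
    This sketch assumes it predicts the clean saliency map.
    """
    t = rng.randint(1, num_steps)           # random timestep in [1, num_steps]
    s_t = add_noise(s0, t, num_steps, rng)  # corrupt the ground truth
    return mse(denoiser(s_t, t, cond), s0)  # plain MSE, no task-specific terms
```

The point of the sketch is that the loss contains no saliency-specific terms: the same objective would apply to any conditional map-prediction task.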

The proposed DiffSal consists of a Video Encoder, an Audio Encoder, and Saliency-UNet. The two encoders extract multi-scale spatio-temporal video features and audio features from the image sequences and the corresponding audio signals. Conditioned on these semantic video and audio features, Saliency-UNet performs multi-modal attention modulation to progressively recover the ground-truth saliency map from a noisy map.
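The progressive refinement described above can be sketched as an iterative sampling loop that starts from pure noise and repeatedly calls a conditional denoiser. This is a hedged illustration only: the deterministic DDIM-style update, the cosine schedule, the number of sampling steps, and the names `alpha_bar`, `predict_saliency`, and `denoiser` are assumptions, and the toy denoiser in the usage example merely stands in for Saliency-UNet.

```python
import math
import random


def alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal fraction at step t (cosine schedule, an assumption)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def predict_saliency(denoiser, video_feats, audio_feats, size,
                     num_steps=1000, sample_steps=4, seed=0):
    """Refine a noisy map into a saliency map with a few denoising steps.

    `denoiser(s_t, t, video_feats, audio_feats)` is a hypothetical stand-in
    for a network that predicts the clean map from the noisy map and the
    audio-visual condition.
    """
    rng = random.Random(seed)
    s_t = [rng.gauss(0, 1) for _ in range(size)]  # start from Gaussian noise
    # Evenly spaced timesteps from num_steps down to 0.
    ts = [round(num_steps * k / sample_steps)
          for k in range(sample_steps, -1, -1)]
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alpha_bar(t, num_steps), alpha_bar(t_prev, num_steps)
        s0_hat = denoiser(s_t, t, video_feats, audio_feats)
        # Noise implied by the current estimate of the clean map.
        eps_hat = [(x - math.sqrt(a_t) * y) / math.sqrt(1 - a_t)
                   for x, y in zip(s_t, s0_hat)]
        # Deterministic DDIM-style move to the less noisy timestep t_prev.
        s_t = [math.sqrt(a_prev) * y + math.sqrt(1 - a_prev) * e
               for y, e in zip(s0_hat, eps_hat)]
    return s_t


# Usage with a toy denoiser that ignores the noisy map and averages the
# two feature vectors; a real model would fuse them with attention.
toy = lambda s_t, t, v, a: [0.5 * (vi + ai) for vi, ai in zip(v, a)]
saliency = predict_saliency(toy, [0.2, 0.8], [0.4, 0.6], size=2)
```

Because the final step moves to timestep 0, where no noise remains, the returned map equals the denoiser's last clean-map estimate.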

Results

Qualitative results of audio-visual saliency maps.

Demo video.

BibTeX

  @inproceedings{xiong2024diffsal,
    title={DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction},
    author={Junwen Xiong and Peng Zhang and Tao You and Chuanyue Li and Wei Huang and Yufei Zha},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2024}
  }