Junwen Xiong

Hi, I am Junwen Xiong (熊俊文), a second-year Ph.D. student in the Department of Computer Science at Northwestern Polytechnical University, advised by Prof. Peng Zhang.

I'm broadly interested in multimodal learning (images, audio, video, etc.). My recent research focuses on audio-visual speech separation and sound source localization.

Email | Google Scholar | GitHub

News

[Feb. 2025] One paper UniST is accepted by TCSVT'25.

[Mar. 2024] One paper DiffSal is accepted to CVPR'24.

[Feb. 2023] One paper about audio-visual saliency prediction is accepted to CVPR'23.

[Aug. 2022] One paper about multi-modal correlation learning is accepted by TMM'22.

Publications

Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

Junwen Xiong, Chuanyue Li, Tianyu Liu, Peng Zhang, Yue Huo, Wei Huang, Yufei Zha
TCSVT, 2025 [paper] [webpage]
Is it possible to build a unified saliency model that generalizes to both video saliency prediction and video salient object detection? Sure!

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha
CVPR, 2024 [paper] [webpage]
Generalized audio-visual saliency prediction framework

CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao Zhai
CVPR, 2023 [paper] [webpage]
Audio-visual consistency perception matters

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha
TMM, 2022 [paper] [webpage]
Unified correlation learning framework to solve two audio-visual tasks

Preprints

FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction

Ganglai Wang, Peng Zhang, Junwen Xiong, Feihan Yang, Wei Huang, Yufei Zha
[paper]
Incorporating three modalities to detect talking face video manipulation

Audio-visual speech separation based on joint feature representation with cross-modal attention

Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
arXiv preprint, 2022 [paper]
Novel fusion methods for audio, video and optical flow modalities

Service

Journal Reviewing: Image and Vision Computing, TCSVT.
Conference Reviewing: ECCV 2024, ICLR 2025, CVPR 2025, ICCV 2025, ICML 2025, NeurIPS 2025.