Learning to Answer Questions in Dynamic Audio-Visual Scenarios (CVPR 2022, Oral)

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates that span different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/
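The abstract describes the spatio-temporal grounded audio-visual network only at a high level. Below is a minimal PyTorch sketch of that general idea (audio-guided spatial attention over visual regions, followed by question-guided temporal attention), assuming pre-extracted per-segment features. The tensor shapes, module names, answer-vocabulary size, and fusion scheme are illustrative assumptions, not the authors' architecture; the official code is linked from the project homepage.

```python
# Hypothetical sketch of question-guided spatio-temporal grounding for AVQA.
# Shapes and design choices are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class SpatioTemporalAVQA(nn.Module):
    def __init__(self, dim=512, num_answers=42):  # num_answers is illustrative
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, audio, visual, question):
        # audio:    (B, T, D)    per-segment audio features
        # visual:   (B, T, R, D) per-segment visual region features
        # question: (B, D)       pooled question embedding
        B, T, R, D = visual.shape
        # Spatial grounding: each segment's audio feature attends over that
        # segment's visual regions to localize the sounding objects.
        scores = torch.einsum('btd,btrd->btr', audio, visual) / D ** 0.5
        v_grounded = torch.einsum('btr,btrd->btd', scores.softmax(dim=-1), visual)
        # Temporal grounding: the question attends over the audio and the
        # spatially grounded visual segments to pick out the relevant span.
        q = self.q_proj(question).unsqueeze(1)      # (B, 1, D)
        av = torch.cat([audio, v_grounded], dim=1)  # (B, 2T, D)
        context, _ = self.temporal_attn(q, av, av)  # (B, 1, D)
        # Fuse pooled streams with the attended context and predict an answer.
        fused = torch.cat(
            [audio.mean(1), v_grounded.mean(1), context.squeeze(1)], dim=-1
        )
        return self.classifier(fused)               # (B, num_answers) logits
```

As a shape check, calling `SpatioTemporalAVQA()` on inputs of shape `(2, 10, 512)` (audio), `(2, 10, 36, 512)` (visual regions), and `(2, 512)` (question) returns `(2, 42)` answer logits.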