Visually Indicated Sound Generation by Perceptually Optimized Classification

Proc. of the 1st Multimodal Learning and Applications Workshop (MULA 2018)

Publication date: September 9, 2018

Kan Chen, Chuanxi Zhang, Chen Fang, Zhaowen Wang, Trung Bui, Ram Nevatia

Best Paper Award

Visually indicated sound generation aims to predict sound that is consistent with the video content. Previous methods addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. Meanwhile, state-of-the-art sound classification networks are now available to capture semantic-level information in the audio modality, and they can also serve visually indicated sound generation. In this paper, we explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is calculated via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. Experiments show that POCAN achieves significantly better results on the visually indicated sound generation task on two datasets.
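The perceptual loss described above can be illustrated with a minimal sketch: features are extracted from both the generated sound and the ground truth by a pre-trained classification network, and their distance is penalized during training. The feature extractor below is a hypothetical stand-in (a fixed random projection), not the actual deep sound classification network used in the paper; the loss form (mean squared distance between embeddings) is likewise an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained sound classifier's feature
# extractor: a fixed linear projection with a nonlinearity. The paper
# uses a real pre-trained classification network instead.
W = rng.standard_normal((128, 64))

def features(sound_vec):
    # Map a flattened sound representation (128-dim here) to a
    # 64-dim semantic embedding.
    return np.tanh(sound_vec @ W)

def perceptual_loss(generated, ground_truth):
    # Mean squared distance between classifier embeddings of the
    # generated sound and its ground truth.
    diff = features(generated) - features(ground_truth)
    return float(np.mean(diff ** 2))

gen = rng.standard_normal(128)
gt = rng.standard_normal(128)
print(perceptual_loss(gen, gt))  # positive for mismatched sounds
print(perceptual_loss(gt, gt))   # 0.0 when the sounds coincide
```

Because the features come from a classifier, this loss pushes the generator toward sounds that carry the same semantic (class-level) content as the ground truth, rather than matching it sample-by-sample.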