Text-Driven Foley Sound Generation With Latent Diffusion Model. (arXiv:2306.10359v2 [cs.SD] UPDATED)

Text-Driven Foley Sound Generation With Latent Diffusion Model. (arXiv:2306.10359v2 [cs.SD] UPDATED)
By: <a href="http://arxiv.org/find/cs/1/au:+Yuan_Y/0/1/0/all/0/1">Yi Yuan</a>, <a href="http://arxiv.org/find/cs/1/au:+Liu_H/0/1/0/all/0/1">Haohe Liu</a>, <a href="http://arxiv.org/find/cs/1/au:+Liu_X/0/1/0/all/0/1">Xubo Liu</a>, <a href="http://arxiv.org/find/cs/1/au:+Kang_X/0/1/0/all/0/1">Xiyuan Kang</a>, <a href="http://arxiv.org/find/cs/1/au:+Wu_P/0/1/0/all/0/1">Peipei Wu</a>, <a href="http://arxiv.org/find/cs/1/au:+Plumbley_M/0/1/0/all/0/1">Mark D.Plumbley</a>, <a href="http://arxiv.org/find/cs/1/au:+Wang_W/0/1/0/all/0/1">Wenwu Wang</a> Posted: June 23, 2023

Foley sound generation aims to synthesise the background sound for multimedia
content. Previous models usually employ a large development set with labels as
input (e.g., single numbers or one-hot vector). In this work, we propose a
diffusion model based system for Foley sound generation with text conditions.
To alleviate the data scarcity issue, our model is initially pre-trained with
large-scale datasets and fine-tuned to this task via transfer learning using
the contrastive language-audio pertaining (CLAP) technique. We have observed
that the feature embedding extracted by the text encoder can significantly
affect the performance of the generation model. Hence, we introduce a trainable
layer after the encoder to improve the text embedding produced by the encoder.
In addition, we further refine the generated waveform by generating multiple
candidate audio clips simultaneously and selecting the best one, which is
determined in terms of the similarity score between the embedding of the
candidate clips and the embedding of the target text label. Using the proposed
method, our system ranks ${1}^{st}$ among the systems submitted to DCASE
Challenge 2023 Task 7. The results of the ablation studies illustrate that the
proposed techniques significantly improve sound generation performance. The
codes for implementing the proposed system are available online.

Provided by:



Moderator and Editor