DiffMusic: Efficient Music Generation From a Single Image Using Diffusion-Based Representations
Citations (Web of Science): 0
Citations (Scopus): 0
Abstract

In this paper, we present DiffMusic, a novel methodology for generating high-quality music from a single image. Existing methods achieve multi-modality by integrating data from multiple domains or by relying on high-cost components, such as Large Language Models (LLMs), to produce high-quality music. The proposed DiffMusic instead adopts a diffusion-based approach that generates music descriptions from a single image, addressing the issues encountered in traditional music generation methods. Unlike conventional image captioning, this approach does not simply describe the scene within the image; it generates descriptions that capture the genre-related, melodic, and rhythmic elements essential for music generation. This enables the direct conversion of an image into music in a single inference pass while allowing seamless integration with various existing music generators. Experimental results demonstrate that our method outperforms existing approaches by 29.9% in terms of Fréchet Audio Distance while remaining more cost-effective.
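The abstract describes a two-stage pipeline: a diffusion model converts image features into a music-oriented description (genre, melodic, and rhythmic attributes), and that description is then passed as a prompt to an existing text-to-music generator. The sketch below is a minimal, hypothetical illustration of this flow, assuming placeholder names (ImageToDescriptionDiffusion, MusicDescription, generate_music_prompt_from_image); it is not the authors' released code or actual architecture.

```python
"""Hypothetical sketch of the image -> music-description -> music-prompt pipeline
outlined in the abstract. All classes and functions here are illustrative
stand-ins, not DiffMusic's actual implementation."""

from dataclasses import dataclass
import numpy as np


@dataclass
class MusicDescription:
    """Music-relevant attributes rather than a plain scene caption."""
    genre: str
    tempo_bpm: int
    mood: str

    def to_prompt(self) -> str:
        # The prompt can be fed to any off-the-shelf text-to-music generator.
        return f"{self.genre} track, {self.tempo_bpm} BPM, {self.mood} mood"


class ImageToDescriptionDiffusion:
    """Stand-in for a diffusion model that denoises a description latent
    conditioned on image features (schematic update rule only)."""

    def __init__(self, steps: int = 50):
        self.steps = steps

    def sample(self, image_features: np.ndarray) -> MusicDescription:
        # Start from Gaussian noise and iteratively pull the latent toward the
        # image-conditioned target (a toy stand-in for a real denoising loop).
        x = np.random.randn(*image_features.shape)
        for t in range(self.steps, 0, -1):
            x = x - (x - image_features) / t
        # Decode the final latent into music attributes (placeholder decoding).
        tempo = int(80 + 60 * float(np.clip(x.mean(), 0.0, 1.0)))
        return MusicDescription(genre="ambient electronic", tempo_bpm=tempo, mood="calm")


def generate_music_prompt_from_image(image_features: np.ndarray) -> str:
    """Single inference pass: image features -> music description -> text prompt."""
    description = ImageToDescriptionDiffusion().sample(image_features)
    return description.to_prompt()


if __name__ == "__main__":
    fake_image_features = np.random.randn(128)  # stand-in for an image encoder output
    print(generate_music_prompt_from_image(fake_image_features))
```

In this reading, the diffusion stage replaces the costly LLM-based description step, and the resulting prompt plugs into existing music generators without retraining them.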

Keywords

Visualization; Music; Speech processing; Videos; Training; Diffusion models; Vectors; Instruments; Translation; Transformers; Music generation; diffusion-based representations; large language models
Title
DiffMusic: Efficient Music Generation From a Single Image Using Diffusion-Based Representations
Authors
Hong, Jin; Park, Juhyeon; Kwon, Junseok
DOI
10.1109/TASLPRO.2026.3660263
Publication Date
2026
Type
Article
Journal
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
Volume
34
Pages
1126 ~ 1136