Detail View
- Hong, Jin
- Park, Juhyeon
- Kwon, Junseok
WEB OF SCIENCE: 0
SCOPUS: 0
Abstract
In this paper, we present DiffMusic, a novel methodology for generating high-quality music from a single image. Existing methodologies achieve multimodality by integrating data from various domains or by employing high-cost approaches, such as Large Language Models (LLMs), to generate high-quality music. The proposed DiffMusic adopts a diffusion-based approach to generate music descriptions from a single image, addressing the issues encountered in traditional music generation methods. Unlike traditional image captioning, this approach does not simply describe the scene within the image; instead, it generates descriptions that capture the genre-related, melodic, and rhythmic elements essential for music generation. This enables the direct conversion of an image into music in a single inference process while allowing seamless integration with various existing music generators. Experimental results demonstrate that our method outperforms existing approaches by 29.9% in terms of Fréchet Audio Distance while remaining more cost-effective.
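The abstract outlines a two-stage pipeline: a diffusion-based model turns the input image into a music-oriented description (genre, melody, rhythm), which an existing text-to-music generator can then consume. The sketch below illustrates that flow under stated assumptions only; the names (MusicDescription, describe_image, generate_music) are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of the pipeline described in the abstract, assuming a
# two-stage design: (1) a diffusion-based model maps an image to a
# music-oriented description, (2) that description is handed to an existing
# text-to-music generator. All names here are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class MusicDescription:
    """Music-oriented attributes rather than a plain scene caption."""
    genre: str
    melody: str
    rhythm: str

    def to_prompt(self) -> str:
        # Flatten the attributes into a prompt a text-to-music model can use.
        return f"{self.genre} track, {self.melody} melody, {self.rhythm} rhythm"


def describe_image(image_path: str) -> MusicDescription:
    """Placeholder for the diffusion-based description stage.

    A real model would condition the denoising process on image features;
    here we return a fixed example so the sketch stays self-contained.
    """
    return MusicDescription(genre="lo-fi jazz", melody="soft piano",
                            rhythm="slow swing")


def generate_music(prompt: str) -> bytes:
    """Placeholder for any off-the-shelf text-to-music generator."""
    raise NotImplementedError("plug in a text-to-music model of your choice")


if __name__ == "__main__":
    description = describe_image("example.jpg")   # single pass over the image
    prompt = description.to_prompt()
    print("prompt for the music generator:", prompt)
    # audio = generate_music(prompt)  # final stage, omitted in this sketch
```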
Keywords
- Title
- DiffMusic: Efficient Music Generation From a Single Image Using Diffusion-Based Representations
- Authors
- Hong, Jin; Park, Juhyeon; Kwon, Junseok
- Publication Date
- 2026
- Type
- Article
- Journal
- IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
- Volume
- 34
- Pages
- 1126–1136