AImoclips: Benchmarking Emotion Conveyance in Text-to-Music Generation Using a Dimensional Valence-Arousal Framework
AImoclips is a comprehensive benchmark dataset for evaluating how well text-to-music (TTM) generation systems convey intended emotions to human listeners. The dataset consists of 991 10-second music clips generated by six different TTM models, each annotated with valence and arousal ratings from multiple human evaluators.
This benchmark addresses the critical question of emotion conveyance in AI-generated music by providing:
- 991 music clips across 12 distinct emotions
- 6 TTM models including both open-source and commercial systems
- Human annotations with valence-arousal ratings from multiple evaluators
- Reference emotion intents with established valence-arousal coordinates
The dataset enables researchers to evaluate and compare TTM models based on their ability to generate music that successfully conveys intended emotions to human listeners.
AImoclips/
├── README.md # This file
├── clips_metadata.csv # Main dataset with clip ratings and metadata
├── ratings_metadata.csv # Individual ratings from each annotator
├── emotion_intents_metadata.csv # Reference emotion coordinates
└── music_clips/ # Audio files organized by model
    ├── AudioLDM2/ # 168 clips from AudioLDM2
    │   ├── AudioLDM2_angry_1.wav
    │   ├── AudioLDM2_calm_1.wav
    │   └── ...
    ├── MusicGen/ # 168 clips from MusicGen
    │   ├── MusicGen_angry_1.wav
    │   ├── MusicGen_calm_1.wav
    │   └── ...
    ├── Mustango/ # 168 clips from Mustango
    │   ├── Mustango_angry_1.wav
    │   ├── Mustango_calm_1.wav
    │   └── ...
    ├── StableAudio/ # 168 clips from StableAudio
    │   ├── StableAudio_angry_1.wav
    │   ├── StableAudio_calm_1.wav
    │   └── ...
    ├── Suno/ # 168 clips from Suno
    │   ├── Suno_angry_1.wav
    │   ├── Suno_calm_1.wav
    │   └── ...
    └── Udio/ # 168 clips from Udio
        ├── Udio_angry_1.wav
        ├── Udio_calm_1.wav
        └── ...
- AudioLDM 2: open-source
- MusicGen: open-source
- Mustango: open-source
- Stable Audio Open v1.0: open-source
- Suno v4.5: commercial
- Udio v1.5 Allegro: commercial
The dataset includes 12 distinct emotions distributed across four quadrants of the valence-arousal space:
- High Valence, High Arousal: happy, excited, energetic
- Low Valence, High Arousal: angry, anxious, scared
- Low Valence, Low Arousal: sad, gloomy, dull
- High Valence, Low Arousal: relaxed, calm, tranquil
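For programmatic use, the quadrant grouping above can be written as a lookup table. This is a minimal sketch: the names `QUADRANT_EMOTIONS` and `EMOTION_TO_QUADRANT` are hypothetical, the keys reuse the quadrant codes from emotion_intents_metadata.csv, and the lowercase emotion spellings are assumed to match those used in the CSV files.

```python
# Quadrant code -> emotions, copied from the list above.
QUADRANT_EMOTIONS = {
    "hv_ha": ["happy", "excited", "energetic"],  # High Valence, High Arousal
    "lv_ha": ["angry", "anxious", "scared"],     # Low Valence, High Arousal
    "lv_la": ["sad", "gloomy", "dull"],          # Low Valence, Low Arousal
    "hv_la": ["relaxed", "calm", "tranquil"],    # High Valence, Low Arousal
}

# Inverse lookup: emotion -> quadrant code.
EMOTION_TO_QUADRANT = {
    emotion: quadrant
    for quadrant, emotions in QUADRANT_EMOTIONS.items()
    for emotion in emotions
}

print(EMOTION_TO_QUADRANT["calm"])
```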
Audio files follow the pattern: {Model}_{Emotion}_{Index}.wav
- Model: One of the six TTM models
- Emotion: Target emotion for generation
- Index: Clip number (1-14) for that model-emotion combination
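The naming pattern can be split back into its parts in code. The following is a minimal sketch: the helper `parse_clip_filename` and the `ClipName` type are hypothetical and not shipped with the dataset, and the code assumes that neither model nor emotion names contain an underscore, which holds for the example filenames in the directory listing.

```python
from pathlib import Path
from typing import NamedTuple


class ClipName(NamedTuple):
    model: str
    emotion: str
    index: int


def parse_clip_filename(path: str) -> ClipName:
    """Split '{Model}_{Emotion}_{Index}.wav' into its three parts."""
    stem = Path(path).stem  # drop any directory and the '.wav' extension
    # Assumption: model and emotion names contain no underscore,
    # so the stem splits into exactly three fields.
    model, emotion, index = stem.split("_")
    return ClipName(model, emotion, int(index))


print(parse_clip_filename("music_clips/AudioLDM2/AudioLDM2_angry_1.wav"))
```

A filename that does not split into exactly three fields raises a `ValueError`, which makes unexpected names visible instead of silently mis-parsing them.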
clips_metadata.csv is the main dataset file, containing human ratings and metadata for all 991 clips.
| Column | Type | Description |
|---|---|---|
| index | integer | Unique identifier for each clip (1-991) |
| audio_file | string | Filename of the audio clip |
| model | string | TTM model that generated the clip |
| emotion | string | Target emotion intent used for generation |
| file_index | integer | Index number for this emotion-model combination (1-14) |
| valence | float | Average valence rating from human annotators (1-9 scale) |
| arousal | float | Average arousal rating from human annotators (1-9 scale) |
| num_annotators | integer | Number of human annotators who rated this clip |
emotion_intents_metadata.csv contains the reference valence-arousal coordinates for each emotion, establishing the intended emotional targets.
| Column | Type | Description |
|---|---|---|
| quadrant | string | Valence-arousal quadrant (hv_ha, lv_ha, lv_la, hv_la) |
| emotion | string | Emotion intent |
| gt_valence | float | Reference valence coordinate (1-9 scale) |
| gt_arousal | float | Reference arousal coordinate (1-9 scale) |
Quadrant codes:
- hv_ha: High Valence, High Arousal
- lv_ha: Low Valence, High Arousal
- lv_la: Low Valence, Low Arousal
- hv_la: High Valence, Low Arousal
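One way to use the reference coordinates is to join them to the clip ratings on `emotion` and measure how far each clip's mean rating falls from its target. The sketch below is illustrative only: the Euclidean distance and the `va_error` column name are assumptions, not necessarily the metric used by the dataset authors, and the rows are made-up toy values that only reuse the documented column names.

```python
import numpy as np
import pandas as pd

# Toy rows with the documented column names; all values are made up.
clips_df = pd.DataFrame({
    "audio_file": ["MusicGen_happy_1.wav", "Suno_happy_1.wav", "MusicGen_sad_1.wav"],
    "model": ["MusicGen", "Suno", "MusicGen"],
    "emotion": ["happy", "happy", "sad"],
    "valence": [6.0, 8.0, 3.0],
    "arousal": [7.0, 6.0, 2.0],
})
emotions_df = pd.DataFrame({
    "quadrant": ["hv_ha", "lv_la"],
    "emotion": ["happy", "sad"],
    "gt_valence": [8.0, 2.0],
    "gt_arousal": [6.0, 2.0],
})

# Attach the reference coordinates to every clip with that target emotion.
merged = clips_df.merge(emotions_df, on="emotion", how="left", validate="many_to_one")

# Assumed metric: Euclidean distance in the valence-arousal plane.
merged["va_error"] = np.hypot(
    merged["valence"] - merged["gt_valence"],
    merged["arousal"] - merged["gt_arousal"],
)

# Mean distance per model; lower means closer to the intended emotion.
per_model_error = merged.groupby("model")["va_error"].mean()
print(per_model_error)
```

With the real files, the two toy frames would be replaced by `pd.read_csv('clips_metadata.csv')` and `pd.read_csv('emotion_intents_metadata.csv')`.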
ratings_metadata.csv contains the individual valence and arousal ratings from each human annotator for every audio clip, with anonymized user identifiers.
| Column | Type | Description |
|---|---|---|
| user | string | Anonymized participant identifier (P1, P2, ..., P111) |
| audio_file | string | Filename of the audio clip |
| valence | integer | Valence rating from the annotator (1-9 scale) |
| arousal | integer | Arousal rating from the annotator (1-9 scale) |
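Because clips_metadata.csv describes its `valence` and `arousal` columns as averages over annotators, those values should be recoverable from ratings_metadata.csv by grouping on `audio_file`. The sketch below assumes a plain arithmetic mean and uses made-up toy ratings that only reuse the documented column names.

```python
import pandas as pd

# Toy ratings with the documented column names; all values are made up.
ratings_df = pd.DataFrame({
    "user": ["P1", "P2", "P3", "P1", "P2"],
    "audio_file": ["Udio_calm_1.wav"] * 3 + ["Udio_angry_1.wav"] * 2,
    "valence": [7, 6, 8, 2, 4],
    "arousal": [3, 2, 4, 8, 6],
})

# Aggregate individual ratings into one row per clip.
per_clip = (
    ratings_df.groupby("audio_file")
    .agg(
        valence=("valence", "mean"),        # assumed: arithmetic mean
        arousal=("arousal", "mean"),
        num_annotators=("user", "nunique"),
    )
    .reset_index()
)
print(per_clip)
```

Comparing `per_clip` against the corresponding rows of clips_metadata.csv is a quick consistency check after downloading the dataset.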
import pandas as pd
# Load clip ratings and metadata
clips_df = pd.read_csv('clips_metadata.csv')
# Load individual annotator ratings
ratings_df = pd.read_csv('ratings_metadata.csv')
# Load reference emotion coordinates
emotions_df = pd.read_csv('emotion_intents_metadata.csv')
print(f"Dataset contains {len(clips_df)} clips from {clips_df['model'].nunique()} models")
print(f"Emotions covered: {sorted(clips_df['emotion'].unique())}")
print(f"Individual ratings: {len(ratings_df)} from {ratings_df['user'].nunique()} annotators")

The AImoclips dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The preprint describing this dataset is currently available on arXiv: AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation.
If you have any issues with this dataset, feel free to contact us here: