AImoclips: Benchmarking Emotion Conveyance in Text-to-Music Generation Using a Dimensional Valence-Arousal Framework

AImoclips is a benchmark dataset for evaluating how well text-to-music (TTM) generation systems convey intended emotions to human listeners. The dataset consists of 991 ten-second music clips generated by six TTM models, each annotated with valence and arousal ratings from multiple human evaluators.

Overview

This benchmark addresses the critical question of emotion conveyance in AI-generated music by providing:

991 music clips across 12 distinct emotions
6 TTM models including both open-source and commercial systems
Human annotations with valence-arousal ratings from multiple evaluators
Reference emotion intents with established valence-arousal coordinates

The dataset enables researchers to evaluate and compare TTM models based on their ability to generate music that successfully conveys intended emotions to human listeners.

Project Structure

AImoclips/
├── README.md                      # This file
├── clips_metadata.csv             # Main dataset with clip ratings and metadata
├── ratings_metadata.csv           # Individual ratings from each annotator
├── emotion_intents_metadata.csv   # Reference emotion coordinates
└── music_clips/                   # Audio files organized by model
    ├── AudioLDM2/                 # 168 clips from AudioLDM2
    │   ├── AudioLDM2_angry_1.wav
    │   ├── AudioLDM2_calm_1.wav
    │   └── ...
    ├── MusicGen/                  # 168 clips from MusicGen
    │   ├── MusicGen_angry_1.wav
    │   ├── MusicGen_calm_1.wav
    │   └── ...
    ├── Mustango/                  # 168 clips from Mustango
    │   ├── Mustango_angry_1.wav
    │   ├── Mustango_calm_1.wav
    │   └── ...
    ├── StableAudio/               # 168 clips from StableAudio
    │   ├── StableAudio_angry_1.wav
    │   ├── StableAudio_calm_1.wav
    │   └── ...
    ├── Suno/                      # 168 clips from Suno
    │   ├── Suno_angry_1.wav
    │   ├── Suno_calm_1.wav
    │   └── ...
    └── Udio/                      # 168 clips from Udio
        ├── Udio_angry_1.wav
        ├── Udio_calm_1.wav
        └── ...

Dataset Details

Models Evaluated

AudioLDM 2: open-source
MusicGen: open-source
Mustango: open-source
Stable Audio Open v1.0: open-source
Suno v4.5: commercial
Udio v1.5 Allegro: commercial

Emotions Covered

The dataset includes 12 distinct emotions distributed across four quadrants of the valence-arousal space:

High Valence, High Arousal: happy, excited, energetic
Low Valence, High Arousal: angry, anxious, scared
Low Valence, Low Arousal: sad, gloomy, dull
High Valence, Low Arousal: relaxed, calm, tranquil

File Naming Convention

Audio files follow the pattern: {Model}_{Emotion}_{Index}.wav

Model: One of the six TTM models
Emotion: Target emotion for generation
Index: Clip number (1-14) for that model-emotion combination
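
The naming pattern can be split back into its three parts with a small helper. The sketch below is illustrative only: `parse_clip_name` and `ClipName` are hypothetical names, not part of the dataset, and the pattern assumes that neither the model nor the emotion contains an underscore, which holds for the folder and file names listed under Project Structure.

```python
import re
from dataclasses import dataclass

# Pattern for "{Model}_{Emotion}_{Index}.wav".
# Assumption: model and emotion names contain no underscores.
CLIP_NAME = re.compile(r"^(?P<model>[^_]+)_(?P<emotion>[^_]+)_(?P<index>\d+)\.wav$")


@dataclass(frozen=True)
class ClipName:
    """Parsed components of a clip filename (hypothetical helper type)."""
    model: str
    emotion: str
    index: int


def parse_clip_name(filename: str) -> ClipName:
    """Split a clip filename into model, emotion, and integer index."""
    match = CLIP_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected clip filename: {filename!r}")
    return ClipName(match["model"], match["emotion"], int(match["index"]))


print(parse_clip_name("MusicGen_calm_1.wav"))
```

Raising `ValueError` on a mismatch, rather than returning `None`, is a design choice that makes stray files in `music_clips/` fail loudly.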

Data Files

clips_metadata.csv

This is the main dataset file containing human ratings and metadata for all 991 clips.

Column	Type	Description
`index`	integer	Unique identifier for each clip (1-991)
`audio_file`	string	Filename of the audio clip
`model`	string	TTM model that generated the clip
`emotion`	string	Target emotion intent used for generation
`file_index`	integer	Index number for this emotion-model combination (1-14)
`valence`	float	Average valence rating from human annotators (1-9 scale)
`arousal`	float	Average arousal rating from human annotators (1-9 scale)
`num_annotators`	integer	Number of human annotators who rated this clip

emotion_intents_metadata.csv

This file contains the reference valence-arousal coordinates for each emotion, establishing the intended emotional targets.

Column	Type	Description
`quadrant`	string	Valence-arousal quadrant (hv_ha, lv_ha, lv_la, hv_la)
`emotion`	string	Emotion intent
`gt_valence`	float	Reference valence coordinate (1-9 scale)
`gt_arousal`	float	Reference arousal coordinate (1-9 scale)

Quadrant codes:

hv_ha: High Valence, High Arousal
lv_ha: Low Valence, High Arousal
lv_la: Low Valence, Low Arousal
hv_la: High Valence, Low Arousal
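
To relate a rated clip back to these codes, a (valence, arousal) pair can be mapped to a quadrant. The following is a minimal sketch, not the authors' procedure: `quadrant_code` is a hypothetical helper, and it assumes the quadrant boundary sits at the midpoint of the 1-9 scale (5.0), with ties counted as "high". The `EMOTION_QUADRANT` mapping simply restates the emotion groups listed under Emotions Covered.

```python
# Emotion-to-quadrant mapping, restating the groups listed in this README.
EMOTION_QUADRANT = {
    "happy": "hv_ha", "excited": "hv_ha", "energetic": "hv_ha",
    "angry": "lv_ha", "anxious": "lv_ha", "scared": "lv_ha",
    "sad": "lv_la", "gloomy": "lv_la", "dull": "lv_la",
    "relaxed": "hv_la", "calm": "hv_la", "tranquil": "hv_la",
}


def quadrant_code(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Return the quadrant code for a rating pair on the 1-9 scale.

    Assumption: the boundary is the scale midpoint, and a value equal
    to the midpoint counts as "high".
    """
    v = "hv" if valence >= midpoint else "lv"
    a = "ha" if arousal >= midpoint else "la"
    return f"{v}_{a}"


# Check whether a (made-up) mean rating lands in the intended quadrant.
intended = EMOTION_QUADRANT["calm"]
perceived = quadrant_code(valence=6.4, arousal=3.1)
print(intended, perceived, intended == perceived)
```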

ratings_metadata.csv

This file contains individual valence and arousal ratings from each human annotator for every audio clip, with anonymized user identifiers.

Column	Type	Description
`user`	string	Anonymized participant identifier (P1, P2, ..., P111)
`audio_file`	string	Filename of the audio clip
`valence`	integer	Valence rating from the annotator (1-9 scale)
`arousal`	integer	Arousal rating from the annotator (1-9 scale)
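
Because this file keeps one row per annotator and clip, per-clip statistics can be derived from it directly. The sketch below uses a few made-up rows in the same column layout instead of the real file. It computes the mean, the sample standard deviation (as a rough indicator of annotator disagreement), and the annotator count per clip. Whether these means reproduce the `valence` and `arousal` columns of clips_metadata.csv exactly is an assumption that should be checked against the data.

```python
import pandas as pd

# Toy rows in the ratings_metadata.csv layout; the values are made up.
ratings_df = pd.DataFrame(
    {
        "user": ["P1", "P2", "P3", "P1", "P2"],
        "audio_file": ["Suno_calm_1.wav"] * 3 + ["Udio_angry_1.wav"] * 2,
        "valence": [7, 6, 8, 2, 3],
        "arousal": [3, 2, 4, 8, 7],
    }
)

# One row per clip: mean rating, spread across annotators, annotator count.
per_clip = (
    ratings_df.groupby("audio_file")
    .agg(
        valence=("valence", "mean"),
        arousal=("arousal", "mean"),
        valence_std=("valence", "std"),
        arousal_std=("arousal", "std"),
        num_annotators=("user", "nunique"),
    )
    .reset_index()
)

print(per_clip)
```

To run this on the real data, replace the toy frame with `pd.read_csv('ratings_metadata.csv')`.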

Usage

Loading the Data

import pandas as pd

# Load clip ratings and metadata
clips_df = pd.read_csv('clips_metadata.csv')

# Load individual annotator ratings
ratings_df = pd.read_csv('ratings_metadata.csv')

# Load reference emotion coordinates  
emotions_df = pd.read_csv('emotion_intents_metadata.csv')

print(f"Dataset contains {len(clips_df)} clips from {clips_df['model'].nunique()} models")
print(f"Emotions covered: {sorted(clips_df['emotion'].unique())}")
print(f"Individual ratings: {len(ratings_df)} from {ratings_df['user'].nunique()} annotators")
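
Comparing Ratings with Reference Coordinates

One possible way to compare models is to join each clip's mean rating to the reference coordinates of its target emotion and measure how far apart they are. The sketch below uses Euclidean distance in the valence-arousal plane, which is an illustrative choice and not necessarily the metric used in the paper. The data frames are small stand-ins with made-up numbers (including the `gt_valence` and `gt_arousal` values) so that the example runs on its own. The column names follow the tables above, and `va_error` is a hypothetical column name.

```python
import pandas as pd

# Stand-in for clips_metadata.csv (made-up ratings).
clips_df = pd.DataFrame(
    {
        "model": ["MusicGen", "MusicGen", "Suno", "Suno"],
        "emotion": ["calm", "angry", "calm", "angry"],
        "valence": [7.0, 5.0, 7.0, 2.0],
        "arousal": [3.0, 4.0, 3.0, 8.0],
    }
)

# Stand-in for emotion_intents_metadata.csv (made-up reference coordinates).
emotions_df = pd.DataFrame(
    {
        "emotion": ["calm", "angry"],
        "gt_valence": [7.0, 2.0],
        "gt_arousal": [3.0, 8.0],
    }
)

# Attach the reference coordinates of each clip's target emotion.
merged = clips_df.merge(emotions_df, on="emotion", how="left", validate="many_to_one")

# Euclidean distance between mean rating and reference point (assumed metric).
merged["va_error"] = (
    (merged["valence"] - merged["gt_valence"]) ** 2
    + (merged["arousal"] - merged["gt_arousal"]) ** 2
) ** 0.5

# Lower mean distance means ratings sit closer to the intended emotion.
per_model = merged.groupby("model")["va_error"].mean().sort_values()
print(per_model)
```

With the real files loaded as shown earlier, the same merge and distance steps apply to `clips_df` and `emotions_df` unchanged.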

License

The AImoclips dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The preprint describing this dataset is currently available on arXiv: AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation.

Contact

If you have any issues with this dataset, feel free to contact us at:

rotation@kaist.ac.kr
