HunRotation/AImoclips

AImoclips: Benchmarking Emotion Conveyance in Text-to-Music Generation Using a Dimensional Valence-Arousal Framework

AImoclips is a benchmark dataset for evaluating how well text-to-music (TTM) generation systems convey intended emotions to human listeners. It consists of 991 ten-second music clips generated by six TTM models, each annotated with valence and arousal ratings from multiple human evaluators.

Overview

This benchmark addresses the critical question of emotion conveyance in AI-generated music by providing:

  • 991 music clips across 12 distinct emotions
  • 6 TTM models: four open-source and two commercial systems
  • Human annotations with valence-arousal ratings from multiple evaluators
  • Reference emotion intents with established valence-arousal coordinates

The dataset enables researchers to evaluate and compare TTM models based on their ability to generate music that successfully conveys intended emotions to human listeners.

Project Structure

AImoclips/
├── README.md                      # This file
├── clips_metadata.csv             # Main dataset with clip ratings and metadata
├── ratings_metadata.csv           # Individual ratings from each annotator
├── emotion_intents_metadata.csv   # Reference emotion coordinates
└── music_clips/                   # Audio files organized by model
    ├── AudioLDM2/                 # 168 clips from AudioLDM2
    │   ├── AudioLDM2_angry_1.wav
    │   ├── AudioLDM2_calm_1.wav
    │   └── ...
    ├── MusicGen/                  # 168 clips from MusicGen
    │   ├── MusicGen_angry_1.wav
    │   ├── MusicGen_calm_1.wav
    │   └── ...
    ├── Mustango/                  # 168 clips from Mustango
    │   ├── Mustango_angry_1.wav
    │   ├── Mustango_calm_1.wav
    │   └── ...
    ├── StableAudio/               # 168 clips from StableAudio
    │   ├── StableAudio_angry_1.wav
    │   ├── StableAudio_calm_1.wav
    │   └── ...
    ├── Suno/                      # 168 clips from Suno
    │   ├── Suno_angry_1.wav
    │   ├── Suno_calm_1.wav
    │   └── ...
    └── Udio/                      # 168 clips from Udio
        ├── Udio_angry_1.wav
        ├── Udio_calm_1.wav
        └── ...
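
The folder comments above add up to 1,008 clips (6 × 168), while the dataset is described as 991 clips, so counting the files actually present in each folder can be a useful sanity check after download. The following is a minimal sketch, assuming the layout shown above; `count_clips` is a hypothetical helper name, not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_clips(root="music_clips"):
    """Return {model_folder: number_of_wav_files} for each subfolder of root."""
    counts = Counter()
    # Only look one level down: music_clips/<Model>/<file>.wav
    for wav in Path(root).glob("*/*.wav"):
        counts[wav.parent.name] += 1
    return dict(counts)


# Example usage (run from the repository root):
#   counts = count_clips()
#   print(counts, sum(counts.values()))
```

Comparing `sum(counts.values())` with the number of rows in clips_metadata.csv shows whether the local copy is complete.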

Dataset Details

Models Evaluated

  • AudioLDM 2: open-source
  • MusicGen: open-source
  • Mustango: open-source
  • Stable Audio Open v1.0: open-source
  • Suno v4.5: commercial
  • Udio v1.5 Allegro: commercial

Emotions Covered

The dataset includes 12 distinct emotions, three in each quadrant of the valence-arousal space:

  • High Valence, High Arousal: happy, excited, energetic
  • Low Valence, High Arousal: angry, anxious, scared
  • Low Valence, Low Arousal: sad, gloomy, dull
  • High Valence, Low Arousal: relaxed, calm, tranquil

File Naming Convention

Audio files follow the pattern: {Model}_{Emotion}_{Index}.wav

  • Model: One of the six TTM models
  • Emotion: Target emotion for generation
  • Index: Clip number (1-14) for that model-emotion combination
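
As an illustration of the naming pattern, the sketch below splits a filename back into its three parts. It assumes that neither the model folder names nor the emotion labels contain underscores, which holds for the names listed in this README; `ClipName` and `parse_clip_name` are hypothetical names introduced here.

```python
import re
from dataclasses import dataclass

# {Model}_{Emotion}_{Index}.wav, with no underscores inside Model or Emotion
# (an assumption based on the names listed in this README).
CLIP_NAME = re.compile(r"^(?P<model>[^_]+)_(?P<emotion>[^_]+)_(?P<index>\d+)\.wav$")


@dataclass(frozen=True)
class ClipName:
    model: str
    emotion: str
    index: int


def parse_clip_name(filename: str) -> ClipName:
    """Split a clip filename into model, emotion, and integer index."""
    match = CLIP_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected clip filename: {filename!r}")
    return ClipName(match["model"], match["emotion"], int(match["index"]))


clip = parse_clip_name("AudioLDM2_angry_1.wav")
print(clip.model, clip.emotion, clip.index)
```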

Data Files

clips_metadata.csv

This is the main dataset file containing human ratings and metadata for all 991 clips.

| Column | Type | Description |
| --- | --- | --- |
| index | integer | Unique identifier for each clip (1-991) |
| audio_file | string | Filename of the audio clip |
| model | string | TTM model that generated the clip |
| emotion | string | Target emotion intent used for generation |
| file_index | integer | Index number for this emotion-model combination (1-14) |
| valence | float | Average valence rating from human annotators (1-9 scale) |
| arousal | float | Average arousal rating from human annotators (1-9 scale) |
| num_annotators | integer | Number of human annotators who rated this clip |

emotion_intents_metadata.csv

This file contains the reference valence-arousal coordinates for each emotion, establishing the intended emotional targets.

| Column | Type | Description |
| --- | --- | --- |
| quadrant | string | Valence-arousal quadrant (hv_ha, lv_ha, lv_la, hv_la) |
| emotion | string | Emotion intent |
| gt_valence | float | Reference valence coordinate (1-9 scale) |
| gt_arousal | float | Reference arousal coordinate (1-9 scale) |

Quadrant codes:

  • hv_ha: High Valence, High Arousal
  • lv_ha: Low Valence, High Arousal
  • lv_la: Low Valence, Low Arousal
  • hv_la: High Valence, Low Arousal
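
To make the quadrant codes concrete, here is a small sketch that maps the 12 emotions listed in this README to their codes and assigns a code to an arbitrary valence-arousal pair. The split at the scale midpoint of 5 (with ties counted as low) is an assumption made for illustration; the README does not state how quadrant boundaries are defined. `EMOTION_QUADRANTS` and `quadrant_code` are hypothetical names.

```python
# Emotion-to-quadrant mapping, taken from the "Emotions Covered" list.
EMOTION_QUADRANTS = {
    "happy": "hv_ha", "excited": "hv_ha", "energetic": "hv_ha",
    "angry": "lv_ha", "anxious": "lv_ha", "scared": "lv_ha",
    "sad": "lv_la", "gloomy": "lv_la", "dull": "lv_la",
    "relaxed": "hv_la", "calm": "hv_la", "tranquil": "hv_la",
}


def quadrant_code(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Return a quadrant code for a rating pair on the 1-9 scale.

    Assumption: values strictly above `midpoint` count as high.
    """
    v = "hv" if valence > midpoint else "lv"
    a = "ha" if arousal > midpoint else "la"
    return f"{v}_{a}"


print(quadrant_code(7.2, 6.8))
print(EMOTION_QUADRANTS["calm"])
```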

ratings_metadata.csv

This file contains individual valence and arousal ratings from each human annotator for every audio clip, with anonymized user identifiers.

| Column | Type | Description |
| --- | --- | --- |
| user | string | Anonymized participant identifier (P1, P2, ..., P111) |
| audio_file | string | Filename of the audio clip |
| valence | integer | Valence rating from the annotator (1-9 scale) |
| arousal | integer | Arousal rating from the annotator (1-9 scale) |
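
The per-clip averages in clips_metadata.csv can presumably be recomputed from these individual ratings. The sketch below groups ratings by `audio_file` and takes plain means; whether the published averages involve any extra filtering is not stated here, so treat this as an assumption. `aggregate_ratings` is a hypothetical helper, and the toy rows are invented for illustration only.

```python
import pandas as pd


def aggregate_ratings(ratings_df: pd.DataFrame) -> pd.DataFrame:
    """Average individual ratings per clip and count distinct annotators."""
    return (
        ratings_df.groupby("audio_file")
        .agg(
            valence=("valence", "mean"),
            arousal=("arousal", "mean"),
            num_annotators=("user", "nunique"),
        )
        .reset_index()
    )


# Toy ratings, invented for illustration (not real dataset values).
toy = pd.DataFrame({
    "user": ["P1", "P2", "P1"],
    "audio_file": ["Suno_calm_1.wav", "Suno_calm_1.wav", "Udio_angry_1.wav"],
    "valence": [7, 6, 2],
    "arousal": [3, 2, 8],
})
summary = aggregate_ratings(toy)
print(summary)
```

With the real files, `aggregate_ratings(pd.read_csv('ratings_metadata.csv'))` could be merged with clips_metadata.csv on `audio_file` to check that the two files agree.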

Usage

Loading the Data

import pandas as pd

# Load clip ratings and metadata
clips_df = pd.read_csv('clips_metadata.csv')

# Load individual annotator ratings
ratings_df = pd.read_csv('ratings_metadata.csv')

# Load reference emotion coordinates
emotions_df = pd.read_csv('emotion_intents_metadata.csv')

print(f"Dataset contains {len(clips_df)} clips from {clips_df['model'].nunique()} models")
print(f"Emotions covered: {sorted(clips_df['emotion'].unique())}")
print(f"Individual ratings: {len(ratings_df)} from {ratings_df['user'].nunique()} annotators")
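
One way to compare models is to measure how far each clip's mean rating lies from the reference coordinates of its target emotion. The sketch below joins the clip table with the reference table on `emotion` and adds a Euclidean distance column. The choice of Euclidean distance, the `va_distance` column name, and the `add_intent_distance` helper are assumptions for illustration, not necessarily the metric used in the paper. The toy rows are invented.

```python
import numpy as np
import pandas as pd


def add_intent_distance(clips_df: pd.DataFrame, emotions_df: pd.DataFrame) -> pd.DataFrame:
    """Attach reference coordinates and a valence-arousal distance to each clip."""
    merged = clips_df.merge(
        emotions_df[["emotion", "gt_valence", "gt_arousal"]],
        on="emotion",
        how="left",
        validate="many_to_one",  # each emotion should have exactly one reference row
    )
    merged["va_distance"] = np.hypot(
        merged["valence"] - merged["gt_valence"],
        merged["arousal"] - merged["gt_arousal"],
    )
    return merged


# Toy rows, invented for illustration (not real dataset values).
toy_clips = pd.DataFrame({
    "model": ["Suno", "Udio"],
    "emotion": ["calm", "calm"],
    "valence": [4.0, 7.0],
    "arousal": [6.0, 2.0],
})
toy_refs = pd.DataFrame({
    "quadrant": ["hv_la"],
    "emotion": ["calm"],
    "gt_valence": [7.0],
    "gt_arousal": [2.0],
})
scored = add_intent_distance(toy_clips, toy_refs)
print(scored.groupby("model")["va_distance"].mean())
```

With the real data, `add_intent_distance(clips_df, emotions_df)` gives a per-clip distance that can be averaged by model or by emotion; smaller values mean ratings closer to the intended emotion.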

License

The AImoclips dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The preprint describing this dataset is currently available on arXiv: AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation.

Contact

If you have any issues with this dataset, please contact:

rotation@kaist.ac.kr
