Multimodal Language Analysis Datasets

Multimodal Dataset

Summary of the paper: Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

Shortcomings of Previous Datasets

  • Diversity in the training samples:
    • Previously proposed datasets for multimodal language are generally small in size due to difficulties associated with data acquisition and costs of annotations.
    • Diversity in training samples is crucial for comprehensive multimodal language studies because of the complexity of the underlying distribution.
  • Variety in the topics:
    • Models trained on only a few topics generalize poorly, because language and nonverbal behaviors tend to change with the impression the topic makes on the speaker's internal mental state.
  • Diversity of speakers:
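    • Speaking styles are highly idiosyncratic, so models trained on only a few speakers risk learning speaker identity rather than a generalizable model of multimodal language.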
  • Variety in annotations:
    • Having multiple labels to predict allows for studying the relations between labels.
    • A variety of labels also allows for multi-task learning, which has shown excellent performance in past research.
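
To make the multi-task point concrete, here is a minimal sketch of one common way to train on several labels at once: a weighted sum of per-task losses. The function name, the task names, and the weights are illustrative assumptions, not something taken from the paper.

```python
from typing import Dict, Optional


def multitask_loss(task_losses: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-task losses into one scalar via a weighted sum.

    Tasks without an explicit weight get weight 1.0 (an assumption made
    for this sketch, not a rule from the paper).
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss
               for name, loss in task_losses.items())


# Hypothetical loss values for two label types on one batch.
example_losses = {"sentiment": 0.8, "emotion": 0.5}
example_total = multitask_loss(example_losses, {"sentiment": 1.0, "emotion": 0.5})
print(example_total)
```

In practice the weights are hyperparameters; setting a weight to 0 reduces the setup to single-task training on the remaining labels.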

Dataset Comparison


CMU-MOSI 2016

  • 2199 opinion video clips, each annotated with sentiment in the range [-3,3].
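
A continuous sentiment score in the range [-3, 3] is often bucketed into coarser classes for evaluation. The sketch below shows two such mappings. The threshold at 0 and the round-and-clamp rule are assumptions for illustration; papers differ on how they treat a score of exactly 0.

```python
def to_binary(score: float) -> str:
    """Map a sentiment score to a binary label.

    Assumption: scores >= 0 count as positive. Some evaluations
    instead drop neutral (0) clips or treat them separately.
    """
    return "positive" if score >= 0 else "negative"


def to_seven_class(score: float) -> int:
    """Round a score to the nearest integer and clamp it to [-3, 3]."""
    return max(-3, min(3, round(score)))


# Hypothetical scores, not taken from any dataset.
for s in (-2.4, -0.4, 1.2, 2.6):
    print(s, to_binary(s), to_seven_class(s))
```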

(*) CMU-MOSEI (this paper)

  • contains 23,453 annotated video segments from 1,000 distinct speakers and 250 topics.
  • contains manual transcriptions aligned with the audio at the phoneme level.
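
For a rough sense of scale, the counts above can be turned into averages. This is only arithmetic on the reported totals; the actual distribution of segments across speakers and topics is not given here and may be far from uniform.

```python
# Totals as reported for CMU-MOSEI in the summary above.
NUM_SEGMENTS = 23_453
NUM_SPEAKERS = 1_000
NUM_TOPICS = 250

# Simple averages; they say nothing about how uneven the split is.
segments_per_speaker = NUM_SEGMENTS / NUM_SPEAKERS
segments_per_topic = NUM_SEGMENTS / NUM_TOPICS

print(f"{segments_per_speaker:.1f} segments per speaker on average")
print(f"{segments_per_topic:.1f} segments per topic on average")
```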


ICT-MMMO 2013

  • consists of online social review videos annotated at the video level for sentiment.

YouTube 2011

  • contains videos from YouTube that span a wide range of product reviews and opinion videos.

MOUD 2013

  • consists of product review videos in Spanish. Each video consists of multiple segments labeled to display positive, negative or neutral sentiment.


IEMOCAP 2008

  • consists of 151 videos of recorded dialogues, with 2 speakers per session, for a total of 302 videos across the dataset.
  • each segment is annotated for the presence of 9 emotions, as well as valence, arousal, and dominance.

Common Baselines

  • MFN (Memory Fusion Network, 2018)
  • MARN (Multi-attention Recurrent Network, 2018b)
  • TFN (Tensor Fusion Network, 2017)
  • MV-LSTM (Multi-View LSTM, 2016)
  • EF-LSTM (Early Fusion LSTM, 2013)
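
Two of the baselines above differ mainly in how they combine the language, visual, and acoustic features, which a small sketch can illustrate. Early fusion concatenates the per-modality vectors, while tensor fusion takes a three-way outer product of the vectors after appending a constant 1 to each. This is a simplified, pure-Python illustration of the fusion step only; the function names and toy inputs are assumptions, and the real models wrap these steps in learned networks (for example, EF-LSTM feeds the concatenated vector into an LSTM at each time step).

```python
from typing import List

Vector = List[float]


def early_fusion(language: Vector, visual: Vector, acoustic: Vector) -> Vector:
    """Concatenate the three modality vectors into one input vector."""
    return language + visual + acoustic


def tensor_fusion(language: Vector, visual: Vector,
                  acoustic: Vector) -> List[List[Vector]]:
    """Three-way outer product of the vectors, each extended with a 1.

    The appended 1 keeps the unimodal and bimodal terms in the result
    alongside the trimodal products.
    """
    l_ext = language + [1.0]
    v_ext = visual + [1.0]
    a_ext = acoustic + [1.0]
    return [[[li * vj * ak for ak in a_ext] for vj in v_ext] for li in l_ext]


# Toy feature vectors (hypothetical values).
lang, vis, aco = [0.5, -1.0], [2.0], [3.0]
concatenated = early_fusion(lang, vis, aco)   # length 2 + 1 + 1
fused = tensor_fusion(lang, vis, aco)         # shape (2+1) x (1+1) x (1+1)
print(len(concatenated), len(fused), len(fused[0]), len(fused[0][0]))
```

The outer product grows multiplicatively with the feature sizes, whereas concatenation grows additively, which is the main practical trade-off between the two fusion styles.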

Author: Jason Yuan
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Jason Yuan as the source when reposting.