Baselines On MOSI Dataset


Summary of Baselines on the MOSI Dataset

This post summarizes model performance on the MOSI dataset in chronological order, from older models (proposed in earlier papers) to recently proposed ones.

For the results of these models on MOSEI, see the MOSEI Baseline Summary.

Note: following the publication order of the papers, this post summarizes the baseline results reported in each of these classic papers in turn.

Model: MISA

  • Paper: MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

  • Proposed: 2020-05

  • Venue: first released on arXiv, later accepted at ACM MM
| Models | MAE ($\downarrow$) | Corr ($\uparrow$) | Acc-2 ($\uparrow$) | F-Score ($\uparrow$) | Acc-7 ($\uparrow$) |
| --- | --- | --- | --- | --- | --- |
| BC-LSTM | 1.079 | 0.581 | 73.9 / - | 73.9 / - | 28.7 |
| MV-LSTM | 1.019 | 0.601 | 73.9 / - | 74.0 / - | 33.2 |
| TFN | 0.970 | 0.633 | 73.9 / - | 73.4 / - | 32.1 |
| MARN | 0.968 | 0.625 | 77.1 / - | 77.0 / - | 34.7 |
| MFN | 0.965 | 0.632 | 77.4 / - | 77.3 / - | 34.1 |
| LMF | 0.912 | 0.668 | 76.4 / - | 75.7 / - | 32.8 |
| CH-Fusion | - | - | 80.0 / - | - | - |
| MFM | 0.951 | 0.662 | 78.1 / - | 78.1 / - | 36.2 |
| RAVEN | 0.915 | 0.691 | 78.0 / - | 76.6 / - | 33.2 |
| RMFN | 0.922 | 0.681 | 78.4 / - | 78.0 / - | *38.3* |
| MCTN | 0.909 | 0.676 | 79.3 / - | 79.1 / - | 35.6 |
| CIA | 0.914 | 0.689 | 79.8 / - | - / 79.5 | 38.9 |
| HFFN | - | - | - / 80.2 | - / 80.3 | - |
| LMFN | - | - | - / 80.9 | - / 80.9 | - |
| ARGF | - | - | - / 81.4 | - / 81.5 | - |
| MulT | 0.871 | 0.698 | - / 83.0 | - / 82.8 | 40.0 |
| TFN (B) | 0.901 | 0.698 | - / 80.8 | - / 80.7 | 34.9 |
| LMF (B) | 0.917 | 0.695 | - / 82.5 | - / 82.4 | 33.2 |
| MFM (B) | 0.877 | 0.706 | - / 81.7 | - / 81.6 | 35.4 |
| ICCN (B) | 0.860 | 0.710 | - / 83.0 | - / 83.0 | 39.0 |
| MISA (B) | 0.783 | 0.761 | 81.8† / 83.4† | 81.7 / 83.6 | 42.3 |

Period One

During this period, multimodal problems were gradually coming into view, but the core methods proposed were mostly simple extensions of existing unimodal approaches. Many late-fusion models appeared in this period.

  • BC-LSTM & MV-LSTM are among the earliest attempts at multimodal modeling; by now they have largely lost their value as comparison baselines.
    • BC-LSTM: Context-Dependent Sentiment Analysis in User-Generated Videos. 2017 ACL

Period Two

Starting with TFN, methods designed specifically for multimodal learning began to appear in quick succession.

  • TFN: Tensor Fusion Network for Multimodal Sentiment Analysis. 2017 EMNLP
    • LMF is an efficiency-oriented optimization of TFN; its performance is theoretically equivalent to TFN's.
  • MARN: Multi-attention Recurrent Network for Human Communication Comprehension. 2018 AAAI
  • MFN: Memory Fusion Network for Multi-view Sequential Learning. 2018 AAAI
  • CH-Fusion: Multimodal sentiment analysis using hierarchical fusion with context modeling.
    • A strong baseline which performs hierarchical fusion by composing bi-modal interactions followed by tri-modal fusion. 2018 Knowledge-Based Systems
  • MFM: Learning Factorized Multimodal Representations. 2019 ICLR
  • RAVEN: Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors
    • The original paper only reports the MAE, Corr, and Acc-2 metrics (the numbers here match the original paper). 2019 AAAI
  • RMFN: Multimodal language analysis with recurrent multistage fusion. 2018 EMNLP
  • MCTN: Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. 2018 AAAI
  • CIA: Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis. 2019 EMNLP-IJCNLP
  • HFFN / LMFN / ARGF: these three models rarely appear in other papers, possibly because they report only Acc-2 and F1 rather than the full set of metrics.
    • ARGF: Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion.
    • HFFN: Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing. 2019 ACL
    • LMFN: Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing. 2020 IEEE Trans
  • MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences
    • The results match those reported in the original paper on the word-aligned MOSI data. 2019 ACL
  • TFN (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. 2019 arXiv
    • The MAG_BERT paper also uses TFN (B), with somewhat different results (the TFN (B) numbers reported in MAG_BERT are not as good as these).
  • LMF (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. 2019 arXiv
  • MFM (B) / ICCN (B): same as above, but I have not yet read the source papers.
    • MFM source paper: Learning Factorized Multimodal Representations. ICLR 2019
    • ICCN source paper: not yet located (I will follow up when I have time).
  • MISA (B): the new model proposed in this paper. There is a lot worth borrowing from it; in terms of implementation, the way its custom losses are defined deserves particular attention (see the sketch below).
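
Since the last item points at MISA's custom losses, here is a minimal sketch of one of them: a soft orthogonality (difference) loss that pushes the modality-invariant (shared) and modality-specific (private) representations into different subspaces. The function name, tensor shapes, and the weighted sum in the usage comment are my own assumptions; the actual MISA objective additionally includes similarity (CMD) and reconstruction terms.

```python
import torch

def soft_orthogonality_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality (difference) loss between two batches of vectors.

    Sketch of the MISA-style constraint: penalize correlation between the
    zero-centred, L2-normalized shared and private representations of a
    modality. Shapes are assumed to be (batch, hidden).
    """
    shared = shared - shared.mean(dim=0, keepdim=True)
    private = private - private.mean(dim=0, keepdim=True)
    shared = torch.nn.functional.normalize(shared, p=2, dim=1)
    private = torch.nn.functional.normalize(private, p=2, dim=1)
    corr = shared.t() @ private          # (hidden, hidden) cross-correlation matrix
    return (corr ** 2).sum()             # squared Frobenius norm

# Hypothetical usage: sum the loss over the three modalities and add it, weighted,
# to the task loss (MISA also adds similarity and reconstruction terms).
# total_loss = task_loss + beta * sum(
#     soft_orthogonality_loss(h_shared[m], h_private[m]) for m in ("t", "a", "v"))
```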

Evaluation Metrics

  • Evaluation on the MOSI and MOSEI datasets includes the classic regression metrics: mean absolute error (MAE) and Pearson correlation (Corr).

  • Classification accuracy:

    • Seven-class accuracy (Acc-7), over sentiment classes ranging from −3 to 3.

    • Binary accuracy (Acc-2) and F-Score.

      Note: there are two different ways to define binary accuracy: 1. negative / non-negative classification; 2. recent work tends to prefer the more precise definition of negative / positive classes.

      When a binary accuracy or F1 value is written in the form xx / yy, the left value (xx) follows the negative / non-negative definition and the right value (yy) follows the negative / positive definition. A sketch for computing these metrics follows below.
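
As a concrete reference for how these numbers are usually computed, here is a minimal sketch (my own, not taken from any of the papers; exact rounding and thresholding conventions vary slightly between papers) that derives MAE, Corr, Acc-7, and both Acc-2 variants from continuous sentiment scores in [−3, 3].

```python
import numpy as np

def mosi_metrics(y_pred: np.ndarray, y_true: np.ndarray) -> dict:
    """Compute common MOSI/MOSEI metrics from continuous sentiment scores in [-3, 3]."""
    mae = np.mean(np.abs(y_pred - y_true))
    corr = np.corrcoef(y_pred, y_true)[0, 1]

    # Acc-7: round to the nearest integer class in {-3, ..., 3}
    acc7 = np.mean(np.round(np.clip(y_pred, -3, 3)) == np.round(np.clip(y_true, -3, 3)))

    # Acc-2, definition 1: negative vs non-negative (zero counts as non-negative)
    acc2_nonneg = np.mean((y_pred >= 0) == (y_true >= 0))

    # Acc-2, definition 2: negative vs positive, excluding utterances whose true score is zero
    nonzero = y_true != 0
    acc2_pos = np.mean((y_pred[nonzero] > 0) == (y_true[nonzero] > 0))

    return {"MAE": mae, "Corr": corr, "Acc-7": acc7,
            "Acc-2 (neg/non-neg)": acc2_nonneg, "Acc-2 (neg/pos)": acc2_pos}
```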

Previous Models

The literature in MSA can be broadly classified into 1) utterance-level and 2) inter-utterance contextual models. While utterance-level algorithms consider a target utterance in isolation, contextual algorithms utilize neighboring utterances from the overall video.

Utterance-level baselines include:

Proposed works in this category have primarily focused on learning cross-modal dynamics using sophisticated fusion mechanisms.

  • Networks which perform temporal modeling and fusion of utterances: MFN, MARN, MV-LSTM, RMFN.
  • Models which utilize attention and transformer modules to improve token representations using non-verbal signals: RAVEN, MulT.
  • Graph-based fusion models: Graph-MFN.
  • Utterance-vector fusion approaches that use tensor-based fusion and low-rank variants: TFN, LMF, LMFN, HFFN (a fusion sketch follows this list).
  • Common subspace learning models that use cyclic translations (MCTN), adversarial auto-encoders (ARGF), and generative-discriminative factorized representations (MFM).
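
To make the tensor-based fusion and its low-rank variant concrete, below is a minimal sketch in the spirit of TFN and LMF. The shapes, the helper for appending the constant 1, and the rank hyperparameter are my own assumptions rather than the authors' code.

```python
import torch

def tfn_fusion(z_t: torch.Tensor, z_a: torch.Tensor, z_v: torch.Tensor) -> torch.Tensor:
    """TFN-style fusion: append a constant 1 to each unimodal vector and take the
    3-way outer product, so unimodal, bimodal, and trimodal interactions all appear
    in the fused tensor. Shapes assumed to be (batch, d_m)."""
    ones = lambda z: torch.cat([z, torch.ones_like(z[:, :1])], dim=1)
    z_t, z_a, z_v = ones(z_t), ones(z_a), ones(z_v)
    fused = torch.einsum("bi,bj,bk->bijk", z_t, z_a, z_v)   # (batch, d_t+1, d_a+1, d_v+1)
    return fused.flatten(start_dim=1)                        # fed to a downstream classifier

class LowRankFusion(torch.nn.Module):
    """LMF-style sketch: instead of materializing the full fusion tensor, project each
    modality to rank * d_out, combine by elementwise product, and sum over rank factors,
    keeping the cost linear in the number of modalities."""
    def __init__(self, d_t: int, d_a: int, d_v: int, d_out: int, rank: int = 4):
        super().__init__()
        self.rank, self.d_out = rank, d_out
        self.proj_t = torch.nn.Linear(d_t + 1, rank * d_out)
        self.proj_a = torch.nn.Linear(d_a + 1, rank * d_out)
        self.proj_v = torch.nn.Linear(d_v + 1, rank * d_out)

    def forward(self, z_t, z_a, z_v):
        ones = lambda z: torch.cat([z, torch.ones_like(z[:, :1])], dim=1)
        h = self.proj_t(ones(z_t)) * self.proj_a(ones(z_a)) * self.proj_v(ones(z_v))
        return h.view(-1, self.rank, self.d_out).sum(dim=1)  # sum over rank factors
```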

Inter-utterance contextual baselines include:

These models utilize the context from surrounding utterances of the target utterance.

  • RNN-based models: BC-LSTM, and its hierarchical-fusion counterpart CH-Fusion.
  • Inter-utterance attention and multi-tasking models: CIA, CIM-MTL, DFF-ATMF.

State of the Art

For the task of Multimodal Sentiment Analysis, the Interaction Canonical Correlation Network (ICCN) stands as the state-of-the-art (SOTA) model on both MOSI and MOSEI. ICCN first extracts features from the audio and video modalities, and then fuses them with text embeddings to get two outer products, text-audio and text-video. Finally, the outer products are fed to a Canonical Correlation Analysis (CCA) network, whose output is used for prediction.
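
A rough, simplified sketch of that pipeline is given below. The module name, dimensions, and especially the cosine-similarity stand-in for the full CCA/DCCA objective are my own assumptions, not the ICCN authors' implementation.

```python
import torch

class ICCNStyleFusion(torch.nn.Module):
    """Sketch of the ICCN idea described above: take outer products of the text
    embedding with the audio and video features, project each product with a small
    network, and train the two projections to be maximally correlated. A full
    CCA/DCCA objective is more involved; a simple negative mean cosine similarity
    stands in for it here."""
    def __init__(self, d_t: int, d_a: int, d_v: int, d_out: int = 64):
        super().__init__()
        self.net_ta = torch.nn.Sequential(torch.nn.Linear(d_t * d_a, d_out), torch.nn.ReLU(),
                                          torch.nn.Linear(d_out, d_out))
        self.net_tv = torch.nn.Sequential(torch.nn.Linear(d_t * d_v, d_out), torch.nn.ReLU(),
                                          torch.nn.Linear(d_out, d_out))

    def forward(self, z_t, z_a, z_v):
        # Outer products text-audio and text-video, flattened per example
        ta = torch.einsum("bi,bj->bij", z_t, z_a).flatten(start_dim=1)
        tv = torch.einsum("bi,bj->bij", z_t, z_v).flatten(start_dim=1)
        h_ta, h_tv = self.net_ta(ta), self.net_tv(tv)
        # Stand-in for the CCA objective: encourage the two views to correlate
        cca_like_loss = -torch.nn.functional.cosine_similarity(h_ta, h_tv, dim=1).mean()
        fused = torch.cat([h_ta, h_tv], dim=1)   # used for the downstream prediction
        return fused, cca_like_loss
```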

For Multimodal Humor Detection, the SOTA is the Contextual Memory Fusion Network (C-MFN), which extends the MFN model by proposing uni- and multimodal context networks that take preceding utterances into consideration, and performs fusion using the MFN model as its backbone. Originally, MFN is a multi-view gated memory network that stores intra- and cross-modal utterance interactions in its memories.
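
For intuition about the gated multi-view memory that MFN (and therefore C-MFN) is built around, here is a heavily simplified single-step sketch. All names and dimensions are my own assumptions, and the real model also includes a Delta-memory attention step over consecutive time steps.

```python
import torch

class GatedMultiViewMemory(torch.nn.Module):
    """Simplified sketch of a gated multi-view memory update: at each time step the
    hidden states of the per-modality sequence models are concatenated, a candidate
    cross-view memory is proposed, and retain/update gates decide how much of the old
    memory to keep."""
    def __init__(self, d_views: int, d_mem: int):
        super().__init__()
        self.candidate = torch.nn.Linear(d_views, d_mem)
        self.retain_gate = torch.nn.Linear(d_views, d_mem)
        self.update_gate = torch.nn.Linear(d_views, d_mem)

    def forward(self, h_views: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # h_views: (batch, d_views) concatenation of per-modality hidden states
        cand = torch.tanh(self.candidate(h_views))
        g_retain = torch.sigmoid(self.retain_gate(h_views))
        g_update = torch.sigmoid(self.update_gate(h_views))
        return g_retain * memory + g_update * cand
```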

