MOSI Dataset Baseline Summary
This post summarizes model performance on the MOSI dataset along a timeline, from older models (proposed in earlier papers) to the most recently proposed ones.
For each model's results on MOSEI, see the MOSEI Baseline Summary.
Note: The baselines are summarized paper by paper, in the chronological order of the classic papers that report them.
Model: MISA
Paper: MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Proposed: 2020-05
- Venue: first posted on arXiv, later accepted to ACM MM.
Models | MAE ($\downarrow$) | Corr ($\uparrow$) | Acc-2 ($\uparrow$) | F-Score ($\uparrow$) | Acc-7 ($\uparrow$) |
---|---|---|---|---|---|
BC-LSTM | 1.079 | 0.581 | 73.9 / - | 73.9 / - | 28.7 |
MV-LSTM | 1.019 | 0.601 | 73.9 / - | 74.0 / - | 33.2 |
TFN | 0.970 | 0.633 | 73.9 / - | 73.4 / - | 32.1 |
MARN | 0.968 | 0.625 | 77.1 / - | 77.0 / - | 34.7 |
MFN | 0.965 | 0.632 | 77.4 / - | 77.3 / - | 34.1 |
LMF | 0.912 | 0.668 | 76.4 / - | 75.7 / - | 32.8 |
CH-Fusion | - | - | 80.0 / - | - | - |
MFM | 0.951 | 0.662 | 78.1 / - | 78.1 / - | 36.2 |
RAVEN | 0.915 | 0.691 | 78.0 / - | 76.6 / - | 33.2 |
RMFN | 0.922 | 0.681 | 78.4 / - | 78.0 / - | *38.3* |
MCTN | 0.909 | 0.676 | 79.3 / - | 79.1 / - | 35.6 |
CIA | 0.914 | 0.689 | 79.8 / - | - / 79.5 | 38.9 |
HFFN | - | - | - / 80.2 | - / 80.3 | - |
LMFN | - | - | - / 80.9 | - / 80.9 | - |
ARGF | - | - | - / 81.4 | - / 81.5 | - |
MulT | 0.871 | 0.698 | - / 83.0 | - / 82.8 | 40.0 |
TFN (B) | 0.901 | 0.698 | - / 80.8 | - / 80.7 | 34.9 |
LMF (B) | 0.917 | 0.695 | - / 82.5 | - / 82.4 | 33.2 |
MFM (B) | 0.877 | 0.706 | - / 81.7 | - / 81.6 | 35.4 |
ICCN (B) | 0.860 | 0.710 | - / 83.0 | - / 83.0 | 39.0 |
MISA (B) | 0.783 | 0.761 | 81.8† / 83.4† | 81.7 / 83.6 | 42.3 |
Period One
In this period, multimodal problems were just entering the research spotlight, but the core methods proposed were mostly straightforward extensions of existing unimodal techniques. Many late-fusion models date from this period.
- BC-LSTM & MV-LSTM were the earliest attempts at multimodal modeling; by now they have lost their value as comparison points.
    - BC-LSTM: Context-Dependent Sentiment Analysis in User-Generated Videos. ACL 2017
Period Two
Starting with TFN, methods designed specifically for multimodal fusion began to appear one after another.
- TFN: Tensor Fusion Network for Multimodal Sentiment Analysis. EMNLP 2017
- LMF is an efficiency-oriented optimization of TFN; in theory its performance is equivalent to TFN's (see the fusion sketch after this list).
- MARN: Multi-attention Recurrent Network for Human Communication Comprehension. AAAI 2018
- MFN: Memory Fusion Network for Multi-view Sequential Learning. AAAI 2018
- CH-Fusion: Multimodal Sentiment Analysis Using Hierarchical Fusion with Context Modeling. Knowledge-Based Systems 2018
    - A strong baseline that performs hierarchical fusion by composing bi-modal interactions followed by tri-modal fusion.
- MFM: Learning Factorized Multimodal Representations. ICLR 2019
- RAVEN: Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. AAAI 2019
    - The original paper only reports MAE, Corr, and Acc-2 (the figures here come from that same paper).
- RMFN: Multimodal Language Analysis with Recurrent Multistage Fusion. EMNLP 2018
- MCTN: Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. AAAI 2019
- CIA: Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis. EMNLP-IJCNLP 2019
- HFFN / LMFN / ARGF rarely appear in other papers, likely because they report an incomplete set of metrics: only Acc-2 and F1.
    - ARGF: Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion.
    - HFFN: Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing. ACL 2019
    - LMFN: Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing. IEEE Trans. 2020
- MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019
    - The numbers match those reported in the original paper on the aligned MOSI data.
- TFN (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. arXiv 2019
    - The MAG-BERT paper also uses TFN (B), with somewhat different results (the TFN (B) numbers reported there are not as good).
- LMF (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. arXiv 2019
    - Original LMF paper (an improvement on TFN): Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. ACL 2018
    - Implementation: https://github.com/Justin1904/Low-rank-Multimodal-Fusion (also included in our MMSA repo)
- MFM (B) / ICCN (B): same source as above, but I have not read the original papers yet.
    - MFM original paper: Learning Factorized Multimodal Representations. ICLR 2019
    - ICCN original paper: not located yet (I will follow up when time allows).
- MISA (B): the new model proposed in this paper. There is plenty worth borrowing; implementation-wise, the way the custom losses are implemented deserves particular attention (a loss sketch follows below).
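Since LMF's relationship to TFN is easiest to see in code, here is a minimal PyTorch sketch of the two fusion styles (referenced from the LMF bullet above). Function and class names, shapes, and the rank value are illustrative assumptions, not the reference implementations:

```python
import torch

def append_one(x):
    # Both TFN and LMF append a constant 1 to each modality vector so
    # that unimodal and bimodal terms survive inside the outer product.
    return torch.cat([x, torch.ones_like(x[:, :1])], dim=1)

def tfn_fuse(zt, za, zv):
    # TFN: explicit 3-way outer product; the fused feature has size
    # (dt+1)*(da+1)*(dv+1), which blows up cubically.
    zt, za, zv = append_one(zt), append_one(za), append_one(zv)
    return torch.einsum('bi,bj,bk->bijk', zt, za, zv).flatten(1)

class LMFFuse(torch.nn.Module):
    # LMF: factorize the (implicit) fusion weight tensor into
    # per-modality rank-R factors; the full outer product is never built.
    def __init__(self, dt, da, dv, d_out, rank=4):
        super().__init__()
        self.Wt = torch.nn.Parameter(0.1 * torch.randn(rank, dt + 1, d_out))
        self.Wa = torch.nn.Parameter(0.1 * torch.randn(rank, da + 1, d_out))
        self.Wv = torch.nn.Parameter(0.1 * torch.randn(rank, dv + 1, d_out))

    def forward(self, zt, za, zv):
        zt, za, zv = append_one(zt), append_one(za), append_one(zv)
        pt = torch.einsum('bi,rio->rbo', zt, self.Wt)
        pa = torch.einsum('bj,rjo->rbo', za, self.Wa)
        pv = torch.einsum('bk,rko->rbo', zv, self.Wv)
        # Element-wise product across modalities, summed over the rank dim.
        return (pt * pa * pv).sum(dim=0)
```

With a large enough rank the low-rank factors can represent the same fusion weight tensor that TFN applies, which is the sense in which the two are theoretically equivalent.

As for MISA's custom losses: the paper combines the task loss with a similarity (CMD) term, a difference (orthogonality) term, and a reconstruction term. Below is a sketch of the difference loss only, assuming `h1` and `h2` are batches of modality-invariant vs. modality-specific representations; the centering and normalization details follow common practice rather than necessarily the official code:

```python
import torch.nn.functional as F

def diff_loss(h1, h2):
    # Soft orthogonality: penalize the squared Frobenius norm of the
    # correlation matrix between the two (centered, normalized) batches.
    h1 = F.normalize(h1 - h1.mean(dim=0, keepdim=True), p=2, dim=1)
    h2 = F.normalize(h2 - h2.mean(dim=0, keepdim=True), p=2, dim=1)
    return (h1.t() @ h2).pow(2).sum()
```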
Evaluation Metrics
Evaluation on the MOSI and MOSEI datasets uses the classic regression metrics: mean absolute error (MAE) and Pearson correlation (Corr).
Classification accuracy:
- Seven-class accuracy (Acc-7) over sentiment scores ranging from −3 to 3.
- Binary accuracy (Acc-2) and F-Score.
Note: There are two different definitions of binary accuracy: 1. negative / non-negative classification; 2. recent work tends to use the more precise negative / positive definition.
Where Acc-2 or F1 values appear in the form xx / yy, the left value (xx) follows the negative / non-negative definition and the right value (yy) follows the negative / positive definition.
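Concretely, given regression predictions and gold scores in [−3, 3], the metrics above can be computed roughly as follows. This is a sketch of the common convention (rounding for Acc-7, dropping zero labels for the negative / positive variant); individual papers may differ in edge-case handling:

```python
import numpy as np

def mosi_metrics(preds, labels):
    """preds / labels: 1-D float arrays of sentiment scores in [-3, 3]."""
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    # Acc-7: round into the seven integer classes -3..3.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3) ==
                   np.clip(np.round(labels), -3, 3))
    # Acc-2, older definition: negative vs. non-negative (zeros kept).
    acc2_nonneg = np.mean((preds >= 0) == (labels >= 0))
    # Acc-2, newer definition: negative vs. positive (zero labels dropped).
    nonzero = labels != 0
    acc2_pos = np.mean((preds[nonzero] > 0) == (labels[nonzero] > 0))
    return dict(MAE=mae, Corr=corr, Acc7=acc7,
                Acc2_nonneg=acc2_nonneg, Acc2_pos=acc2_pos)
```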
Previous Models
The literature in MSA can be broadly classified into 1) utterance-level and 2) inter-utterance contextual models. While utterance-level algorithms consider a target utterance in isolation, contextual algorithms utilize neighboring utterances from the overall video.
Utterance-level baselines include:
Proposed works in this category have primarily focused on learning cross-modal dynamics using sophisticated fusion mechanisms.
- Networks which perform temporal modeling and fusion of utterances: MFN, MARN, MV-LSTM, RMFN.
- Models which utilize attention and transformer modules to improve token representations using non-verbal signals: RAVEN, MulT.
- Graph-based fusion models: Graph-MFN.
- Utterance-vector fusion approaches that use tensor-based fusion and low-rank variants: TFN, LMF, LMFN, HFFN.
- Common subspace learning models that use cyclic translations (MCTN), adversarial auto-encoders (ARGF), and generative-discriminative factorized representations (MFM).
Inter-utterance contextual baselines include:
These models utilize the context from surrounding utterances of the target utterance.
- RNN-based models: BC-LSTM, with hierarchical fusion CH-Fusion.
- Inter-utterance attention and multi-tasking models: CIA, CIM-MTL, DFF-ATMF.
State of the Art
For the task of Multimodal Sentiment Analysis, the Interaction Canonical Correlation Network (ICCN) stands as the state-of-the-art (SOTA) model on both MOSI and MOSEI. ICCN first extracts features from the audio and video modalities, then fuses each with text embeddings to get two outer products, text-audio and text-video. Finally, the outer products are fed to a Canonical Correlation Analysis (CCA) network, whose output is used for prediction.
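The fusion step described above is easy to picture in code. A minimal sketch of the two outer products follows; the deep-CCA objective that ICCN trains on top of them is omitted, and the names and shapes here are assumptions for illustration:

```python
import torch

def iccn_outer_products(t, a, v):
    # t: [batch, dt] text embedding; a: [batch, da] audio; v: [batch, dv] video.
    ta = torch.einsum('bi,bj->bij', t, a).flatten(1)  # text-audio product
    tv = torch.einsum('bi,bj->bij', t, v).flatten(1)  # text-video product
    return ta, tv  # each is then fed into the CCA network
```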
For Multimodal Humor Detection, the SOTA is the Contextual Memory Fusion Network (C-MFN), which extends the MFN model by proposing uni- and multimodal context networks that take preceding utterances into consideration and performs fusion using the MFN model as its backbone. Originally, MFN is a multi-view gated memory network that stores intra- and cross-modal utterance interactions in its memories.
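For reference, the multi-view gated memory at MFN's core updates one shared memory per time step via retain/update gates. A rough sketch of a single update step, with layer names and shapes assumed for illustration rather than taken from the official code:

```python
import torch

class GatedMemory(torch.nn.Module):
    def __init__(self, d_in, d_mem):
        super().__init__()
        self.retain = torch.nn.Linear(d_in, d_mem)   # gate on the old memory
        self.update = torch.nn.Linear(d_in, d_mem)   # gate on the new content
        self.propose = torch.nn.Linear(d_in, d_mem)  # proposed memory content

    def forward(self, c, mem):
        # c: cross-view attention output at this step; mem: previous memory.
        g1 = torch.sigmoid(self.retain(c))
        g2 = torch.sigmoid(self.update(c))
        return g1 * mem + g2 * torch.tanh(self.propose(c))
```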