MOSI Dataset Baseline Summary
This post summarizes model performance on the MOSI dataset along a timeline, from older models (proposed in earlier papers) to the most recently proposed ones.
For each model's results on MOSEI, see the MOSEI Baseline Summary.
Note: The baselines are summarized paper by paper, in the chronological order of the classic papers that report them.
Model: MISA
Paper: MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Proposed: 2020-05
- Venue: first posted on arXiv, later accepted to ACM MM.
Models | MAE ($\downarrow$) | Corr ($\uparrow$) | Acc-2 ($\uparrow$) | F-Score ($\uparrow$) | Acc-7 ($\uparrow$) |
---|---|---|---|---|---|
BC-LSTM | 1.079 | 0.581 | 73.9 / - | 73.9 / - | 28.7 |
MV-LSTM | 1.019 | 0.601 | 73.9 / - | 74.0 / - | 33.2 |
TFN | 0.970 | 0.633 | 73.9 / - | 73.4 / - | 32.1 |
MARN | 0.968 | 0.625 | 77.1 / - | 77.0 / - | 34.7 |
MFN | 0.965 | 0.632 | 77.4 / - | 77.3 / - | 34.1 |
LMF | 0.912 | 0.668 | 76.4 / - | 75.7 / - | 32.8 |
CH-Fusion | - | - | 80.0 / - | - | - |
MFM | 0.951 | 0.662 | 78.1 / - | 78.1 / - | 36.2 |
RAVEN | 0.915 | 0.691 | 78.0 / - | 76.6 / - | 33.2 |
RMFN | 0.922 | 0.681 | 78.4 / - | 78.0 / - | *38.3* |
MCTN | 0.909 | 0.676 | 79.3 / - | 79.1 / - | 35.6 |
CIA | 0.914 | 0.689 | 79.8 / - | - / 79.5 | 38.9 |
HFFN | - | - | - / 80.2 | - / 80.3 | - |
LMFN | - | - | - / 80.9 | - / 80.9 | - |
ARGF | - | - | - / 81.4 | - / 81.5 | - |
MulT | 0.871 | 0.698 | - / 83.0 | - / 82.8 | 40.0 |
TFN (B) | 0.901 | 0.698 | - / 80.8 | - / 80.7 | 34.9 |
LMF (B) | 0.917 | 0.695 | - / 82.5 | - / 82.4 | 33.2 |
MFM (B) | 0.877 | 0.706 | - / 81.7 | - / 81.6 | 35.4 |
ICCN (B) | 0.860 | 0.710 | - / 83.0 | - / 83.0 | 39.0 |
MISA (B) | 0.783 | 0.761 | 81.8† / 83.4† | 81.7 / 83.6 | 42.3 |
Period One
In this period, multimodal problems were just entering the research spotlight, but the core methods proposed were mostly straightforward extensions of existing unimodal techniques. Many late-fusion models date from this period.
- BC-LSTM & MV-LSTM were the earliest attempts at multimodal modeling; by now they have lost their value as comparison points.
    - BC-LSTM: Context-Dependent Sentiment Analysis in User-Generated Videos. ACL 2017
Period Two
Starting with TFN, methods designed specifically for multimodal fusion began to appear one after another.
- TFN: Tensor Fusion Network for Multimodal Sentiment Analysis. EMNLP 2017
- LMF is an efficiency-oriented optimization of TFN; in theory its performance is equivalent to TFN's (see the fusion sketch after this list).
- MARN: Multi-attention Recurrent Network for Human Communication Comprehension. AAAI 2018
- MFN: Memory Fusion Network for Multi-view Sequential Learning. AAAI 2018
- CH-Fusion: Multimodal Sentiment Analysis Using Hierarchical Fusion with Context Modeling. Knowledge-Based Systems 2018
    - A strong baseline that performs hierarchical fusion by composing bi-modal interactions followed by tri-modal fusion.
- MFM: Learning Factorized Multimodal Representations. ICLR 2019
- RAVEN: Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. AAAI 2019
    - The original paper only reports MAE, Corr, and Acc-2 (the figures here come from that same paper).
- RMFN: Multimodal Language Analysis with Recurrent Multistage Fusion. EMNLP 2018
- MCTN: Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. AAAI 2019
- CIA: Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis. EMNLP-IJCNLP 2019
- HFFN / LMFN / ARGF rarely appear in other papers, likely because they report an incomplete set of metrics: only Acc-2 and F1.
    - ARGF: Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion.
    - HFFN: Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing. ACL 2019
    - LMFN: Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing. IEEE Trans. 2020
- MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences. ACL 2019
    - The numbers match those reported in the original paper on the aligned MOSI data.
- TFN (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. arXiv 2019
    - The MAG-BERT paper also uses TFN (B), with somewhat different results (the TFN (B) numbers reported there are not as good).
- LMF (B): Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis. arXiv 2019
    - Original LMF paper (an improvement on TFN): Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. ACL 2018
    - Implementation: https://github.com/Justin1904/Low-rank-Multimodal-Fusion (also included in our MMSA repo)
- MFM (B) / ICCN (B): same source as above, but I have not read the original papers yet.
    - MFM original paper: Learning Factorized Multimodal Representations. ICLR 2019
    - ICCN original paper: not located yet (I will follow up when time allows).
- MISA (B): the new model proposed in this paper. There is plenty worth borrowing; implementation-wise, the way the custom losses are implemented deserves particular attention (a loss sketch follows below).
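Since LMF's relationship to TFN is easiest to see in code, here is a minimal PyTorch sketch of the two fusion styles (referenced from the LMF bullet above). Function and class names, shapes, and the rank value are illustrative assumptions, not the reference implementations:

```python
import torch

def append_one(x):
    # Both TFN and LMF append a constant 1 to each modality vector so
    # that unimodal and bimodal terms survive inside the outer product.
    return torch.cat([x, torch.ones_like(x[:, :1])], dim=1)

def tfn_fuse(zt, za, zv):
    # TFN: explicit 3-way outer product; the fused feature has size
    # (dt+1)*(da+1)*(dv+1), which blows up cubically.
    zt, za, zv = append_one(zt), append_one(za), append_one(zv)
    return torch.einsum('bi,bj,bk->bijk', zt, za, zv).flatten(1)

class LMFFuse(torch.nn.Module):
    # LMF: factorize the (implicit) fusion weight tensor into
    # per-modality rank-R factors; the full outer product is never built.
    def __init__(self, dt, da, dv, d_out, rank=4):
        super().__init__()
        self.Wt = torch.nn.Parameter(0.1 * torch.randn(rank, dt + 1, d_out))
        self.Wa = torch.nn.Parameter(0.1 * torch.randn(rank, da + 1, d_out))
        self.Wv = torch.nn.Parameter(0.1 * torch.randn(rank, dv + 1, d_out))

    def forward(self, zt, za, zv):
        zt, za, zv = append_one(zt), append_one(za), append_one(zv)
        pt = torch.einsum('bi,rio->rbo', zt, self.Wt)
        pa = torch.einsum('bj,rjo->rbo', za, self.Wa)
        pv = torch.einsum('bk,rko->rbo', zv, self.Wv)
        # Element-wise product across modalities, summed over the rank dim.
        return (pt * pa * pv).sum(dim=0)
```

With a large enough rank the low-rank factors can represent the same fusion weight tensor that TFN applies, which is the sense in which the two are theoretically equivalent.

As for MISA's custom losses: the paper combines the task loss with a similarity (CMD) term, a difference (orthogonality) term, and a reconstruction term. Below is a sketch of the difference loss only, assuming `h1` and `h2` are batches of modality-invariant vs. modality-specific representations; the centering and normalization details follow common practice rather than necessarily the official code:

```python
import torch.nn.functional as F

def diff_loss(h1, h2):
    # Soft orthogonality: penalize the squared Frobenius norm of the
    # correlation matrix between the two (centered, normalized) batches.
    h1 = F.normalize(h1 - h1.mean(dim=0, keepdim=True), p=2, dim=1)
    h2 = F.normalize(h2 - h2.mean(dim=0, keepdim=True), p=2, dim=1)
    return (h1.t() @ h2).pow(2).sum()
```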
Evaluation Metrics
Evaluation on the MOSI and MOSEI datasets uses the classic regression metrics: mean absolute error (MAE) and Pearson correlation (Corr).
Classification accuracy:
- Seven-class accuracy (Acc-7) over sentiment scores ranging from −3 to 3.
- Binary accuracy (Acc-2) and F-Score.
Note: There are two different definitions of binary accuracy: 1. negative / non-negative classification; 2. recent work tends to use the more precise negative / positive definition.
Where Acc-2 or F1 values appear in the form xx / yy, the left value (xx) follows the negative / non-negative definition and the right value (yy) follows the negative / positive definition.
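Concretely, given regression predictions and gold scores in [−3, 3], the metrics above can be computed roughly as follows. This is a sketch of the common convention (rounding for Acc-7, dropping zero labels for the negative / positive variant); individual papers may differ in edge-case handling:

```python
import numpy as np

def mosi_metrics(preds, labels):
    """preds / labels: 1-D float arrays of sentiment scores in [-3, 3]."""
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    # Acc-7: round into the seven integer classes -3..3.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3) ==
                   np.clip(np.round(labels), -3, 3))
    # Acc-2, older definition: negative vs. non-negative (zeros kept).
    acc2_nonneg = np.mean((preds >= 0) == (labels >= 0))
    # Acc-2, newer definition: negative vs. positive (zero labels dropped).
    nonzero = labels != 0
    acc2_pos = np.mean((preds[nonzero] > 0) == (labels[nonzero] > 0))
    return dict(MAE=mae, Corr=corr, Acc7=acc7,
                Acc2_nonneg=acc2_nonneg, Acc2_pos=acc2_pos)
```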
Previous Models
The literature in MSA can be broadly classified into 1) utterance-level and 2) inter-utterance contextual models. While utterance-level algorithms consider a target utterance in isolation, contextual algorithms utilize neighboring utterances from the overall video.
Utterance-level baselines include:
Proposed works in this category have primarily focused on learning cross-modal dynamics using sophisticated fusion mechanisms.
- Networks which perform temporal modeling and fusion of utterances: MFN, MARN, MV-LSTM, RMFN.
- Models which utilize attention and transformer modules to improve token representations using non-verbal signals: RAVEN, MulT.
- Graph-based fusion models: Graph-MFN.
- Utterance-vector fusion approaches that use tensor-based fusion and low-rank variants: TFN, LMF, LMFN, HFFN.
- Common subspace learning models that use cyclic translations (MCTN), adversarial auto-encoders (ARGF), and generative-discriminative factorized representations (MFM).
Inter-utterance contextual baselines include:
These models utilize the context from surrounding utterances of the target utterance.
- RNN-based models: BC-LSTM, with hierarchical fusion CH-Fusion.
- Inter-utterance attention and multi-tasking models: CIA, CIM-MTL, DFF-ATMF.
State of the Art
For the task of Multimodal Sentiment Analysis, the Interaction Canonical Correlation Network (ICCN) stands as the state-of-the-art (SOTA) model on both MOSI and MOSEI. ICCN first extracts features from the audio and video modalities, then fuses each with text embeddings to get two outer products, text-audio and text-video. Finally, the outer products are fed to a Canonical Correlation Analysis (CCA) network, whose output is used for prediction.
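The fusion step described above is easy to picture in code. A minimal sketch of the two outer products follows; the deep-CCA objective that ICCN trains on top of them is omitted, and the names and shapes here are assumptions for illustration:

```python
import torch

def iccn_outer_products(t, a, v):
    # t: [batch, dt] text embedding; a: [batch, da] audio; v: [batch, dv] video.
    ta = torch.einsum('bi,bj->bij', t, a).flatten(1)  # text-audio product
    tv = torch.einsum('bi,bj->bij', t, v).flatten(1)  # text-video product
    return ta, tv  # each is then fed into the CCA network
```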
For Multimodal Humor Detection, the SOTA is the Contextual Memory Fusion Network (C-MFN), which extends the MFN model by proposing uni- and multimodal context networks that take preceding utterances into consideration and performs fusion using the MFN model as its backbone. Originally, MFN is a multi-view gated memory network that stores intra- and cross-modal utterance interactions in its memories.
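For reference, the multi-view gated memory at MFN's core updates one shared memory per time step via retain/update gates. A rough sketch of a single update step, with layer names and shapes assumed for illustration rather than taken from the official code:

```python
import torch

class GatedMemory(torch.nn.Module):
    def __init__(self, d_in, d_mem):
        super().__init__()
        self.retain = torch.nn.Linear(d_in, d_mem)   # gate on the old memory
        self.update = torch.nn.Linear(d_in, d_mem)   # gate on the new content
        self.propose = torch.nn.Linear(d_in, d_mem)  # proposed memory content

    def forward(self, c, mem):
        # c: cross-view attention output at this step; mem: previous memory.
        g1 = torch.sigmoid(self.retain(c))
        g2 = torch.sigmoid(self.update(c))
        return g1 * mem + g2 * torch.tanh(self.propose(c))
```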