Time Series Forecasting with Transformer

Paper: Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

（1）Background

The well-known self-attention based Transformer [1] has recently been proposed for sequence modeling, with a remarkable ability to capture long-term dependencies. It still has some weaknesses, however:

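As is well known, the self-attention mechanism in the Transformer effectively sets the distance between any two words in a sentence to 1. The authors tried to patch this after the fact by adding a position encoding to the word encoding, which supplies each word's position. Even so, in time series forecasting, where local information is critical, being insensitive to local context still hurts performance.
The memory used by the canonical Transformer grows with the square of the sequence length L. In time series forecasting the input sequence can be very long, which leads to a memory bottleneck. (My personal view: give it ten years and memory will no longer be a problem.)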

（2）Contributions

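Applies the Transformer to time series forecasting, with extensive experiments showing good results.
Proposes a convolutional self-attention mechanism to produce the Q and K of self-attention, so that query-key matching pays more attention to local information.
Proposes the LogSparse Transformer (space complexity: $O(L\cdot(\log L)^2)$), which makes it usable for forecasting problems at a finer granularity.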
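
To give a feel for where the memory saving comes from, here is a minimal sketch of a log-sparse attention pattern. It reflects my reading of the paper's idea (each cell attends to itself and to earlier cells at exponentially growing distances 1, 2, 4, ...). The function name `logsparse_indices` and the 1-based boundary handling are assumptions made for illustration, not the authors' code.

```python
def logsparse_indices(l):
    """Positions that cell l (1-based) attends to in this sketch:
    itself plus the cells 1, 2, 4, 8, ... steps back in time."""
    indices = {l}
    step = 1
    while l - step >= 1:
        indices.add(l - step)
        step *= 2
    return sorted(indices)

L = 1024
# Causal full attention: cell l looks at all l cells up to and including itself.
full_cells = L * (L + 1) // 2
# Log-sparse attention: cell l looks at roughly log2(l) cells.
sparse_cells = sum(len(logsparse_indices(l)) for l in range(1, L + 1))

print(logsparse_indices(16))   # [8, 12, 14, 15, 16]
print(full_cells)              # 524800
print(sparse_cells < full_cells)
```

In this sketch a single layer stores on the order of $L\log L$ attention scores instead of $L^2$. As I understand the paper, the additional $\log L$ factor in $O(L\cdot(\log L)^2)$ comes from stacking about $\log L$ such layers so that information can still flow between any two cells.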

（3）Problem Definition

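Given $N$ different time series $\{z_{i,1:t_0}\}_{i=1}^N$, where each series is $z_{i,1:t_0} = [z_{i,1}, z_{i,2}, \cdots, z_{i,t_0}]$, the task is to predict the values of the next $\tau$ time steps, $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^N$. In addition, we are given $\{x_{i,1:t_0+\tau}\}_{i=1}^N$, a $d$-dimensional input vector at each time step. The goal is to model the following conditional distribution: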

$p(z_{i,t_0+1:t_0+\tau}\mid z_{i,1:t_0}, x_{i,1:t_0+\tau}; \phi) = \prod_{t=t_0+1}^{t_0+\tau} p(z_{i,t}\mid z_{i,1:t-1}, x_{i,1:t}; \phi) \tag{1}$

The problem thus reduces to predicting the next time step, $p(z_{t}\mid z_{1:t-1},x_{1:t};\phi)$, where $\phi$ denotes the model parameters shared by all time series in the collection.
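
The one-step factorization suggests a simple forecasting loop: predict $z_t$, append it to the history, and repeat $\tau$ times. The sketch below illustrates only that loop. `rollout`, `predict_next`, and `toy_model` are hypothetical names, and the stand-in model returns a point estimate, whereas a real model would output a distribution to sample from.

```python
def rollout(z_obs, x, tau, predict_next):
    """Forecast tau steps by feeding each prediction back in as history.

    z_obs: observed values z_1 .. z_t0
    x: covariates x_1 .. x_{t0+tau} (known for the whole horizon)
    predict_next(z_hist, x_hist): hypothetical one-step model that returns
        an estimate of z_t given z_{1:t-1} and x_{1:t}
    """
    z = list(z_obs)
    t0 = len(z_obs)
    for t in range(t0 + 1, t0 + tau + 1):
        # x[:t] holds x_1 .. x_t; z currently holds z_1 .. z_{t-1}
        z.append(predict_next(z, x[:t]))
    return z[t0:]

def toy_model(z_hist, x_hist):
    """Stand-in for a trained network: last value plus current covariate."""
    return z_hist[-1] + x_hist[-1]

forecast = rollout([1.0, 2.0, 3.0], [0, 0, 0, 1, 1], tau=2, predict_next=toy_model)
print(forecast)  # [4.0, 5.0]
```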

We concatenate the observation $z_{t-1}$ with the covariates $x_t$ to form the model input at time $t$, that is,

$y_t = [z_{t-1} \circ x_t] \in \mathbb{R}^{d+1}, \quad Y_t = [y_1, y_2, \cdots, y_t] \in \mathbb{R}^{t\times (d+1)} \tag{2}$
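
A small sketch of how the rows of $Y_t$ in equation (2) can be assembled from a series and its covariates. The helper name `build_inputs` and the placeholder `z0` for the unobserved $z_0$ are assumptions made for this example.

```python
def build_inputs(z, x, z0=0.0):
    """Return the rows y_t = [z_{t-1}, x_t] for t = 1 .. len(x).

    z: observed values z_1, z_2, ... (at least len(x) - 1 of them)
    x: list of d-dimensional covariate vectors x_1, x_2, ...
    z0: placeholder for z_0, which is not observed (an assumption here)
    """
    prev = [z0] + list(z)          # prev[t] is z_t, with prev[0] = z0
    return [[prev[t]] + list(x[t]) for t in range(len(x))]

z = [10.0, 11.0, 12.0]
x = [[0, 1], [1, 0], [0, 1]]       # d = 2 covariates per time step
Y = build_inputs(z, x)
print(Y)                            # [[0.0, 0, 1], [10.0, 1, 0], [11.0, 0, 1]]
print(len(Y), len(Y[0]))            # 3 3, i.e. t x (d + 1)
```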

（4） Methodology

（4.1）Enhancing the locality of Transformer

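My own impression: the main inspiration is to view the mapping from the input x to Q, K, V in self-attention as a $1\times 1$ convolution, and then to generalize that convolution (much like the relationship between ConvLSTM and LSTM).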

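The figure above compares the model proposed in the paper with the canonical Transformer. The difference is that the kernel size of the Conv1d is changed from 1 to k, so that Q and K are built from the last k points of context. This gives the model the ability to match similar shapes: the canonical Transformer asks which single earlier point the current input point most resembles, whereas here the question is which earlier segment the current window of k lags most resembles. The design is clearly reasonable to some extent.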
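
To make the kernel-size idea concrete, here is a minimal single-head sketch in plain Python. A causal convolution with kernel size 1 is just a pointwise projection; with kernel size k, each query or key also mixes in the k-1 preceding inputs. The function names and the hand-picked kernel `W2` are assumptions for illustration. In the actual model the kernels are learned, and this is not the authors' implementation.

```python
import math

def causal_conv(Y, W):
    """Y: T x d_in input rows. W: k x d_in x d_out kernel.
    out[t] = sum_j Y[t-j] @ W[j], treating rows before the start as zeros."""
    T, k, d_out = len(Y), len(W), len(W[0][0])
    out = [[0.0] * d_out for _ in range(T)]
    for t in range(T):
        for j in range(k):
            if t - j < 0:
                break              # implicit left zero-padding
            for a, y in enumerate(Y[t - j]):
                for b in range(d_out):
                    out[t][b] += y * W[j][a][b]
    return out

def causal_attention_weights(Q, K):
    """Scaled dot-product attention weights; row t only sees positions <= t."""
    T, d = len(Q), len(Q[0])
    weights = []
    for t in range(T):
        scores = [sum(q * c for q, c in zip(Q[t], K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return weights

Y = [[1.0], [2.0], [3.0]]
W1 = [[[1.0]]]                        # k = 1: pointwise, as in the canonical Transformer
W2 = [[[1.0, 0.0]], [[0.0, 1.0]]]     # k = 2: picks out [y_t, y_{t-1}]
print(causal_conv(Y, W1))             # [[1.0], [2.0], [3.0]]
print(causal_conv(Y, W2))             # [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]

Q = K = causal_conv(Y, W2)            # the same kernel for Q and K, only for brevity
A = causal_attention_weights(Q, K)
```

With `W2`, each query and key carries the current value together with the previous one, so the dot product compares short segments rather than isolated points.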