Synced (机器之心): Starting from word2vec, a tour of GPT's sprawling family tree (Part 4)

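To better understand what Figure 5 is doing, we can map the symbols in the figure onto the three kinds of vectors in the Attention mechanism discussed earlier:
During the lookup, our goal is to use the relevance between h and s to determine the weight of each h in the context matrix, so the s_t at the top is the query vector, the one used for retrieval;
Given that point and the earlier explanation of the Attention mechanism, h_t is easy to interpret: it plays the role of both the key and the value vector described above.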
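At the LSTM company, the Attention mechanism is less obvious, but its internal gate mechanism can be seen as a form of attention to some degree: the input gate selects which current information to let in, and the forget gate selects which past information to discard. LSTM is billed as solving the long-term dependency problem, but in practice it still has to capture sequence information one step at a time. On long texts its performance gradually decays as the number of steps grows, and it struggles to retain all of the useful information.
In short, during its outsourcing phase the Attention mechanism takes a weighted combination of the hidden states from all steps, concentrating attention on the hidden states that matter most across the whole text. Besides improving model performance, the Attention values can also be visualized to see which steps are important. Be wary of overfitting, though, and note that Attention also adds computation.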
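The weighting described above can be sketched in a few lines of NumPy. This is only an illustration: it assumes a plain dot-product score between the query s_t and each hidden state h (the scoring function used in Figure 5 is not specified here), and the names attend and H, as well as the toy dimensions, are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(s_t, H):
    """Weight hidden states H (steps x dim) by their relevance to query s_t (dim,)."""
    scores = H @ s_t            # one score per step; dot product is an assumed choice
    weights = softmax(scores)   # attention weights: non-negative, summing to 1
    context = weights @ H       # weighted sum of the hidden states (keys double as values)
    return context, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))     # 5 steps, hidden size 4 (toy values)
s_t = rng.normal(size=4)        # decoder state acting as the query
context, weights = attend(s_t, H)
```

The weights array is what one would plot to see which steps the model considers important.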
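Striking out on its own: Self-attention
While Attention was busy outsourcing its own business, its excellent outsourcing scheme caught Transformer's eye, and Transformer began to wonder whether the core idea of the Attention company could stand on its own and rule in its own right. Never one for half measures, Transformer put the idea to its distant cousin Attention, and the two hit it off at once. After much hard research, they excitedly shouted their slogan: "Attention is all you need!"
And so the Transformer company came into being!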
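[The Ancestor Who Revived the Family] Transformer: Attention is enough!
Bridging past and future: self-attention
So what exactly did they do? Put simply, they used an improved form of self-attention to move attention from a supporting role straight to the lead. To keep my paraphrase from distorting your understanding, I will first quote the original paper's description of the attention mechanism in each Transformer component (I have slightly reordered the passages for ease of explanation):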
The encoder contains self-attention layers. In a self-attention layer, all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
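As the quotes show, Transformer mainly uses two kinds of Attention: Self-attention and encoder-decoder attention. Self-attention is there mainly to shed outside help (LSTM, RNN, CNN, and so on), whereas encoder-decoder attention carries on the earlier outsourcing scheme (Figure 5) and plays the same role as before, namely connecting the encoder and the decoder. As the name suggests, Self Attention is not attention between Target and Source but attention among the elements within Source or among the elements within Target. It can also be understood as the attention computation in the special case Target = Source [12].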
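To make the three uses of attention in the quoted passages concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is a simplification under stated assumptions: a single head, no learned query/key/value projections (the inputs are used directly), and toy random inputs. The names enc, dec and causal are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d), K and V: (n_k, d); mask: boolean (n_q, n_k), True = allowed."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Illegal connections get -inf so they receive zero weight after softmax.
        scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax; each row must keep at least one allowed position.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # stand-in for encoder output: 6 source positions
dec = rng.normal(size=(4, 8))   # stand-in for previous decoder layer: 4 target positions

# Encoder self-attention: queries, keys and values all come from the same place.
_, w_enc = scaled_dot_product_attention(enc, enc, enc)

# Decoder self-attention: a lower-triangular mask lets each position attend
# only to positions up to and including itself.
causal = np.tril(np.ones((4, 4), dtype=bool))
_, w_dec = scaled_dot_product_attention(dec, dec, dec, mask=causal)

# Encoder-decoder attention: queries from the decoder, keys and values from the encoder.
out, w_cross = scaled_dot_product_attention(dec, enc, enc)
```

The only difference between the three calls is where Q, K and V come from and whether a mask is applied, which is the sense in which self-attention is the Target = Source special case.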