Add AFT
patrick-llgc committed Oct 22, 2023
1 parent c471c22 commit d6ae182
Showing 2 changed files with 45 additions and 2 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -38,9 +38,9 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl

## 2023-09 (1)
- [RetNet: Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/abs/2307.08621) [[Notes](paper_notes/retnet.md)] [MSRA]
- [Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236) [[Notes](paper_notes/transformers_are_rnns.md)] <kbd>ICML 2020</kbd>
- [Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236) [[Notes](paper_notes/transformers_are_rnns.md)] <kbd>ICML 2020</kbd> [Linear attention]
- [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- [An Attention Free Transformer](https://arxiv.org/abs/2105.14103) [[Notes](paper_notes/aft.md)] [Apple]
- [AFT: An Attention Free Transformer](https://arxiv.org/abs/2105.14103) [[Notes](paper_notes/aft.md)] [Apple]
- [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
- [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
- [Speeding up the GPT - KV cache](https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/)
@@ -71,6 +71,7 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation]()
- [VIMA: General Robot Manipulation with Multimodal Prompts]()
- [An Attention Free Transformer](https://arxiv.org/abs/2105.14103) [Apple]
- [PDDL Planning with Pretrained Large Language Models]() [MIT Leslie Kaelbling]

## 2023-08 (3)
- [RT-1: Robotics Transformer for Real-World Control at Scale](https://arxiv.org/abs/2212.06817) [[Notes](paper_notes/rt1.md)] [DeepMind]
42 changes: 42 additions & 0 deletions paper_notes/aft.md
@@ -0,0 +1,42 @@
# [An Attention Free Transformer](https://arxiv.org/abs/2105.14103)

_September 2023_

tl;dr: A new mechanism to replace dot-product attention, by introducing a learned pair-wise position bias. No attention map!

#### Overall impression
The conventional scaled dot-product attention mechanism has quadratic time and space complexity with respect to the context size. Many previous works (such as linear attention in [Transformers are RNNs](transformers_are_rnns.md)) try to approximate the full attention operation.

In AFT, K and V (the context) are first combined with a set of **learned position biases**. This step generates a reduced context, akin to compressing a dictionary. The lookup of the query in this dictionary is then performed by element-wise multiplication.

AFT maintains direct interaction between any two points in the context, which is a major advantage of dot-product attention.

#### Key ideas
- AFT is a plug-in replacement for multi-head attention (MHA).
$$
Y_t = \sigma_q(Q_t) \odot \frac{\sum_{t'=1}^T \exp(K_{t'} + w_{t, t'}) \odot V_{t'}}{\sum_{t'=1}^T \exp(K_{t'} + w_{t, t'})}
$$
- $\sigma_q$ is the sigmoid. $\odot$ is the element-wise product. $w \in \mathbb{R}^{T \times T}$ is the learned position bias.
- w has no channel dimension and depends only on position.
- For each target position t, AFT performs an element-wise weighted average of the values, and the result is combined with the query by element-wise multiplication.
- The weights are computed simply from the keys and a set of learned position biases.
- Attention map or attention matrix: row-wise softmax(QK^T), of shape T×T.
- The attention map gives the element-wise connectivity and is computationally heavy, O(T^2 d). For a given element of the length-T sequence, it specifies how much weight each element receives in the weighted sum of V that generates the final result.
- AFT eliminates the need for an attention map.
- AFT-local: w is masked to a [band matrix](https://en.wikipedia.org/wiki/Band_matrix), whose non-zero entries are confined to a diagonal band comprising the main diagonal and zero or more diagonals on either side. When s=1, w is a tridiagonal matrix.
- AFT-simple: when s=0, no w is learned.
$$
\begin{aligned}
Y_t &= \sigma_q(Q_t) \odot \frac{\sum_{t'=1}^T \exp(K_{t'}) \odot V_{t'}}{\sum_{t'=1}^T \exp(K_{t'})} \\
&= \sigma_q(Q_t) \odot \sum_{t'=1}^T (\text{softmax}(K) \odot V)_{t'}
\end{aligned}
$$
- The context reduction is simplified to element-wise operations and global pooling.
- There is no global connectivity.
- AFT-conv: $w_{t, t'}$ depends only on the relative position of t and t', i.e., $w_{t, t'} = f(t - t')$.
- When s=0, there is no positional bias. --> Global connectivity is lost. The paper does not state this clearly, but I think so.
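
The formulas above can be sketched in NumPy as follows. This is a minimal, unoptimized illustration written for these notes, not the paper's implementation: it materializes a (T, T, d) tensor so that each line maps onto a term of the equation. The function names and the `band_mask` helper are hypothetical, and `half_width` counts diagonals on each side of the main diagonal (so `half_width=1` is tridiagonal, following the reading of s in these notes; the paper's own indexing of s may differ).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """Naive AFT for Q, K, V of shape (T, d) and position bias w of shape (T, T)."""
    # logits[t, t', c] = K[t', c] + w[t, t']; w is broadcast over channels.
    logits = K[None, :, :] + w[:, :, None]                # (T, T, d)
    # Subtract a per-(t, c) constant for stability; it cancels in the ratio.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    num = (weights * V[None, :, :]).sum(axis=1)           # (T, d)
    den = weights.sum(axis=1)                             # (T, d)
    return sigmoid(Q) * num / den

def aft_simple(Q, K, V):
    """No position bias: softmax over time per channel, then global pooling."""
    e = np.exp(K - K.max(axis=0, keepdims=True))
    context = (e / e.sum(axis=0, keepdims=True) * V).sum(axis=0)  # (d,)
    return sigmoid(Q) * context                           # broadcast to (T, d)

def band_mask(T, half_width):
    """1 where |t - t'| <= half_width, 0 elsewhere (hypothetical helper)."""
    idx = np.arange(T)
    return (np.abs(idx[:, None] - idx[None, :]) <= half_width).astype(float)

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
w = rng.normal(size=(T, T))  # stands in for the learned position bias

Y_full = aft_full(Q, K, V, w)
Y_local = aft_full(Q, K, V, w * band_mask(T, 1))   # w zeroed outside the band
Y_zero = aft_full(Q, K, V, np.zeros((T, T)))       # w = 0 everywhere
Y_simple = aft_simple(Q, K, V)
```

In this sketch `Y_zero` and `Y_simple` agree numerically, which is the s=0 case above. Note also that masking sets w to 0 rather than dropping positions, so in `Y_local` out-of-band positions still enter the sum with weight exp(K_{t'}); with w=0 everywhere, the pooled context is the same for every t and only the query gate differs.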

#### Technical details
- The rearranged computational ordering of QKV is also found in [linearized attention](transformers_are_rnns.md) works. The difference is that AFT uses element-wise attention, while all linearized attention methods still use the dot product. This further reduces the computational complexity from O(Td^2) to O(Td). --> This is not a large saving, as typically T>>d.
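
To make the O(Td^2) versus O(Td) contrast concrete, here is a small NumPy sketch comparing the intermediate "context" each approach builds. It assumes non-causal attention and an `elu(x) + 1` feature map for the linearized variant; the function names are hypothetical and the code is only meant to show the shapes involved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu_plus_one(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise (always positive).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linearized attention with the K^T V product computed first."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V                    # (d, d) context, O(T d^2) to build
    z = phi_k.sum(axis=0)               # (d,) normalizer
    out = (phi_q @ kv) / (phi_q @ z)[:, None]
    return out, kv

def aft_simple(Q, K, V):
    """AFT-simple: element-wise products only."""
    e = np.exp(K - K.max(axis=0, keepdims=True))
    context = (e * V).sum(axis=0) / e.sum(axis=0)   # (d,) context, O(T d) to build
    return sigmoid(Q) * context, context

rng = np.random.default_rng(0)
T, d = 128, 16
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

lin_out, kv = linear_attention(Q, K, V)
aft_out, context = aft_simple(Q, K, V)
print(kv.shape, context.shape)  # (16, 16) (16,)
```

Both variants avoid the T×T attention map; the linearized one still carries a d×d matrix per layer, whereas AFT-simple carries a length-d vector.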

#### Notes
- What keeps the global connectivity if the learned positional bias W=0?
- AFT can be viewed as performing attention where the number of attention heads is the same as the model's feature dimension. --> Why? I did not get it.
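
One possible reading of the last point, obtained by writing the AFT equation from the Key ideas section for a single channel $i$ (a sketch, assuming a "head" may have dimension 1 and that its logits need not come from a dot product):

$$
Y_t^i = \sigma_q(Q_t^i) \sum_{t'=1}^T a^i_{t,t'} V^i_{t'}, \qquad a^i_{t,t'} = \frac{\exp(K^i_{t'} + w_{t,t'})}{\sum_{\tau=1}^T \exp(K^i_{\tau} + w_{t,\tau})}
$$

For a fixed channel $i$, the weights $a^i_{t,t'}$ are non-negative and sum to 1 over $t'$, so they form a row of an implicit T×T attention map specific to that channel. With d channels there are d such implicit maps, each acting on a scalar value, which resembles d heads of dimension 1. The maps are never materialized, and unlike MHA the query enters only as a sigmoid gate on the result.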
