Transformers in Vision - (1) Attention & Transformer

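Over a series of posts, I want to look at how the Transformer has been used in computer vision and how the related models are evolving.

This post covers the most important fundamentals: attention and the Transformer.

Like CNNs and RNNs, which are often called the basics of deep learning, the Transformer is a kind of neural network, but it takes a somewhat different form. Attention is the concept needed to understand its structure and how it works.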

Introduction

First, let's summarize the inductive biases of the representative neural networks: CNN, GNN, and RNN. An inductive bias is a set of additional assumptions that lets a deep learning model perform well on data it has never seen (that is, generalize well). Put more simply, it is a property the model relies on to make predictions for unseen inputs, and it is not acquired during training but is inherent to the architecture (CNN, RNN, MLP, etc.).

Inductive biases are broadly divided into relational and non-relational inductive biases, and "inductive bias" usually refers to the relational kind. A relational inductive bias is an inductive bias about the relationships between input elements and output elements.

The inductive biases of the representative neural networks covered previously are as follows.

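Convolutional Neural Networks (CNN)
- Locality Principle : Only a patch of the image the size of the kernel (filter) is considered at a time, not the whole image.
- Spatial Invariance : Even if the position of an object in the image changes (e.g. translation), the result does not change. (Translation equivariance: a shifted input produces a correspondingly shifted feature map, and pooling over positions then makes the final result insensitive to the shift.)
Graph Neural Networks (GNN)
- Permutation Invariance : Even if the order of the nodes changes, the output (node embedding, graph embedding, etc.) stays the same.
Recurrent Neural Networks (RNN)
- Sequentiality : Changing the order of the elements changes the output. For example, at the word level, changing the order of the words changes the meaning of the sentence.
- Temporal Invariance : Changing the position (here, the index within the sequence data) does not change the output. For example, at the sentence level, moving a sentence to a different position does not change its meaning.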

For example, when the training data are images, using a CNN means that, thanks to spatial invariance, objects in new data can be detected regardless of where they appear.
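The locality and translation properties of a CNN can be checked on a toy 1D signal. The sketch below is only an illustration: the helper `conv1d`, the kernel values, and the two signals are assumptions made for this demo and do not come from the original post.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: each output only sees a local window (locality)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, -1.0]                    # toy edge detector (assumed values)
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]      # the same pattern moved 2 steps to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Equivariance: the response pattern is identical, only moved by 2 steps.
same_pattern = out_shifted[2:] == out[:-2]
# Invariance after pooling: a global max-pool gives the same value for both inputs.
same_pooled = max(out) == max(out_shifted)
```

The convolution output moves together with the input, and only the global pooling step removes the dependence on position.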

Encoder-Decoder Architecture

In sequence-to-sequence problems (for example, machine translation), the lengths of the input data and the output data are not fixed.

An encoder-decoder architecture handles this kind of data efficiently.

Encoder : Takes a variable-length sequence as input and outputs a state.
Decoder : Takes the encoded state and the context of the target sequence as input, and predicts (generates) the subsequent result (the target sequence).

The state above can be seen as compressed information about the input. In other words, the encoder can be regarded as performing data compression. The information captured by the state is called feature information, and the encoding process can be expressed as the following objective.

\( \underset{\theta}{\min} \lVert \mathbf{x} - f_\theta (\mathbf{x}) \rVert \)

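In addition, the decoder adds noise to the state and outputs the subsequent result. The goal of the encoder-decoder architecture is to compare this final result with the given sequence and reduce the difference. This process of reducing the difference between the final result and the input is called denoising, and is expressed by the following formula.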

\( \underset{\theta}{\min} \lVert \mathbf{x} - f_\theta (\mathbf{x} + \text{noise} ) \rVert \)

For example, in a machine translation task that translates the input "My name is Sangjune Park." (English) into "제 이름은 박상준입니다." (Korean), the encoder-decoder architecture works as follows.

Encoder : Encodes the input sentence into a state (feature). Here the state captures semantic information.
Decoder : Uses the feature to generate the translated sentence.

RNN-based Encoder-Decoder

Let's look more closely. In an RNN-based (including LSTM and GRU) encoder-decoder architecture, the encoder takes the input sequence at every time step and transforms it into hidden states.

\( \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1} ) \)
\( \mathbf{c} = q(\mathbf{h}_1, \cdots, \mathbf{h}_T ) \)

\(q\) : a function that maps the hidden states (of all time steps) to the context

Then the decoder takes the previous output, the context, and its previous hidden state as input and produces a new state.

\( \mathbf{s}_\tau = g (y_{\tau - 1}, \mathbf{c}, \mathbf{s}_{\tau - 1} ) \)

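Described in words these may sound like the same state, but the encoder state and the decoder state are different. (A plain RNN is exactly the case where h and s are the same!) As in the figure above, the decoder state is updated using the 'context', which contains the hidden state information produced by the encoder.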
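The three formulas for \(\mathbf{h}_t\), \(\mathbf{c}\), and \(\mathbf{s}_\tau\) can be traced with a toy scalar RNN. This is a minimal sketch under stated assumptions: the weights, the choice of `tanh` for \(f\) and \(g\), taking the last hidden state as \(q\), and feeding the state back as the previous output are all hypothetical choices for illustration.

```python
import math

def encode(xs, w_x=0.5, w_h=0.8):
    """h_t = f(x_t, h_{t-1}), with f = tanh of a weighted sum (toy scalar RNN)."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        hs.append(h)
    return hs

def context(hs):
    """c = q(h_1, ..., h_T); here q simply returns the last hidden state."""
    return hs[-1]

def decode_step(y_prev, c, s_prev, w_y=0.3, w_c=0.6, w_s=0.8):
    """s_tau = g(y_{tau-1}, c, s_{tau-1})."""
    return math.tanh(w_y * y_prev + w_c * c + w_s * s_prev)

hs = encode([1.0, -2.0, 0.5])     # input sequence of length 3
c = context(hs)

s, y = c, 0.0                     # assumed initial decoder state and start token
states = []
for _ in range(4):                # the output length need not match the input length
    s = decode_step(y, c, s)
    y = s                         # toy readout: the emitted output is the state itself
    states.append(s)
```

Note that the same fixed `c` is passed to every decoder step, which is the limitation that attention later addresses.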

The context here treats the states of all time steps equally, whereas the Transformer forms the context by giving more weight to the hidden states that deserve attention (the idea of attention).

Self-attention and Transformer

Attention

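In general, when a person predicts the class of an object in a given image, they distinguish its features through specific parts and then predict the class. For example, try to guess which animal appears in the following picture.

As the red boxes in the picture above show, a person pays more attention to (prioritizes) specific parts such as the 'big eyes', 'pointed ears', and 'round paws'. This is how the object is recognized effectively.

The attention module is a model of exactly this.

Attention can be designed as 'pooling' (with bias alignment) over the input.

The inputs to an attention layer are as follows.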

Query : volitional cue \( \mathbf{q} \in \mathbb{R}^{d_q} \)
- The representation of the current element (the one of interest)
Keys : list of cues \(\mathbf{k} \in \mathbb{R}^{d_k} \)
- Representations of the elements in the context (input sequence), used to compute how relevant each element is to the query
Values : Feature representation \( \mathbf{v} \in \mathbb{R}^{d_v} \)
- Representations of the elements in the context (input sequence). Each value stands for its element; the output of the attention layer is a weighted sum of the values, and the weights are the attention weights that express relevance to the query.

In other words, given a query, the attention mechanism performs a biased selection among the feature representations through attention pooling (pooling with attention weights). This process can be written as follows.

\( f \left( \mathbf{q}, \left\{ (\mathbf{k}_i, \mathbf{v}_i) \right\}_{i=1}^m \right) = \sum\limits_{i=1}^m \alpha (\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \)
\( \alpha (\mathbf{q}, \mathbf{k}_i) = \text{softmax} \left( a(\mathbf{q}, \mathbf{k}_i) \right) = \cfrac{\text{exp}\left( a(\mathbf{q}, \mathbf{k}_i) \right)}{\sum\limits_{j=1}^m \text{exp} \left( a(\mathbf{q}, \mathbf{k}_j) \right)} \)

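\(a\) : the attention scoring function, which gives a score for how similar the query \(\mathbf{q}\) is to the i-th key \(\mathbf{k}_i\). The following functions are used as attention scoring functions.
- Additive pooling
- Scaled dot product
\(\alpha\) : the attention weight, which expresses how similar (relevant) the query is to the i-th key among the m keys as a value between 0 and 1. Because of the softmax, the weights over the m keys sum to 1.

In short, the general attention mechanism computes an attention weight for how similar the query is to the i-th key, multiplies it by the i-th value, and sums these products over all i to obtain the output.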
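The attention pooling formula can be sketched in a few lines of plain Python. The function names and the toy keys and values are assumptions for illustration, and an unscaled dot product stands in for the scoring function \(a\) here.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(q, keys, values, score=dot):
    """Return sum_i alpha(q, k_i) * v_i together with the weights alpha."""
    weights = softmax([score(q, k) for k in keys])
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attention([0.0, 5.0], keys, values)
```

The query `[0.0, 5.0]` matches the second and third keys equally well and the first key poorly, so the output lands between the second and third values.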

Let's look more closely at the functions used as the attention scoring function (\(a\)).

Additive Pooling (can be used even when \(d_k \neq d_q\))

\( \begin{align*} a(\mathbf{q}, \mathbf{k}) &= \left\langle \mathbf{w}, \text{tanh}(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k} ) \right\rangle \\ &= \left\langle \mathbf{w}, \text{tanh}([\mathbf{W}_q, \mathbf{W}_k] [\mathbf{q}, \mathbf{k}]) \right\rangle \end{align*} \)

\( \mathbf{w} \in \mathbb{R}^{d_h} \)
\( \mathbf{W}_q \in \mathbb{R}^{d_h \times d_q} \)
\( \mathbf{W}_k \in \mathbb{R}^{d_h \times d_k} \)
Square brackets (\([]\)) denote concatenation.

This attention scoring function has a form similar to the memory gate of an LSTM. Here \(a\) is learnable: \(\mathbf{w}\), \(\mathbf{W}_q\), and \(\mathbf{W}_k\) are trained parameters.
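The additive score and its concatenated form can be compared numerically. This is a sketch only: the matrices and vectors hold arbitrary toy values (with \(d_q = 2\), \(d_k = 3\), \(d_h = 2\)), and the helper names are hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def additive_score(q, k, w, W_q, W_k):
    """a(q, k) = <w, tanh(W_q q + W_k k)>"""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return sum(w_i * h_i for w_i, h_i in zip(w, hidden))

def additive_score_concat(q, k, w, W_q, W_k):
    """Same score written as <w, tanh([W_q, W_k] [q, k])>."""
    W = [row_q + row_k for row_q, row_k in zip(W_q, W_k)]   # d_h x (d_q + d_k)
    hidden = [math.tanh(z) for z in matvec(W, q + k)]       # q + k concatenates the lists
    return sum(w_i * h_i for w_i, h_i in zip(w, hidden))

W_q = [[0.1, -0.2], [0.3, 0.4]]                 # d_h x d_q
W_k = [[0.5, 0.0, -0.5], [0.2, 0.1, 0.0]]       # d_h x d_k
w = [1.0, -1.0]                                 # d_h
q, k = [1.0, 2.0], [1.0, 0.0, 1.0]

score = additive_score(q, k, w, W_q, W_k)
score_concat = additive_score_concat(q, k, w, W_q, W_k)
```

Both functions agree up to floating-point rounding, and the query and key are free to have different dimensions because each is first projected to \(d_h\).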

Scaled Dot-product (usable when \(d_k = d_q = d\))

\( a(\mathbf{q}, \mathbf{k}) = \cfrac{\left\langle \mathbf{q}, \mathbf{k} \right\rangle}{\sqrt{d}} \)

When the key and the query have the same dimension, we can simply use the dot product (inner product) scaled by the square root of that dimension. Here \(a\) has no learnable parameters: once \(\mathbf{q, k}\) are fixed, its value is determined (deterministic).
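A short sketch of the scaled dot-product score follows. The vectors are toy values chosen for easy arithmetic, and `scaled_dot_score` is a hypothetical helper name. A common motivation for the scaling is that, if the components of the query and key are independent with zero mean and unit variance, the raw dot product has variance \(d\), so dividing by \(\sqrt{d}\) keeps the scores in a similar range as \(d\) grows.

```python
import math

def scaled_dot_score(q, k):
    """a(q, k) = <q, k> / sqrt(d); requires q and k to share the dimension d."""
    d = len(q)
    if len(k) != d:
        raise ValueError("query and key must have the same dimension")
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(d)

q = [1.0] * 16
k = [0.5] * 16
raw = sum(a * b for a, b in zip(q, k))   # unscaled dot product
scaled = scaled_dot_score(q, k)          # divided by sqrt(16) = 4
```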

Bahdanau Attention and Self-attention

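In the RNN-based encoder-decoder architecture above, every decoder step used the same context \(\mathbf{c}\) that encodes the whole input. Plugging this into the attention formula, the context is instead computed for each decoder step as follows.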

\( \mathbf{c}_\tau = \sum\limits_{t=1}^T \alpha (\mathbf{s}_{\tau - 1}, \mathbf{h}_t) \mathbf{h}_t \)

Comparing this with the formula for the attention mechanism, \(\mathbf{s}_{\tau - 1}\) is the query and \(\mathbf{h}_t\) is both the key and the value. This method applies attention to the decoder computation; attention is not used in the encoder.

To apply attention to the encoder, we need the concept of self-attention.

Self-attention means using the input sequence \(\mathbf{x}\) for all of the query, key, and value in the attention mechanism.

\( f \left( \mathbf{q}, \left\{ ( \mathbf{k}_i, \mathbf{v}_i ) \right\}_{i=1}^m \right) = \sum\limits_{i=1}^m \alpha (\mathbf{q}, \mathbf{k}_i)\mathbf{v}_i \)
\( \rightarrow \; f \left( \mathbf{x}, \left\{ ( \mathbf{x}_i, \mathbf{x}_i ) \right\}_{i=1}^n \right) = \sum\limits_{i=1}^n \alpha (\mathbf{x}, \mathbf{x}_i)\mathbf{x}_i \)

Running this self-attention with multiple heads, that is, with several different sets of weights (indexed by \(m\) below), is called multi-head attention (MHA). The heads can be computed in parallel.

\( \mathbf{h}_m = f \left( \mathbf{W}_m^{(q)} \mathbf{x}, \left\{ \left( \mathbf{W}_m^{(k)} \mathbf{x}_i, \mathbf{W}_m^{(v)} \mathbf{x}_i \right) \right\}_{i=1}^n \right) = \sum\limits_{i=1}^n \alpha \left( \mathbf{W}_m^{(q)} \mathbf{x}, \mathbf{W}_m^{(k)} \mathbf{x}_i \right) \mathbf{W}_m^{(v)} \mathbf{x}_i \)
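The per-head formula can be sketched as follows. This is a simplified illustration under assumptions: the weight matrices are arbitrary toy values, scaled dot-product scoring is assumed for \(\alpha\), and head outputs are simply concatenated (any final output projection is omitted).

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_head(xs, W_q, W_k, W_v):
    """One head: every x in xs acts as a query; all xs act as keys and values."""
    qs = [matvec(W_q, x) for x in xs]
    ks = [matvec(W_k, x) for x in xs]
    vs = [matvec(W_v, x) for x in xs]
    scale = math.sqrt(len(ks[0]))
    outputs = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        outputs.append([sum(w * v[j] for w, v in zip(weights, vs))
                        for j in range(len(vs[0]))])
    return outputs

def multi_head_attention(xs, heads):
    # Heads are independent of each other, so they could run in parallel.
    per_head = [self_attention_head(xs, *W) for W in heads]
    # Concatenate the head outputs position by position.
    return [[value for h in per_head for value in h[i]] for i in range(len(xs))]

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]                       # swaps the two features
heads = [(identity, identity, identity), (swap, identity, swap)]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]             # 3 tokens, 2 features each
out = multi_head_attention(xs, heads)                 # 3 tokens, 2 heads x 2 features
```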

Comparing CNN, RNN, and Self-attention

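Let's compare CNN, RNN, and self-attention in terms of computational complexity (CC), sequential operations (SO), and maximum path length (MPL). Each metric means the following.

CC : the amount of computation
SO : the higher, the less suitable for parallel computation; that is, the lower, the better suited to GPU computation
MPL : the higher, the harder it is to capture long-range dependencies (LSTM and GRU were designed to mitigate this weakness of RNNs)

Self-attention is well suited to parallel computation and sequence modeling, but it has the drawback that computation becomes markedly slower as the input sequence gets longer, because its cost grows quadratically with the sequence length.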
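The three metrics can be made concrete with the commonly cited asymptotic orders for one layer over \(n\) tokens of width \(d\). The sketch below drops all constants and assumes a contiguous convolution kernel of width `k`; the function name and the chosen sizes are hypothetical.

```python
def layer_costs(n, d, k=3):
    """Asymptotic orders (constants dropped) for one layer over n tokens of width d."""
    return {
        "cnn":            {"compute": k * n * d * d, "sequential_ops": 1, "max_path": n / k},
        "rnn":            {"compute": n * d * d,     "sequential_ops": n, "max_path": n},
        "self_attention": {"compute": n * n * d,     "sequential_ops": 1, "max_path": 1},
    }

short = layer_costs(n=512, d=64)
long = layer_costs(n=1024, d=64)

# How much the compute grows when the sequence length doubles.
growth = {name: long[name]["compute"] / short[name]["compute"] for name in short}
```

Doubling the sequence length doubles the compute of the CNN and RNN layers but quadruples it for self-attention, while self-attention keeps both the sequential operations and the path length constant.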

Positional Encoding

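Self-attention is permutation invariant. For a given query, the output is a weighted sum over the key-value pairs, and neither the softmax that produces the attention weights nor the sum depends on the order in which the pairs are listed.

As a result, it cannot convey the order information (sequentiality) of the sequence, and positional encoding is used to solve this.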
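The order-independence can be verified directly. In this sketch, the helper `attend`, the unscaled dot-product scoring, and the toy tokens are assumptions made for the demonstration.

```python
import math

def attend(q, pairs):
    """Attention pooling over (key, value) pairs for one query, dot-product scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k, _ in pairs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    d_v = len(pairs[0][1])
    return [sum(e / total * v[j] for e, (_, v) in zip(exps, pairs)) for j in range(d_v)]

tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]          # toy sequence x_1, x_2, x_3
pairs = [(x, x) for x in tokens]                        # self-attention: key = value = x_i
q = tokens[0]

original = attend(q, pairs)
shuffled = attend(q, [pairs[2], pairs[0], pairs[1]])    # same pairs, different order
max_diff = max(abs(a - b) for a, b in zip(original, shuffled))
```

The two outputs match up to floating-point rounding, so without extra position information the model cannot tell the two orderings apart.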

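Positional encoding is a way to give self-attention position (order) information.

The following positional encoding \(\mathbf{P}\) is applied to the \(i\)-th input element \(\mathbf{x}_i\), where \(j\) indexes its feature components and \(d\) is the feature dimension. (This is the method proposed in the 'Attention Is All You Need' paper; other methods may be used depending on the task.)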

\( \mathbf{x}_i \rightarrow \mathbf{P}(\mathbf{x}_i) = \begin{cases} x_j + \sin \left( \cfrac{i}{10000^{j/d}} \right) & \quad \text{where } j \text{ is even} \\ x_j + \cos \left( \cfrac{i}{10000^{(j-1)/d}} \right) & \quad \text{where } j \text{ is odd} \end{cases} \)
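The sinusoidal formula translates almost line by line into code. This is a minimal sketch: the function names are hypothetical, and counting positions \(i\) and feature indices \(j\) from 0 is an assumption.

```python
import math

def positional_encoding(n, d):
    """Return an n x d table: sin for even feature indices j, cos for odd ones."""
    pe = [[0.0] * d for _ in range(n)]
    for i in range(n):              # position in the sequence
        for j in range(d):          # feature index
            if j % 2 == 0:
                pe[i][j] = math.sin(i / 10000 ** (j / d))
            else:
                pe[i][j] = math.cos(i / 10000 ** ((j - 1) / d))
    return pe

def add_positions(xs):
    """x_i -> P(x_i): add the encoding of position i to each input vector."""
    pe = positional_encoding(len(xs), len(xs[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(xs, pe)]

pe = positional_encoding(n=3, d=4)
encoded = add_positions([[0.0] * 4, [0.0] * 4, [0.0] * 4])
```

Identical input vectors at different positions now receive different representations, which restores the order information that self-attention alone discards.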

Transformer

In the end, the Transformer is an architecture that does not use RNN-based encoders or decoders at all and applies attention modules to the encoder as well.

The Transformer has five main components.

1. Positional Encoding

Positional encoding compensates for the Transformer's lack of inductive bias. The Transformer cannot know the exact position of each input element (the time step in an RNN), so this information is provided through positional encoding.

2. Multi-head Self-attention

In the decoder, the first multi-head attention layer uses masked multi-head attention. Because the Transformer receives all time steps at once, computing the self-attention scores as a matrix also multiplies each query with keys from time steps later than its own (the part of the matrix above the diagonal), and using those entries would mean the model already knows the future. To prevent this, the values above the diagonal are ignored (masking), as shown below.

Masked self-attention computation, source: https://aimb.tistory.com/182

Masking example, source: https://wikidocs.net/156986
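The masking step can be sketched by setting the scores above the diagonal to negative infinity before the softmax, so that they receive zero weight. The function name and the score matrix are assumptions for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row i is a query at step i; columns j > i (future keys) are masked out."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
W = causal_attention_weights(scores)
```

The first query can only attend to itself, the second splits its weight over the first two positions, and the large future score of 5.0 in the second row has no effect.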

3. Multi-head Cross-attention

Cross-attention is an attention module that deals with two different sequences. That is, as in the figure above, it is attention in which the sequence providing the keys and values differs from the sequence providing the queries.
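A small sketch of cross-attention is given here, with queries taken from one sequence and keys and values from another. The names `decoder_states` and `encoder_states`, the toy values, and the scaled dot-product scoring are assumptions for illustration.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return output, weights

decoder_states = [[1.0, 0.0], [0.0, 1.0]]                   # queries, length 2
encoder_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]       # keys and values, length 3

results = [attend(q, encoder_states, encoder_states) for q in decoder_states]
outputs = [out for out, _ in results]
weights = [w for _, w in results]
```

The output has one vector per query, so its length follows the query sequence (2) and not the key and value sequence (3).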

4. Residual Connection + Layer Normalization

5. Positionwise FeedForward Network

For a detailed explanation of each component, see the Attention Is All You Need review post.
