
Transformer: Implementation and Explanation

H_erb Salt 2021. 6. 3. 17:45

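These notes summarize how the transformer works and how to implement it.

The content is copied almost verbatim from the explanatory material on wikidocs; I transcribed it by hand as a way of studying.

Original source of the transformer explanation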

transformer

Transformer reference material¶

Transformer paper: https://arxiv.org/abs/1706.03762
Transformer explainer video: https://www.youtube.com/watch?v=Yk1tV_cXMMU (a must-watch)

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

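1. Transformer hyperparameters¶

$d_{model}$ = 512
- The fixed input and output size of the transformer's encoder and decoder. The embedding vectors also have dimension $d_{model}$, and this dimension is kept whenever an encoder or decoder layer passes values to the next layer. The paper uses 512.
num_layers = 6
- The number of encoder layers and decoder layers, counting one encoder or one decoder as a layer. The paper uses 6.
num_heads = 8
- The number of parallel heads that attention is split into in Multi-Head Attention. The paper uses 8.
$d_{ff}$ = 2048
- The size of the hidden layer of the feed-forward network inside the transformer. The input and output layers of the feed-forward network have size $d_{model}$.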

In [2]:

tf.range(3, dtype=tf.float32).shape

Out[2]:

TensorShape([3])

In [3]:

tf.range(3, dtype=tf.float32)[:, tf.newaxis]

Out[3]:

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.],
       [1.],
       [2.]], dtype=float32)>

In [4]:

tf.range(3, dtype=tf.float32)[tf.newaxis, :]

Out[4]:

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0., 1., 2.]], dtype=float32)>

In [5]:

[i for i in range(10)][::2]

Out[5]:

[0, 2, 4, 6, 8]

2. Positional Encoding¶

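A step that injects the position of each element of sequential data. Attention by itself has no notion of word order, so position information is added to the embeddings.

$PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$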
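As a quick check of the two formulas, here is a minimal pure-Python sketch that fills in the table entry by entry. The function name `positional_encoding` and the tiny sizes are illustrative assumptions, not part of the original notes.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Columns 2i and 2i+1 share the same exponent 2i / d_model.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            # Even columns use sine, odd columns use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
print([round(v, 4) for v in pe[1][:2]])  # [sin(1), cos(1)]
```

Each row is the vector that gets added to the embedding of the word at that position.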

In [6]:

class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, position, d_model):
        super(PositionalEncoding, self).__init__()
        self.pos_encoding = self.positional_encoding(position, d_model)
        
    def get_angles(self, position, i, d_model):
        angles = 1 / tf.pow(10000, (2*(i // 2)) / tf.cast(d_model, tf.float32))
        return position * angles
    
    def positional_encoding(self, position, d_model):
        angles_rads = self.get_angles(position = tf.range(position, dtype=tf.float32)[:, tf.newaxis],
                                     i = tf.range(d_model, dtype=tf.float32)[tf.newaxis, :],
                                     d_model = d_model)
        
        # Apply sine to the even indices (2i) of the array
        sines = tf.math.sin(angles_rads[:, 0::2])
        
        # Apply cosine to the odd indices (2i+1) of the array
        cosines = tf.math.cos(angles_rads[:, 1::2])
        
        # Interleave the sine and cosine columns back into one array
        angles_rads = np.zeros(angles_rads.shape)
        angles_rads[:, 0::2] = sines
        angles_rads[:, 1::2] = cosines
        pos_encoding = tf.constant(angles_rads)
        
        # Add a leading batch axis: (position, d_model) -> (1, position, d_model)
        pos_encoding = pos_encoding[tf.newaxis, ...]
        
        return tf.cast(pos_encoding, tf.float32)
    
    def call(self, inputs):
        return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

In [7]:

# Maximum sentence length 500, embedding dimension 512
sample_pos_encoding = PositionalEncoding(500, 512)

plt.pcolormesh(sample_pos_encoding.pos_encoding.numpy()[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 128))
plt.ylabel('Position')
plt.colorbar()
plt.show()

3. Attention¶

The transformer uses three kinds of attention
- Encoder Self-Attention: performed in the encoder (Query = Key = Value)
- Masked Decoder Self-Attention: performed in the decoder (Query = Key = Value)
- Encoder-Decoder Attention: performed in the decoder (Query: decoder vectors / Key = Value: encoder vectors)

Self-Attention: the case where Query, Key, and Value are the same. Note that this means the vectors come from the same source, not that their values are identical.
The transformer stacks num_layers encoder layers; the paper uses 6. Treating one encoder as one layer, each encoder layer is divided into two sublayers: self-attention and a feed-forward neural network.
The Position-wise FFNN is the ordinary feed-forward neural network we already know.

4. Encoder Self-Attention¶

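1) Self-Attention: meaning and benefits¶

The attention function computes the similarity between a Query and every Key, uses those similarities as weights on the Value mapped to each Key, and returns the weighted sum of the Values.
Self-Attention means performing attention on the sequence itself.
Q, K, V in Self-Attention
- Q: the vectors of all words in the input sentence
- K: the vectors of all words in the input sentence
- V: the vectors of all words in the input sentence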

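2) Obtaining the Q, K, V vectors¶

Self-Attention was said to operate on the word vectors of the input sentence, but it does not use the $d_{model}$-dimensional word vectors at the encoder input directly. It first derives Q, K, and V vectors from each word vector.
- These Q, K, V vectors have a smaller dimension than the $d_{model}$-dimensional word vectors: the paper sets the dimension of each Q, K, V vector to $d_{model}$ divided by num_heads.

The smaller vectors are obtained by multiplying the original vector by weight matrices. Each weight matrix has size $d_{model}$ x ($d_{model}$ / num_heads) and is learned during training.
So with $d_{model}$ = 512 and num_heads = 8 as in the paper, each word vector is multiplied by three different weight matrices to produce Q, K, V vectors of size 64. The figure in the original material shows the Q, K, V vectors being derived from the word vector for "student". Applying the same process to every word vector gives each input word its own Q, K, V vectors.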
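The dimension bookkeeping can be sketched in plain Python. The random weights below stand in for learned parameters, and the helper names `matvec` and `random_matrix` are assumptions made for this illustration only.

```python
import random

def matvec(x, W):
    """Multiply a row vector x (length n) by a matrix W (n x m)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def random_matrix(rows, cols, rng):
    # Placeholder for a trained weight matrix.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, num_heads = 512, 8
d_k = d_model // num_heads  # 64, as in the paper

x = [rng.uniform(-1, 1) for _ in range(d_model)]  # one word vector
W_q = random_matrix(d_model, d_k, rng)
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_k, rng)

q, k, v = matvec(x, W_q), matvec(x, W_k), matvec(x, W_v)
print(len(x), len(q), len(k), len(v))  # 512 64 64 64
```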

3) Scaled dot-product Attention¶

Once the Q, K, V vectors are available, the rest is the same as the attention mechanism covered earlier. Each Q vector computes an attention score against every K vector, the scores are turned into an attention distribution, and that distribution is used to take a weighted sum of all V vectors, giving the attention value (context vector). This is repeated for every Q vector.

There are many attention functions. Rather than the plain dot-product function $score(q, k) = q \cdot k$, the transformer uses a version divided by a scaling value: $score(q, k) = q \cdot k / \sqrt{n}$.
Because it adds scaling to dot-product attention, this is called Scaled dot-product Attention.

The numbers 128 and 32 in the original figure are arbitrary and can be ignored.
The attention score in that figure indicates how strongly the i-th word is related to each of the other words.
As the scaling value, the transformer uses $\sqrt{d_k}$, the square root of the dimension $d_k$ of the K vectors.
In the paper $d_k$ = $d_{model}$ / num_heads = 64, so $\sqrt{d_k}$ = 8.

Next, softmax is applied to the attention scores to get the attention distribution, and the V vectors are summed with those weights to obtain the attention value.
This is called the attention value for the i-th word, or the context vector for the i-th word.
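The three steps for a single query (scaled scores, softmax, weighted sum) can be sketched without any library. The function name `attend` and the toy vectors are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(q)
    # Attention score: dot product with each key, divided by sqrt(d_k).
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Attention distribution.
    weights = softmax(scores)
    # Attention value (context vector): weighted sum of the value vectors.
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, context

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = attend(q, keys, values)
print(weights)   # the first key gets the larger weight
print(context)
```

Because the query is closer to the first key, the context vector leans toward the first value.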

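4) Batch processing with matrix operations¶

Instead of handling each vector separately, everything is computed at once with matrix operations.

First, rather than multiplying each word vector by the weight matrices one at a time, the sentence matrix is multiplied by the weight matrices to get the Q, K, and V matrices.

Multiplying the Q matrix by the transpose of the K matrix yields a matrix whose entries are the dot products between each word's Q vector and each word's K vector.
Dividing every entry of that result by $\sqrt{d_k}$ gives a matrix whose rows and columns hold the attention scores.
For example, the entry at row "I" and column "student" is the attention score between the Q vector of "I" and the K vector of "student". This is the attention score matrix.
With the attention scores in hand, what remains is to compute the attention distribution and use it to get the attention value for every word.
This is done simply by applying softmax to the attention score matrix and multiplying by the V matrix.
The result is the attention value matrix, which holds the attention value of every word.

The figure in the original material shows this whole batched computation as a formula.
It matches exactly the formula given in the transformer paper: $$Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The sizes of the matrices involved are as follows. Let the sentence length be seq_len. The sentence matrix then has size (seq_len, $d_{model}$).
It must be multiplied by three weight matrices to produce the Q, K, and V matrices.

To define the sizes, let the size of each row vector be
- $d_k$ for the Q and K vectors, and
- $d_v$ for the V vectors.
Then the Q and K matrices have size (seq_len, $d_k$) and the V matrix has size (seq_len, $d_v$). From the sizes of the sentence matrix and of the Q, K, V matrices, the sizes of the weight matrices follow.
$W^Q$ and $W^K$ have size ($d_{model}$, $d_k$), and $W^V$ has size ($d_{model}$, $d_v$).
In the paper, $d_k$ and $d_v$ both equal $d_{model}$ / num_heads, that is, $d_{model}$ / num_heads = $d_k$ = $d_v$.

Consequently, the attention value matrix a produced by $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ has size (seq_len, $d_v$).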
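The size bookkeeping above can be verified with a small pure-Python sketch. The sizes are deliberately tiny and $d_v$ is chosen different from $d_k$ so that the two are easy to tell apart; the helper names and random weights are assumptions for illustration.

```python
import math
import random

def matmul(A, B):
    """Plain-Python product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(A):
    return (len(A), len(A[0]))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

seq_len, d_model, d_k, d_v = 4, 8, 2, 3
X = rand_matrix(seq_len, d_model)                 # sentence matrix
W_q, W_k = rand_matrix(d_model, d_k), rand_matrix(d_model, d_k)
W_v = rand_matrix(d_model, d_v)

Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
attention = matmul(softmax_rows(scores), V)

print(shape(Q), shape(K), shape(V))       # (4, 2) (4, 2) (4, 3)
print(shape(scores), shape(attention))    # (4, 4) (4, 3)
```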

5) Scaled Dot-Product Attention implementation¶

In [8]:

def scaled_dot_product_attention(query, key, value, mask):
    # query shape: (batch_size, num_heads, query sentence length, d_model/num_heads)
    # key shape: (batch_size, num_heads, key sentence length, d_model/num_heads)
    # value shape: (batch_size, num_heads, value sentence length, d_model/num_heads)
    # padding_mask: (batch_size, 1, 1, key sentence length)
    
    # Product of Q and K^T: the attention score matrix
    matmul_qk = tf.matmul(query, key, transpose_b=True)
    
    # Scaling: divide by the square root of d_k
    depth = tf.cast(tf.shape(key)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(depth)
    
    # Masking: add a very large negative number (-1e9) at the positions to mask,
    # so those positions become 0 after softmax
    if mask is not None:
        logits += (mask * -1e9)
        
    # softmax runs along the last axis, the key sentence length
    # attention weights: (batch_size, num_heads, query sentence length, key sentence length)
    attention_weights = tf.nn.softmax(logits, axis=-1)
    
    # output: (batch_size, num_heads, query sentence length, d_model/num_heads)
    output = tf.matmul(attention_weights, value)
    
    return output, attention_weights

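The Q matrix is multiplied by $K^T$, the result is scaled, softmax gives the attention distribution matrix, and that is multiplied by the V matrix.
The if statement that uses mask is explained later.

The following tests that scaled_dot_product_attention works as expected: arbitrary Q, K, V matrices temp_q, temp_k, temp_v are created and passed to the function.

In [9]:

# Create arbitrary Q, K, V matrices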
np.set_printoptions(suppress=True)

temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)

temp_k = tf.constant([[10, 0, 0],
                     [0, 10, 0],
                     [0, 0, 10],
                     [0, 0, 10]], dtype=tf.float32) # (4, 3)

temp_v = tf.constant([[1, 0],
                     [10, 0],
                     [100, 5],
                     [1000, 6]], dtype=tf.float32) # (4, 2)

In [10]:

print(temp_q.shape, temp_k.shape, temp_v.shape)

(1, 3) (4, 3) (4, 2)

In [11]:

temp_out, temp_attn = scaled_dot_product_attention(temp_q, temp_k, temp_v, None)
print('attention distribution: ', temp_attn)
print('attention value: ', temp_out)

attention distribution:  tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
attention value:  tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)

The Query matches the second of the four Keys, so the attention distribution is [0, 1, 0, 0] and the output is the second Value, [10, 0].
Next, only the Query is changed and the function is run again. The new Query [0, 0, 10] matches both the third and the fourth Key.

In [12]:

temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32)
temp_out, temp_attn = scaled_dot_product_attention(temp_q, temp_k, temp_v, None)
print('attention distribution: ', temp_attn)
print('attention value: ', temp_out)

attention distribution:  tf.Tensor([[0.  0.  0.5 0.5]], shape=(1, 4), dtype=float32)
attention value:  tf.Tensor([[550.    5.5]], shape=(1, 2), dtype=float32)

The Query is equally similar to the third and fourth Keys, so the attention distribution is [0, 0, 0.5, 0.5].
The resulting value [550, 5.5] is the element-wise sum of the third Value [100, 5] times 0.5 and the fourth Value [1000, 6] times 0.5.
Next, three Queries instead of one are passed to the function.

In [13]:

temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32) # (3, 3)
temp_out, temp_attn = scaled_dot_product_attention(temp_q, temp_k, temp_v, None)
print('attention distribution: ', temp_attn)
print('attention value: ', temp_out)

attention distribution:  tf.Tensor(
[[0.  0.  0.5 0.5]
 [0.  1.  0.  0. ]
 [0.5 0.5 0.  0. ]], shape=(3, 4), dtype=float32)
attention value:  tf.Tensor(
[[550.    5.5]
 [ 10.    0. ]
 [  5.5   0. ]], shape=(3, 2), dtype=float32)

6) Multi-Head Attention¶

In the attention covered so far, $d_{model}$-dimensional vectors were turned into Q, K, V vectors of dimension $d_{model}$ / num_heads before attention was performed.
With the paper's settings, each 512-dimensional word vector became 64-dimensional Q, K, V vectors (512 / 8). This section explains what num_heads means and why attention is performed on reduced-dimension vectors rather than on the $d_{model}$-dimensional vectors.

The transformer authors judged that running several attentions in parallel is more effective than running a single attention.
So the $d_{model}$ dimension is divided into num_heads parts, and num_heads parallel attentions are performed on Q, K, V of dimension $d_{model}$ / num_heads.
The paper sets the hyperparameter num_heads to 8, so 8 attentions run in parallel.
Each resulting attention value matrix is called an attention head. The weight matrices $W^Q, W^K, W^V$ are different for each of the 8 attention heads.
Splitting attention this way lets each head gather information from a different perspective.
Concatenating all heads gives a matrix of size (seq_len, $d_{model}$).

The concatenated attention heads are multiplied by another weight matrix $W^O$, and the result is the final output of Multi-Head Attention.
The figure in the original material shows the concatenated attention-head matrix being multiplied by $W^O$.
The resulting Multi-Head Attention matrix has the same size as the sentence matrix that entered the encoder, (seq_len, $d_{model}$).

In other words, after the encoder's first sublayer (Multi-Head Attention), the size of the matrix that entered the encoder is still preserved.
That size must be preserved through both the first sublayer (Multi-Head Attention) and the second sublayer (Position-wise FFNN).
The transformer stacks several encoders (6 in the paper), and only if each encoder's output has the same size as its input can it be fed into the next encoder.
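The split-then-concatenate bookkeeping can be sketched in plain Python. This only tracks shapes: in the real layer, attention is applied to each head between the split and the concatenation. The function names `split_heads` and `concat_heads` mirror the idea but are assumptions for this illustration.

```python
def split_heads(X, num_heads):
    """Split a (seq_len, d_model) matrix into num_heads matrices of (seq_len, depth)."""
    depth = len(X[0]) // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in X] for h in range(num_heads)]

def concat_heads(heads):
    """Concatenate the heads back into one (seq_len, d_model) matrix."""
    seq_len = len(heads[0])
    return [sum((head[t] for head in heads), []) for t in range(seq_len)]

seq_len, d_model, num_heads = 3, 8, 4
X = [[t * d_model + j for j in range(d_model)] for t in range(seq_len)]

heads = split_heads(X, num_heads)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 3 2

merged = concat_heads(heads)
print(len(merged), len(merged[0]))  # 3 8
```

The merged matrix has the same (seq_len, $d_{model}$) size as the input, which is what allows encoder layers to be stacked.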

7) Multi-Head Attention implementation¶

Multi-head attention involves two kinds of weight matrices: $W^Q, W^K, W^V$, which produce the Q, K, V matrices, and $W^O$, which is applied after the attention heads are concatenated.
In the implementation, multiplying by a weight matrix is done by passing the input through a dense layer, which corresponds to Dense() in the Keras code.

Multi-Head Attention consists of five parts
1. Pass through dense layers of size $d_{model}$ corresponding to $W^Q, W^K, W^V$
2. Split into the specified number of heads (num_heads)
3. Scaled dot-product attention
4. Concatenate the split heads
5. Pass through the dense layer corresponding to $W^O$

In [14]:

from tensorflow.keras.layers import Dense, Layer

In [15]:

class MultiHeadAttention(Layer):
    
    def __init__(self, d_model, num_heads, name='multi_head_attention'):
        super(MultiHeadAttention, self).__init__(name=name)
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % self.num_heads == 0
        
        # d_model divided by num_heads.
        # In the paper: 64
        self.depth = d_model // self.num_heads
        
        # Dense layers corresponding to WQ, WK, WV
        self.query_dense = Dense(units=d_model)
        self.key_dense = Dense(units=d_model)
        self.value_dense = Dense(units=d_model)
        
        # Dense layer corresponding to WO
        self.dense = Dense(units=d_model)
        
    
    # Split q, k, v into num_heads heads
    def split_heads(self, inputs, batch_size):
        inputs = tf.reshape(inputs, shape=(batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(inputs, perm=[0, 2, 1, 3])
    
    def call(self, inputs):
        query, key, value, mask = inputs['query'], inputs['key'], inputs['value'], inputs['mask']
        batch_size = tf.shape(query)[0]
        
        # 1. Pass through the dense layers corresponding to WQ, WK, WV
        # q: (batch_size, query sentence length, d_model)
        # k: (batch_size, key sentence length, d_model)
        # v: (batch_size, value sentence length, d_model)
        # Note) in encoder(k, v)-decoder(q) attention, the query length can differ from the key/value length
        query = self.query_dense(query)
        key = self.key_dense(key)
        value = self.value_dense(value)
        
        # 2. Split into heads
        # q : (batch_size, num_heads, query sentence length, d_model/num_heads)
        # k : (batch_size, num_heads, key sentence length, d_model/num_heads)
        # v : (batch_size, num_heads, value sentence length, d_model/num_heads)
        query = self.split_heads(query, batch_size)
        key = self.split_heads(key, batch_size)
        value = self.split_heads(value, batch_size)
        
        # 3. Scaled dot-product attention
        # (batch_size, num_heads, query sentence length, d_model/num_heads)
        scaled_attention, _ = scaled_dot_product_attention(query, key, value, mask)
        # (batch_size, query sentence length, num_heads, d_model/num_heads)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        
        # 4. Concatenate the heads
        # (batch_size, query sentence length, d_model)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        
        # 5. Pass through the dense layer corresponding to WO
        # (batch_size, query sentence length, d_model)
        outputs = self.dense(concat_attention)
        
        return outputs

8) Padding Mask¶

Inside the scaled dot-product attention function, the mask argument is multiplied by -1e9, a very large negative number, and added to the attention score matrix.
This effectively excludes (PAD) tokens in the input sentence from attention. Consider self-attention on an input sentence that contains (PAD).
- Attention is performed and the attention score matrix is obtained as usual.

(PAD) is not a word with real meaning. So in the transformer, when a Key is a (PAD) token, it is masked so that no similarity is computed for it and it is excluded from attention.
In the attention score matrix the rows correspond to the Query sentence and the columns to the Key sentence. When a Key is (PAD), its entire column is masked.

Masking is done by putting a very large negative value at the masked positions of the attention score matrix. At this point the attention scores have not yet gone through softmax.
Following the order of operations above, the scores go through softmax and are then multiplied by the Value matrix. Because the masked positions hold a very large negative value, after softmax they become extremely close to 0, so (PAD) tokens do not contribute when similarities between words are computed.
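The effect of adding -1e9 before softmax can be seen on a single row of scores. The scores and mask below are made-up values for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # one row of attention scores
mask = [0.0, 0.0, 1.0, 1.0]     # 1 marks a (PAD) key

# Same operation as in the attention function: scores += mask * -1e9
masked = [s + m * -1e9 for s, m in zip(scores, mask)]
weights = softmax(masked)
print(weights)  # the last two weights are 0.0
```

Without the mask, the last key (score 3.0) would have received the largest weight; with it, all the weight is redistributed over the two real tokens.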

A padding mask is implemented as a function that checks whether each entry of the input integer sequence is the padding index. The function below returns 1 where the sequence is 0 and 0 elsewhere.

In [16]:

def create_padding_mask(x):
    mask = tf.cast(tf.math.equal(x, 0), tf.float32)
    # (batch_size, 1, 1, key sentence length)
    return mask[:, tf.newaxis, tf.newaxis, :]

Pass in an arbitrary integer sequence to see how it is converted.

In [17]:

print(create_padding_mask(tf.constant([[1, 21, 777, 0, 0]])))

tf.Tensor([[[[0. 0. 0. 1. 1.]]]], shape=(1, 1, 1, 5), dtype=float32)

The positions with value 1 in this vector mark the columns to mask in the attention score matrix.
When this vector is passed to scaled dot-product attention, the function multiplies it by -1e9 and adds it to the score matrix, masking those columns entirely.

This completes the first sublayer, multi-head attention. As noted earlier, the encoder is divided into two sublayers. Next is the second sublayer, the Position-wise FFNN.

5. Position-wise FFNN(Feed-Forward Neural Network)¶

The discussion so far has focused on the encoder, but the FFNN is a sublayer that the encoder and decoder both have.
The FFNN can be read as a fully connected network. The position-wise FFNN is given by
- $FFNN(x) = \max(0, xW_1 + b_1)W_2 + b_2$

Here x is the matrix of size (seq_len, $d_{model}$) produced by multi-head attention.
The weight matrix $W_1$ has size ($d_{model}$, $d_{ff}$) and $W_2$ has size ($d_{ff}$, $d_{model}$).
In the paper the hidden size $d_{ff}$ is 2048.

The parameters $W_1, b_1, W_2, b_2$ are shared exactly across different sentences and different words within one encoder layer, but each encoder layer has its own values.
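"Position-wise" means the same $W_1, b_1, W_2, b_2$ are applied to every position independently. A minimal pure-Python sketch with $d_{model}$ = 2 and $d_{ff}$ = 3 follows; the weights are made-up numbers and the helper names are assumptions for illustration.

```python
def affine(x, W, b):
    """Compute xW + b for a row vector x."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) + b[c] for c in range(len(b))]

def relu(v):
    return [max(0.0, a) for a in v]

def ffnn(x, W1, b1, W2, b2):
    # FFNN(x) = max(0, x W1 + b1) W2 + b2
    return affine(relu(affine(x, W1, b1)), W2, b2)

W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]          # (d_model, d_ff) = (2, 3)
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]                # (d_ff, d_model) = (3, 2)
b2 = [0.5, -0.5]

sentence = [[1.0, 2.0], [-1.0, 0.0]]  # (seq_len, d_model) = (2, 2)
# The same parameters are reused for every position in the sentence.
outputs = [ffnn(x, W1, b1, W2, b2) for x in sentence]
print(outputs)  # [[1.5, 2.5], [0.5, 0.5]]
```

The output keeps the (seq_len, $d_{model}$) size of the input.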

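The left side of the figure in the original material views the encoder input as individual vectors, each passing through the multi-head attention layer (the first sublayer) and then through the FFNN.
That FFNN is the second sublayer, the position-wise FFNN. In practice it is computed on matrices, and the final output of the encoder after the second sublayer still has the encoder input's size, (seq_len, $d_{model}$).
The matrix that leaves one encoder layer is passed to the next encoder layer, where the same encoder operations are repeated.

It is implemented as follows.

In [18]:

# Code to be used inside the encoder and decoder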
'''
outputs = Dense(units=dff, activation='relu')(attention)
outputs = Dense(units=d_model)(outputs)

'''

Out[18]:

"\noutputs = Dense(units=dff, activation='relu')(attention)\noutputs = Dense(units=d_model)(outputs)\n\n"

6. Residual connection and Layer normalization¶

With the two encoder sublayers understood, the encoder is almost fully explained. On top of the two sublayers, the transformer's encoder uses one more technique, Add & Norm,
or more precisely, residual connection and layer normalization.

The figure in the original material is the earlier Position-wise FFNN figure with arrows and Add & Norm added.
Note that the added arrows start at the input of a sublayer and point to that sublayer's output.

1) Residual connection¶

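To understand this, consider a function $H(x)$.

$H(x)$ adds the input x to the value of some function $F(x)$ of x. In the transformer, $F(x)$ corresponds to a sublayer.
In other words, a residual connection adds a sublayer's input to its output.
As mentioned earlier, the input and output of a transformer sublayer have the same dimension, so they can be added.
This is why each arrow in the encoder figure was drawn from a sublayer's input to its output.
Residual connections are a technique, widely used in computer vision, that helps models train.

As a formula this is $x + Sublayer(x)$.
For example, if the sublayer is Multi-Head Attention, the residual connection computes

$H(x) = x + \text{Multi-Head Attention}(x)$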

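2) Layer Normalization¶

The result of the residual connection then goes through layer normalization. With x the input of the residual connection and LN the result after both operations, residual connection followed by layer normalization is written as
- $LN = LayerNorm(x + Sublayer(x))$

Layer normalization computes the mean and variance over the last dimension of the tensor and uses them to normalize the values, which helps training.
In the transformer, the last dimension of the tensor is the $d_{model}$ dimension.
The figure in the original material marks the direction of the $d_{model}$ dimension with arrows.

For layer normalization, the mean and variance are first computed along each arrow. Call the vector along each arrow $x_i$.

After layer normalization, the vector $x_i$ is normalized into a vector $ln_i$.

$ln_i = LayerNorm(x_i)$

Layer normalization can be explained in two steps
1. Normalization using the mean and variance
2. Introducing gamma and beta

1. $\hat{x}_{i,k} = \frac{x_{i,k} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$, where $\epsilon$ keeps the denominator from being 0
2. Gamma and beta are initialized to 1 and 0, respectively
  - With these two, the final layer normalization formula is as follows; gamma and beta are learnable parameters
  - $ln_i = \gamma \hat{x}_i + \beta = LayerNorm(x_i)$

Keras provides LayerNormalization(), which is used here.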
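To see what Add & Norm computes on one position vector, here is a minimal pure-Python sketch of the two formulas. The vector `sub` is a made-up stand-in for a sublayer output, and `layer_norm` is a hypothetical helper, not the Keras layer.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize one vector over its last (d_model) dimension, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]          # sublayer input (d_model = 4)
sub = [0.5, -0.5, 0.5, -0.5]      # stand-in for Sublayer(x)

# Residual connection: x + Sublayer(x)
residual = [a + b for a, b in zip(x, sub)]

# Layer normalization with the initial values gamma = 1, beta = 0
ln = layer_norm(residual, gamma=[1.0] * 4, beta=[0.0] * 4)
print(residual)                      # [1.5, 1.5, 3.5, 3.5]
print([round(v, 3) for v in ln])     # [-1.0, -1.0, 1.0, 1.0]
```

After normalization the vector has mean 0 and (up to epsilon) variance 1; training then adjusts gamma and beta.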

7. Implementing the Encoder¶

In [19]:

from tensorflow.keras.layers import Dropout, LayerNormalization

In [20]:

def encoder_layer(dff, d_model, num_heads, dropout, name='encoder_layer'):
    inputs = tf.keras.Input(shape=(None, d_model), name='inputs')
    
    # The encoder uses a padding mask
    padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
    
    # Multi-Head Attention (first sublayer / self-attention)
    attention = MultiHeadAttention(d_model, num_heads, name='attention')({
        'query': inputs, 'key': inputs, 'value': inputs, 'mask': padding_mask 
    }) # padding mask, Q = K = V
    
    # Dropout + Add & Norm
    attention = Dropout(rate=dropout)(attention)
    attention = LayerNormalization(epsilon=1e-6)(inputs + attention)
    
    # Position-wise FFNN (second sublayer)
    outputs = Dense(units=dff, activation='relu')(attention)
    outputs = Dense(units=d_model)(outputs)
    
    # Dropout + Add & Norm
    outputs = Dropout(rate=dropout)(outputs)
    outputs = LayerNormalization(epsilon=1e-6)(attention + outputs)
    
    return tf.keras.Model(inputs=[inputs, padding_mask], outputs=outputs, name=name)

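Sentences fed to the encoder may contain padding, so a padding mask is used to exclude padding tokens from attention. This is why padding_mask is passed as the mask argument of MultiHeadAttention.
The encoder consists of two sublayers, MHA and FFNN, and Add & Norm is applied after each sublayer.

The code above is a single encoder block, that is, one encoder layer. The actual transformer uses num_layers encoder layers, so code that stacks them is implemented separately.

8. Stacking Encoders¶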

In [21]:

from tensorflow.keras.layers import Embedding

In [22]:

def encoder(vocab_size, num_layers, dff, d_model, num_heads, dropout, name='encoder'):
    inputs = tf.keras.Input(shape=(None,), name='inputs')
    
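    # The encoder uses a padding mask
    padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
    
    # Embedding (scaled by sqrt(d_model)) + positional encoding + dropout
    # Note: vocab_size is passed as the maximum position here; it only needs
    # to be at least as large as the longest input sequence
    embeddings = Embedding(vocab_size, d_model)(inputs)
    embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)
    outputs = Dropout(rate=dropout)(embeddings)
    
    # Stack num_layers encoder layers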
    for i in range(num_layers):
        outputs = encoder_layer(dff=dff, d_model=d_model, num_heads=num_heads, dropout=dropout, name=f'encoder_layer_{i}')([outputs, padding_mask])
        
    return tf.keras.Model(inputs=[inputs, padding_mask], outputs=outputs, name=name)

9. From Encoder to Decoder¶

After the num_layers encoder layers have been computed in sequence, the output of the last encoder layer is passed to the decoder.
With the encoder finished, the decoder starts and likewise runs num_layers layers, and each decoder layer uses the output sent by the encoder.

10. First sublayer of the decoder: Self-Attention and the Look-Ahead Mask¶

As in the figure in the original material, the decoder, like the encoder, receives a sentence matrix that has gone through an embedding layer and positional encoding.
Like seq2seq, the transformer is trained with teacher forcing, so during training the decoder receives the whole matrix of the target sentence at once and is trained to predict the word at each time step from it.
To keep the prediction at the current time step from looking at words in the future, the transformer decoder introduces the look-ahead mask.

The look-ahead mask is applied in the decoder's first sublayer. That sublayer, Multi-Head Self-Attention, performs the same computation as the encoder's first sublayer.
The only difference is that masking is applied to the attention score matrix.

First, self-attention produces the attention score matrix.

Then the words that lie in the future relative to each position are masked out.

In each row of the masked attention score matrix, a word can attend only to itself and the words before it. Otherwise this sublayer is the same as the encoder's first sublayer: it is self-attention and it performs MHA.
Like the padding mask, the look-ahead mask is passed as the mask argument of the scaled dot-product attention function implemented earlier.
When padding masking is needed the padding mask is passed to that function, and when look-ahead masking is needed the look-ahead mask is passed.

The transformer has three kinds of attention. All of them perform MHA, which calls the scaled dot-product attention function internally, and the mask passed for each is
- encoder self-attention: padding mask
- decoder 1st sublayer, masked self-attention: look-ahead mask
- decoder 2nd sublayer, encoder-decoder attention: padding mask
Using a look-ahead mask does not make the padding mask unnecessary, so the look-ahead mask is implemented to include the padding mask.
As with the padding mask, the look-ahead mask is built to return 1 at positions to mask and 0 at positions to keep.
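The rule "mask a position if it is in the future or if it is padding" can be written out directly in plain Python. The function name `look_ahead_mask` and the `pad_id` parameter are assumptions for this sketch.

```python
def look_ahead_mask(token_ids, pad_id=0):
    """Return an n x n mask: 1 = masked, 0 = visible."""
    n = len(token_ids)
    return [[1 if (j > i or token_ids[j] == pad_id) else 0 for j in range(n)]
            for i in range(n)]

mask = look_ahead_mask([1, 2, 0, 4, 5])
for row in mask:
    print(row)
# [0, 1, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 1, 0, 1]
# [0, 0, 1, 0, 0]
```

Row i is the query position and column j is the key position: everything above the diagonal is masked, and so is the whole column of the padding token at index 2.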

In [23]:

# Mask future tokens in the decoder's first sublayer
def create_look_ahead_mask(x):
    seq_len = tf.shape(x)[1]
    # 1 above the diagonal (future positions), 0 on and below it
    look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    padding_mask = create_padding_mask(x) # also include the padding mask
    
    return tf.maximum(look_ahead_mask, padding_mask)

In [24]:

create_look_ahead_mask(tf.constant([[1, 2, 0, 4, 5]]))

Out[24]:

<tf.Tensor: shape=(1, 1, 5, 5), dtype=float32, numpy=
array([[[[0., 1., 1., 1., 1.],
         [0., 0., 1., 1., 1.],
         [0., 0., 1., 1., 1.],
         [0., 0., 1., 0., 1.],
         [0., 0., 1., 0., 0.]]]], dtype=float32)>

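14. Second sublayer of the decoder: Encoder-Decoder Attention¶

The decoder's second sublayer performs MHA like the previous attentions, but it is not self-attention.
Self-attention is the case where Q, K, and V come from the same source, whereas in encoder-decoder attention the Query comes from the decoder matrix while the Key and Value come from the encoder matrix.
- Encoder first sublayer: Q = K = V
- Decoder first sublayer: Q = K = V
- Decoder second sublayer: Q: decoder matrix / K = V: encoder matrix

Zooming in on the decoder's second sublayer in the original figure, two arrows are drawn from the encoder.

The two arrows are the Key and the Value, which are obtained from the matrix produced by the encoder's last layer.
The Query, in contrast, is obtained from the result matrix of the decoder's first sublayer. With the Query from the decoder and the Key from the encoder, the attention score matrix is computed in the same way as before, with one row per decoder position and one column per encoder position.

Everything else about performing MHA is the same as for the other attentions.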
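Because the Query and the Key come from sentences of different lengths, the score matrix is generally not square. A small pure-Python sketch with made-up numbers (the lengths and the helper `matmul_t` are assumptions for illustration):

```python
def matmul_t(Q, K):
    """Return Q K^T: one row per query vector, one column per key vector."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]

dec_len, enc_len, d_k = 2, 3, 4
# Query vectors come from the decoder, Key vectors from the encoder output.
Q = [[0.1 * (i + j) for j in range(d_k)] for i in range(dec_len)]
K = [[0.2 * (i - j) for j in range(d_k)] for i in range(enc_len)]

scores = matmul_t(Q, K)
print(len(scores), len(scores[0]))  # 2 3  -> (decoder length, encoder length)
```

This is also why the padding mask for this sublayer has to follow the encoder sentence: its columns are encoder positions.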

15. Implementing the Decoder¶

In [25]:

def decoder_layer(dff, d_model, num_heads, dropout, name='decoder_layer'):
    inputs = tf.keras.Input(shape=(None, d_model), name='inputs')
    enc_outputs = tf.keras.Input(shape=(None, d_model), name='encoder_outputs')
    
    # Look-ahead mask (first sublayer)
    look_ahead_mask = tf.keras.Input(shape=(1, None, None), name='look_ahead_mask')
    
    # Padding mask (second sublayer)
    padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
    
    # Multi-Head Attention (first sublayer / masked self-attention)
    attention1 = MultiHeadAttention(d_model, num_heads, name='attention_1')(inputs={
        'query': inputs, 'key': inputs, 'value': inputs, # Q = K = V
        'mask': look_ahead_mask # look-ahead mask
    })
    
    # Residual connection and layer normalization
    attention1 = LayerNormalization(epsilon=1e-6)(attention1 + inputs)
    
    # Multi-Head Attention (second sublayer / encoder-decoder attention)
    attention2 = MultiHeadAttention(d_model, num_heads, name='attention_2')(inputs={
        'query': attention1, 'key': enc_outputs, 'value': enc_outputs, 'mask': padding_mask 
    }) # padding mask, Q != K = V
    
    # Dropout + residual connection and layer normalization
    attention2 = Dropout(rate=dropout)(attention2)
    attention2 = LayerNormalization(epsilon=1e-6)(attention2 + attention1)
    
    # Position-wise FFNN (third sublayer)
    outputs = Dense(units=dff, activation='relu')(attention2)
    outputs = Dense(units=d_model)(outputs)
    
    # Dropout + residual connection and layer normalization
    outputs = Dropout(rate=dropout)(outputs)
    outputs = LayerNormalization(epsilon=1e-6)(outputs + attention2)
    
    return tf.keras.Model(inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask], outputs=outputs, name=name)

The decoder consists of three sublayers. The first and second are both MHA, but the first receives look_ahead_mask as its mask argument while the second receives padding_mask.
This is because the first sublayer performs masked self-attention. All three sublayers are followed by Add & Norm.

As with the encoder, code that stacks num_layers decoder layers is needed.

16. Stacking Decoders¶

In [26]:

def decoder(vocab_size, num_layers, dff, d_model, num_heads, dropout, name='decoder'):
    inputs = tf.keras.Input(shape=(None,), name='inputs')
    enc_outputs = tf.keras.Input(shape=(None, d_model), name='encoder_outputs')
    
    # The decoder uses both a look-ahead mask (first sublayer) and a padding mask (second sublayer).
    look_ahead_mask = tf.keras.Input(shape=(1, None, None), name='look_ahead_mask')
    padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
    
    # Embedding (scaled by sqrt(d_model)) + positional encoding + dropout
    embeddings = Embedding(vocab_size, d_model)(inputs)
    embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)
    outputs = Dropout(rate=dropout)(embeddings)
    
    # Stack num_layers decoder layers
    for i in range(num_layers):
        outputs = decoder_layer(dff=dff, 
                                d_model=d_model, 
                                num_heads=num_heads, 
                                dropout=dropout, 
                                name=f'decoder_layer_{i}')(inputs=[outputs, enc_outputs, look_ahead_mask, padding_mask])
    
    return tf.keras.Model(inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask], outputs=outputs, name=name)

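17. Implementing the Transformer¶

It is now time to combine the encoder and decoder functions implemented so far into the transformer.
The encoder output is passed to the decoder, where it is used in encoder-decoder attention.
At the end of the decoder, a layer with vocab_size neurons is added so that the model can solve a multi-class classification problem.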

In [27]:

def transformer(vocab_size, num_layers, dff, d_model, num_heads, dropout, name='transformer'):
    
    # Encoder input
    inputs = tf.keras.Input(shape=(None,), name='inputs')
    
    # Decoder input
    dec_inputs = tf.keras.Input(shape=(None,), name='dec_inputs')
    
    # Padding mask for the encoder
    enc_padding_mask = tf.keras.layers.Lambda(create_padding_mask, output_shape=(1, 1, None), name='enc_padding_mask')(inputs)
    
    # Look-ahead mask for the decoder (first sublayer)
    look_ahead_mask = tf.keras.layers.Lambda(create_look_ahead_mask, output_shape=(1, None, None), name='look_ahead_mask')(dec_inputs)
    
    # Padding mask for the decoder (second sublayer); built from the encoder
    # input because it masks the encoder positions used as Key and Value
    dec_padding_mask = tf.keras.layers.Lambda(create_padding_mask, output_shape=(1, 1, None), name='dec_padding_mask')(inputs)
    
    # The encoder takes the input sentence and its padding mask;
    # its output enc_outputs is passed to the decoder
    enc_outputs = encoder(vocab_size=vocab_size, num_layers=num_layers, dff=dff, d_model=d_model, num_heads=num_heads, dropout=dropout)(inputs=[inputs, enc_padding_mask])
    
    # The decoder output dec_outputs is passed to the output layer
    dec_outputs = decoder(vocab_size=vocab_size, num_layers=num_layers, dff=dff, d_model=d_model, num_heads=num_heads, dropout=dropout)(
        inputs=[dec_inputs, enc_outputs, look_ahead_mask, dec_padding_mask])
    
    # Output layer for next-word prediction
    outputs = Dense(units=vocab_size, name='outputs')(dec_outputs)
    
    return tf.keras.Model(inputs=[inputs, dec_inputs], outputs=outputs, name=name)

In [28]:

small_transformer = transformer(vocab_size = 9000, # chosen arbitrarily
                                num_layers = 4, 
                                dff = 512,
                                d_model = 128,
                                num_heads = 4,
                                dropout = 0.1,
                                name="small_transformer")

tf.keras.utils.plot_model(small_transformer, to_file='small_transformer.png', show_shapes=True)

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')

18. Loss function and learning rate¶

The transformer's learning rate is not held at a fixed value; it is designed to change as training progresses.
The paper computes the learning rate with the formula below, using warmup_steps = 4,000.
$lrate = d_{model}^{-0.5} \times \min(step\_num^{-0.5},\ step\_num \times warmup\_steps^{-1.5})$
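The formula can be evaluated directly to see its shape: the rate rises linearly during warm-up, peaks at step_num = warmup_steps, and then decays as the inverse square root of the step. A minimal pure-Python sketch (the function name `lrate` is an assumption; step must be at least 1):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate from the formula above; step must be >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, lrate(step))
```

At step 4000 both arguments of min are equal, which is where the schedule switches from warm-up to decay.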

In [29]:

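def loss_function(y_true, y_pred):
    # MAX_LENGTH (the padded sentence length) is not defined in these notes
    # and must be set before this function is called
    y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
    
    # Per-token cross-entropy; y_pred holds logits
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')(y_true, y_pred)
    # Zero out the loss at padding positions (token id 0)
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
    loss = tf.multiply(loss, mask)
    
    return tf.reduce_mean(loss)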
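The masking step can be illustrated with plain numbers. In the sketch below, `mean_all` corresponds to averaging over every position after masking, as tf.reduce_mean does in the cell above; `mean_real` averages over non-padding tokens only and is shown purely as an alternative for comparison, not as what these notes use. The function name and the per-token losses are made up for illustration.

```python
def masked_mean_loss(token_losses, targets, pad_id=0):
    # 1.0 for real tokens, 0.0 for padding
    mask = [0.0 if t == pad_id else 1.0 for t in targets]
    masked = [l * m for l, m in zip(token_losses, mask)]
    mean_all = sum(masked) / len(masked)           # divide by all positions
    mean_real = sum(masked) / max(sum(mask), 1.0)  # divide by real tokens only
    return mean_all, mean_real

mean_all, mean_real = masked_mean_loss([0.5, 1.0, 2.0, 3.0], [5, 7, 0, 0])
print(mean_all, mean_real)  # 0.375 0.75
```

The two losses at padding positions (2.0 and 3.0) contribute nothing in either case.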

In [30]:

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        
    def __call__(self, step):
        # The optimizer may pass an integer step; rsqrt needs a float
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps**-1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

In [31]:

sample_learning_rate = CustomSchedule(d_model=128)

plt.plot(sample_learning_rate(tf.range(200000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")

Out[31]:

Text(0.5, 0, 'Train Step')
