Generative Model Learning (1) - KL Divergence, Maximum Likelihood

In this post, I summarize the background needed to understand how deep-learning-based generative models are trained.

So far we have looked at how to represent \(p(x)\); in this part, let's look at how to learn it.

Learning a Generative Model

Fig 1. Representation and learning of generative model

Generation (Sampling) : Drawing new data from the distribution \(p(\mathbf{x})\) of the data \(\mathbf{x}\). \(\mathbf{x}_\text{new} \sim p(\mathbf{x})\)
Density estimation (Anomaly detection) : If a data point \(x\) looks like a dog (in this example), \(p(x)\) should be high; otherwise it should be low.
Unsupervised representation learning (Feature learning) : The model learns that data showing a dog has features such as ears and a tail.

Suppose we are given a dataset \(\mathcal{D}\) of \(m\) samples drawn from some data distribution \(P_\text{data}\).

Each sample is an assignment of values to random variables, for example pixel intensities in the case of an image. We generally assume that the samples are independent and identically distributed (IID).

We are also given a model family \(\mathcal{M}\), and the goal is to learn a good model \(\widehat{\mathcal{M}}\) from it. (This model defines a distribution \(p_{\widehat{\mathcal{M}}}\).)

In the example above, learning a generative model parameterized by \(\boldsymbol{\theta}\) means bringing the model distribution close to the true data distribution, i.e. reducing the distance \(d(P_\text{data}, P_{\boldsymbol{\theta}})\) between the two distributions.

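In general, however, doing this exactly is nearly impossible, for the following reasons.

The amount of data is limited, so it is hard to approximate the true (underlying) distribution from it.
Learning over all possible data is computationally too expensive.

For example, a 28 by 28 (784 pixels) black-and-white image can be represented as a vector \(\mathbf{X}\) of 784 binary variables. The number of states (images) the model can generate is then \(2^{784} \approx 10^{236}\). Computing the distance above over all of these states is infeasible.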
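The size of this state space can be checked with a few lines of Python. This is only a quick sanity check of the \(2^{784} \approx 10^{236}\) figure; the variable names are my own.

```python
import math

# Number of distinct 28x28 binary images: one bit per pixel.
num_pixels = 28 * 28
num_states = 2 ** num_pixels  # exact, since Python integers are unbounded

# Order of magnitude in base 10.
log10_states = num_pixels * math.log10(2)
num_digits = len(str(num_states))

print(f"log10(2^{num_pixels}) = {log10_states:.2f}")
print(f"2^{num_pixels} has {num_digits} decimal digits")
```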

So how do we find the model \(\widehat{\mathcal{M}}\) that best approximates the distribution \(P_\text{data}\)? The answer depends on the task.

Density estimation : Learn the full distribution, so that any conditional probability of interest can be computed from it later.
Specific prediction task : Use the distribution to predict something.
- Is this email spam or not?
- What is the next frame of a video?
Structure or knowledge discovery : The model itself (its structure) is the object of interest.
- How do some genes interact with other genes?
- Which factors cause cancer?

In this post we only cover the density estimation case.

Learning in Density Estimation

In density estimation, we learn the distribution of the whole data and then use it at inference time to obtain the results we want.

An example is the setting in Fig 1, where we want the distribution \(P_{\boldsymbol{\theta}}\) to be as close as possible to the data distribution \(P_\text{data}\).

To do this, we use the KL divergence, a notion of distance between distributions.

Kullback-Leibler Divergence (KL-divergence)

The KL divergence is a distance-like measure between two distributions, defined as follows.

\( D_{KL} (p \Vert q) = \mathbb{E}_{x \sim p} \left[ \log \cfrac{p(x)}{q(x)} \right] = \sum\limits_{x \in X} p(x) \log \cfrac{p(x)}{q(x)} \)

Interpreting the formula: it is the expected number of extra bits needed to describe samples drawn from \(p\) when using \(q\) instead of \(p\) (bits when the log is base 2).

(From here on, I drop the subscript and simply write \(D\).)

For any probability distributions \(p, q\), the KL divergence satisfies \(D(p \Vert q) \geq 0\), with equality if and only if \(p = q\).

Note that, strictly speaking, the KL divergence is asymmetric, i.e. in general \(D(p \Vert q) \neq D( q \Vert p)\), so it is not a distance function.
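The properties above (non-negativity, zero for identical distributions, asymmetry) can be checked on a small discrete example. This is a minimal sketch; the two distributions `p` and `q` are made-up values chosen only for illustration, and the log is base 2 so the result is in bits.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for two discrete distributions given as lists."""
    # Terms with p(x) = 0 contribute 0 (convention 0 * log 0 = 0).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # hypothetical distribution p
q = [0.9, 0.1]  # hypothetical distribution q

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)

print(f"D(p||q) = {d_pq:.3f} bits")
print(f"D(q||p) = {d_qp:.3f} bits")
print(f"D(p||p) = {d_pp:.3f} bits")
```

The two directions give different values for the same pair of distributions, which is why the KL divergence is not a true distance.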

Fig 2. Asymmetric property of KL divergence

In the figure above, \(p(x)\) is a mixture of two normal distributions and \(q\) is a single normal distribution, fitted by minimizing the KL divergence. The optimal \(q\) differs depending on whether \(D(p \Vert q)\) or \(D(q \Vert p)\) is minimized.
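A setting like the one in Fig 2 can be reproduced roughly with a brute-force search. The sketch below is not how the figure was produced: the mixture parameters (modes at -2 and +2), the grid, and the candidate set for \(q\) are all assumptions of mine, and a simple grid search stands in for actual learning.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """D(p || q) in nats for discrete distributions on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Discretize the real line on [-6, 6] (assumed range and step).
xs = [-6.0 + 0.05 * i for i in range(241)]

# p: mixture of two normals (assumed modes at -2 and +2, std 0.5).
p = normalize([0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)
               for x in xs])

# Candidate single normals q: mean in [-3, 3], std in {0.5, ..., 2.5}.
candidates = [(-3.0 + 0.5 * i, 0.5 * j) for i in range(13) for j in range(1, 6)]

def fit(objective):
    """Return (score, mu, sigma) of the candidate q with the lowest objective."""
    best = None
    for mu, sigma in candidates:
        q = normalize([normal_pdf(x, mu, sigma) for x in xs])
        score = objective(q)
        if best is None or score < best[0]:
            best = (score, mu, sigma)
    return best

forward = fit(lambda q: kl(p, q))  # minimize D(p || q)
reverse = fit(lambda q: kl(q, p))  # minimize D(q || p)

print("argmin D(p||q): mu=%.1f, sigma=%.1f" % forward[1:])
print("argmin D(q||p): mu=%.1f, sigma=%.1f" % reverse[1:])
```

Under these assumptions, minimizing \(D(p \Vert q)\) picks a wide \(q\) centered between the two modes, while minimizing \(D(q \Vert p)\) picks a narrow \(q\) sitting on one of the modes.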

Let's look at this in a bit more detail.

Information theory and Entropy

We want to quantify the intuitive notion of 'information'. To do so, the following should hold.

Likely events should carry little information. (An event that is guaranteed to happen should carry zero information.)
Less likely events should carry more information.
Independent events should have additive information.

In short, the rarer an event is, the more meaningful it is. Accordingly, the self information of an event \(\mathbf{x} = x\) is defined as follows.

\( I(x) = - \log P(x) = \log \left( \cfrac{1}{P(x)} \right) \)

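This is also called the 'surprise'. It is an inverse notion of probability, but instead of simply taking the reciprocal \(1/P(x)\) we take its log: this way an event with \(P(x) = 1\) has zero information, and the information of independent events adds up.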
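The additivity requirement can be checked directly from the definition. For two independent events \(x\) and \(y\), the joint probability factorizes as \(P(x, y) = P(x) P(y)\), so

\( I(x, y) = - \log \left( P(x) P(y) \right) = - \log P(x) - \log P(y) = I(x) + I(y) \)

and for a certain event, \(P(x) = 1\) gives \(I(x) = - \log 1 = 0\).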

Self information deals with a single event only. The amount of information in a whole probability distribution is called the (Shannon) entropy, defined as follows.

\( \begin{align*} H(\mathbf{x}) &= \mathbb{E}_{\mathbf{x} \sim P} [I(x)] \\ &= \sum\limits_{x} P(x) I(x) \\ &= - \mathbb{E}_{\mathbf{x} \sim P} [\log P(x)] \\ &= - \sum\limits_{x} P(x) \log \left( P(x) \right) \end{align*} \)

In other words, entropy is the expected amount of information of an event drawn from the distribution.

Let's take the simplest case, a binary random variable, as an example. In this case the base of the log is usually set to 2.

Suppose we tossed a coin 100 times and got 90 heads and 10 tails. Let's compute the surprise of (H, H, T). Since \(P(H) = 0.9\) (head) and \(P(T) = 0.1\) (tail), the surprise is computed as follows.

\(\text{surprise} = \log_2 \cfrac{1}{0.9 \times 0.9 \times 0.1} = \log_2{1} - \left( \log_2{0.9} + \log_2{0.9} + \log_2{0.1} \right) \approx 3.63 \)

The average surprise per toss, i.e. the entropy, can then be computed from the information of \(H\), \(\log_2 \left( \cfrac{1}{P(H)} \right) \approx 0.15 \), and the information of \(T\), \(\log_2 \left( \cfrac{1}{P(T)} \right) \approx 3.32\), as follows.

\( P(H) \times I(H) + P(T) \times I(T) = 0.9 \times 0.15 + 0.1 \times 3.32 = 0.467 \)

(With unrounded information values the entropy is about 0.469 bits.)
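The coin example can be reproduced in a few lines of Python. This is a small sketch of the calculation above; the names `prob`, `self_information`, `surprise` and `entropy` are my own.

```python
import math

# Empirical probabilities from 100 tosses: 90 heads, 10 tails.
prob = {"H": 0.9, "T": 0.1}

def self_information(outcome):
    """Surprise of a single outcome, in bits."""
    return -math.log2(prob[outcome])

# Surprise of the sequence (H, H, T): tosses are independent,
# so the information of the individual tosses adds up.
sequence = ["H", "H", "T"]
surprise = sum(self_information(o) for o in sequence)

# Entropy: expected surprise of a single toss.
entropy = sum(p * self_information(o) for o, p in prob.items())

print(f"surprise(H, H, T) = {surprise:.3f} bits")
print(f"entropy per toss  = {entropy:.3f} bits")
```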

For a binary variable the entropy has only two terms, so it can be written as follows. (This form appears very often in deep learning! When the same two-term form is used as a loss for binary classification tasks, it goes by the name binary cross entropy (BCE) loss.)

\( -P \log P - (1 - P) \log (1 - P) \)
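To make the connection with the BCE loss concrete, here is a minimal sketch. It assumes the usual two-term form of the BCE, \(-p \log q - (1-p) \log (1-q)\), with target probability \(p\) and predicted probability \(q\); base-2 logs are used to stay consistent with the coin example, whereas deep learning code typically uses the natural log.

```python
import math

def binary_entropy(p):
    """-p log2 p - (1 - p) log2 (1 - p), with 0 * log 0 treated as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binary_cross_entropy(p, q):
    """-p log2 q - (1 - p) log2 (1 - q) for target p and prediction 0 < q < 1."""
    return -p * math.log2(q) - (1 - p) * math.log2(1 - q)

print(binary_entropy(0.5))             # a fair coin is the most uncertain case
print(binary_entropy(0.9))             # the biased coin from the example above
print(binary_cross_entropy(0.9, 0.9))  # equals the entropy when q = p
print(binary_cross_entropy(0.9, 0.5))  # larger when q differs from p
```

When the prediction matches the target distribution, the cross entropy reduces to the binary entropy; otherwise it is larger.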

KL Divergence

Now let's come back to the KL divergence.

\( D_{KL} (p \Vert q) = \mathbb{E}_{x \sim p} \left[ \log \cfrac{p(x)}{q(x)} \right] = \sum\limits_{x \in X} p(x) \log \cfrac{p(x)}{q(x)} \)

Looking at this definition of the KL divergence from the viewpoint of cross entropy,

\( D_{KL} (p \Vert q) = \underbrace{- \sum\limits_{x} p(x) \log{q(x)}}_{\text{cross entropy } H(P, Q)} - \left( \underbrace{- \sum\limits_{x} p(x) \log{p(x)}}_{\text{real information } H(P)} \right) \)

The first term is the cross entropy between the true data distribution \(p\) and the distribution \(q\) predicted by the model, and the second term is the entropy of \(p\), \( H(P) = - \sum\limits_{x} p(x) \log \left( p(x) \right)\).
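The decomposition \(D(p \Vert q) = H(P, Q) - H(P)\) can be verified numerically. The distributions below are arbitrary illustrative values, not taken from any dataset.

```python
import math

def entropy(p):
    """H(P) = -sum p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) = sum p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
d_pq = kl_divergence(p, q)

print(f"H(P, Q) - H(P) = {h_pq - h_p:.6f}")
print(f"D(p || q)      = {d_pq:.6f}")
```

Since \(D(p \Vert q) \geq 0\), the cross entropy is never smaller than the entropy of \(p\).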

Maximum Likelihood Estimation (MLE)

Coming back to learning in density estimation, we now apply the KL divergence to the probability density estimated by the model and the probability density of the real data to measure how different they are.

\( D(P_\text{data} \Vert P_{\boldsymbol{\theta}}) = \mathbb{E}_{x \sim P_\text{data}} \left[ \log \cfrac{P_\text{data} (x)}{P_{\boldsymbol{\theta}}(x)} \right] = \sum\limits_{x} P_\text{data} (x) \log \cfrac{P_\text{data}(x)}{P_{\boldsymbol{\theta}}(x)} \)

If \(D(P_\text{data} \Vert P_{\boldsymbol{\theta}}) = 0\), the two distributions are identical, and the converse also holds.

Let's simplify this.

\( D(P_\text{data} \Vert P_{\boldsymbol{\theta}}) = \mathbb{E}_{x \sim P_\text{data}} \left[ \log \cfrac{P_\text{data} (x)}{P_{\boldsymbol{\theta}}(x)} \right] = \mathbb{E}_{x \sim P_\text{data}} \left[ \log P_\text{data} (x) \right] - \underbrace{\mathbb{E}_{x \sim P_\text{data}} \left[ \log P_{\boldsymbol{\theta}} (x) \right]}_{\text{expected log-likelihood}} \)

In the expression above, the first term does not depend on \(P_{\boldsymbol{\theta}}\). Therefore, minimizing the KL divergence is the same as maximizing the second term, the expected log-likelihood.

\( \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmin}} D(P_\text{data} \Vert P_{\boldsymbol{\theta}} ) = \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmin}} - \mathbb{E}_{x \sim P_\text{data}} \left[ \log P_{\boldsymbol{\theta}} (x) \right] = \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmax}} \mathbb{E}_{x \sim P_\text{data}} \left[ \log P_{\boldsymbol{\theta}} (x) \right] \)
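This equivalence can be seen on a toy model family. The sketch below assumes a Bernoulli family \(P_{\boldsymbol{\theta}}(x=1) = \theta\) and, purely for illustration, pretends that \(P_\text{data}\) is known (a Bernoulli with parameter 0.9); a grid over \(\theta\) stands in for an actual optimizer.

```python
import math

p_data = 0.9  # assumed known here, only for the sake of illustration

def expected_log_likelihood(theta):
    """E_{x ~ P_data}[log P_theta(x)] for a Bernoulli model."""
    return p_data * math.log(theta) + (1 - p_data) * math.log(1 - theta)

def kl_to_model(theta):
    """D(P_data || P_theta) for Bernoulli distributions."""
    return (p_data * math.log(p_data / theta)
            + (1 - p_data) * math.log((1 - p_data) / (1 - theta)))

# Grid of candidate parameters 0.01, 0.02, ..., 0.99.
thetas = [i / 100 for i in range(1, 100)]

theta_min_kl = min(thetas, key=kl_to_model)
theta_max_ll = max(thetas, key=expected_log_likelihood)

print("argmin KL:", theta_min_kl)
print("argmax expected log-likelihood:", theta_max_ll)

# The two objectives differ only by a constant that does not depend on theta.
gaps = [kl_to_model(t) + expected_log_likelihood(t) for t in thetas]
print("KL + expected log-likelihood ranges over:", min(gaps), max(gaps))
```

The sum of the two objectives is the same for every \(\theta\) (it equals \(\mathbb{E}_{x \sim P_\text{data}} [\log P_\text{data}(x)]\)), so both objectives select the same parameter.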

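However, there are two problems with using the expression above as an objective function (loss function) for training.

Because of the log, when a sampled \(x\) has \(P_{\boldsymbol{\theta}} (x) \approx 0\), the (absolute) value of the objective becomes very large.
Since \(H(P_\text{data})\) was dropped, we cannot tell how close we are to the optimum. In general, the true data distribution \(P_\text{data}\) cannot be computed exactly (it is intractable).
- That is, the expectation \( \mathbb{E}_{x \sim P_\text{data}} \left[ \log P_{\boldsymbol{\theta}} (x) \right] \) cannot be computed because \(P_\text{data}\) is unknown, and the KL divergence term cannot be computed either.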

Approximation of expected log-likelihood with the empirical log-likelihood

Therefore, we approximate this term with the empirical log-likelihood. Simply put, we use the given (sampled) dataset \(\mathcal{D}\) to approximate it. The empirical log-likelihood is defined as follows.

\( \mathbb{E}_{\mathcal{D}} \left[ \log P_{\boldsymbol{\theta}} (x) \right] = \cfrac{1}{\left\vert \mathcal{D} \right\vert} \sum\limits_{x \in \mathcal{D}} \log P_{\boldsymbol{\theta}} (x) \)
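The quality of this approximation can be illustrated with a small simulation. The sketch below assumes a Bernoulli data distribution with parameter 0.9 and a fixed Bernoulli model with \(\theta = 0.7\); both values are arbitrary choices of mine. As the dataset grows, the empirical log-likelihood should get closer to the expected log-likelihood.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

p_data = 0.9  # assumed true probability of x = 1 (unknown in practice)
theta = 0.7   # some fixed model parameter

def log_p_theta(x):
    """log P_theta(x) for a Bernoulli model."""
    return math.log(theta) if x == 1 else math.log(1 - theta)

# Expected log-likelihood, computable here only because p_data is assumed known.
expected_ll = p_data * math.log(theta) + (1 - p_data) * math.log(1 - theta)

def empirical_ll(m):
    """Average log-likelihood over a freshly sampled dataset of size m."""
    dataset = [1 if random.random() < p_data else 0 for _ in range(m)]
    return sum(log_p_theta(x) for x in dataset) / len(dataset)

print(f"expected: {expected_ll:.4f}")
for m in (10, 1000, 100000):
    print(f"m = {m:6d}: empirical = {empirical_ll(m):.4f}")
```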

Maximum likelihood learning based on this can be written as follows.

\( \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmax}} \cfrac{1}{\left\vert \mathcal{D} \right\vert} \sum\limits_{x \in \mathcal{D}} \log P_{\boldsymbol{\theta}} (x) \)

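Because the log is monotonically increasing and \(1 / \left\vert \mathcal{D} \right\vert\) is a positive constant, this is equivalent to maximizing the likelihood of the data, as follows. (Assumption: IID)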

\( \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmax}} P_{\boldsymbol{\theta}} \left( x^{(1)}, \cdots , x^{(m)} \right) = \underset{P_{\boldsymbol{\theta}}}{\operatorname{argmax}} \prod\limits_{x \in \mathcal{D}} P_{\boldsymbol{\theta}} (x) \)
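As a concrete instance, maximum likelihood learning can be run on the coin dataset from earlier (90 heads, 10 tails). This is a minimal sketch under my own assumptions: a Bernoulli model with a single parameter \(\theta\), and a grid search in place of a real optimizer.

```python
import math

# Dataset D: 90 heads (1) and 10 tails (0), assumed IID.
dataset = [1] * 90 + [0] * 10

def p_theta(x, theta):
    """Bernoulli model: P_theta(x = 1) = theta."""
    return theta if x == 1 else 1 - theta

def avg_log_likelihood(theta):
    """Empirical log-likelihood: (1 / |D|) * sum of log P_theta(x)."""
    return sum(math.log(p_theta(x, theta)) for x in dataset) / len(dataset)

def likelihood(theta):
    """Likelihood of the whole dataset: product of P_theta(x)."""
    return math.prod(p_theta(x, theta) for x in dataset)

# Grid of candidate parameters 0.01, 0.02, ..., 0.99.
thetas = [i / 100 for i in range(1, 100)]

theta_from_log = max(thetas, key=avg_log_likelihood)
theta_from_prod = max(thetas, key=likelihood)

print("argmax of empirical log-likelihood:", theta_from_log)
print("argmax of likelihood (product):", theta_from_prod)
```

Both objectives select the same parameter, which here is the observed frequency of heads. In practice the log form is preferred, because a product over many samples quickly underflows in floating point.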

License: Attribution-NonCommercial-NoDerivatives
