Flow-based Models (Normalizing Flow) (1)

The previous post on autoregressive models covered autoregressive generative models.

Fig 1. Comparison of deep generative models (1)

Among deep generative models, this post looks at flow-based models (also called normalizing flow models).

Introduction

Fig 2. Taxonomy of deep generative models

Autoregressive generative models can compute the likelihood and learn long-range statistics, but sampling is slow and they offer no way to learn features (they lack a latent representation).

Flow-based models (normalizing flow models) can compute the likelihood, that is, model \(p(\mathbf{x})\) directly, while also providing a latent variable.

The core idea of a flow-based model is to model a complex data distribution starting from a simple prior distribution.

Before looking at flow-based models in detail, two pieces of background are needed.

Change of variables (in probability)
Jacobian of invertible functions

Change of Variables Theorem in Probability Density Function

The term "change of variables" may sound unfamiliar, but it is something covered in high school. Consider the following equation.

\( x^6 - 9 x^3 + 8 = 0\)

To solve for \(x\), substitute \(x^3 = t\), which turns the equation into a quadratic.

\( t^2 - 9t + 8 = 0\)

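We then solve for \(t\) and use those solutions to recover \(x\) (here \(t = 1\) or \(t = 8\), so the real solutions are \(x = 1\) and \(x = 2\)).

Now apply the same idea to probability.

For a single random variable \(z \sim \pi(z)\), take an invertible one-to-one function \(f\) and define a new random variable \(x = f(z)\). The density \(p(x)\) of \(x\) and the density \(\pi(z)\) of \(z\) then satisfy the following.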

\( \int{p(x) dx} = \int \pi (z) dz = 1 \)

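Probability mass is preserved locally, \(p(x) \vert dx \vert = \pi(z) \vert dz \vert\), so when \(f\) is a scalar → scalar transformation the change of variables reads as follows.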

\( p(x) = \pi(z) \left\vert \cfrac{dz}{dx} \right\vert = \pi(f^{-1}(x)) \left\vert \cfrac{df^{-1}}{dx} \right\vert \)
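
As a quick numerical check of the scalar formula, the sketch below assumes a standard normal prior and the map \(x = f(z) = e^z\). Both choices, and all function names, are illustrative assumptions and not part of the original post. It verifies that the transformed density still integrates to 1 and that the probability of an interval in \(x\) matches the probability of its preimage in \(z\).

```python
import math
import numpy as np

def prior_pdf(z):
    """Density pi(z) of the assumed prior z ~ N(0, 1)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def f_inv(x):
    """Inverse of the assumed map x = f(z) = exp(z)."""
    return np.log(x)

def df_inv_dx(x):
    """Derivative of f^{-1}(x) = log(x)."""
    return 1.0 / x

def p_x(x):
    """Change of variables: p(x) = pi(f^{-1}(x)) * |d f^{-1} / dx|."""
    return prior_pdf(f_inv(x)) * np.abs(df_inv_dx(x))

def trapezoid(ys, xs):
    """Plain trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))

# Total mass of p(x) over (e^-8, e^8), which covers almost all of the mass.
wide_grid = np.exp(np.linspace(-8.0, 8.0, 20001))
total_mass = trapezoid(p_x(wide_grid), wide_grid)

# P(0.5 <= x <= 2) must equal P(-log 2 <= z <= log 2) under the prior.
interval_grid = np.linspace(0.5, 2.0, 20001)
mass_in_x = trapezoid(p_x(interval_grid), interval_grid)
mass_in_z = math.erf(math.log(2.0) / math.sqrt(2.0))

print(total_mass, mass_in_x, mass_in_z)
```

The factor \(\vert df^{-1}/dx \vert = 1/x\) is what keeps the total mass at 1; dropping it would give a function that no longer integrates to 1.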

Extending this to the multivariable case, where \(f\) is a vector → vector transformation, gives the following.

\( p(\mathbf{x}) = \pi(\mathbf{z}) \left\vert \operatorname{det} \cfrac{\partial \mathbf{z}}{\partial \mathbf{x}} \right\vert = \pi( f^{-1} (\mathbf{x})) \left\vert \operatorname{det} \cfrac{\partial f^{-1}}{\partial \mathbf{x}} \right\vert \)
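
The multivariate formula can be checked in a case where the answer is known in closed form. The sketch below assumes an affine map \(\mathbf{x} = \mathbf{A}\mathbf{z} + \mathbf{b}\) with a standard normal prior, for which \(\mathbf{x} \sim \mathcal{N}(\mathbf{b}, \mathbf{A}\mathbf{A}^\top)\). The matrix `A`, the vector `b`, and the function names are made up for illustration.

```python
import numpy as np

def log_standard_normal(z):
    """log N(z | 0, I) for a single vector z."""
    n = z.shape[-1]
    return -0.5 * np.sum(z**2) - 0.5 * n * np.log(2.0 * np.pi)

# Hypothetical invertible affine map x = f(z) = A z + b.
A = np.array([[2.0, 0.5],
              [-1.0, 1.5]])
b = np.array([1.0, -2.0])

def log_p_x(x):
    """log p(x) = log pi(f^{-1}(x)) + log |det d f^{-1} / d x|."""
    z = np.linalg.solve(A, x - b)            # z = f^{-1}(x)
    _, log_abs_det_A = np.linalg.slogdet(A)  # d f^{-1}/d x = A^{-1}
    return log_standard_normal(z) - log_abs_det_A

def log_gaussian(x, mean, cov):
    """Reference: log N(x | mean, cov) computed directly."""
    n = x.shape[-1]
    diff = x - mean
    _, log_det_cov = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mahalanobis + log_det_cov + n * np.log(2.0 * np.pi))

x = np.array([0.3, 0.7])
via_change_of_variables = log_p_x(x)
via_closed_form = log_gaussian(x, b, A @ A.T)
print(via_change_of_variables, via_closed_form)
```

Both routes give the same log-density, because \(\log \operatorname{det}(\mathbf{A}\mathbf{A}^\top) = 2 \log \vert \operatorname{det} \mathbf{A} \vert\).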

The determinant term accounts for the change of volume under the invertible transformation \(f^{-1}\), so that the density \(\pi(\mathbf{z})\) stays normalized after the change of variables.

This lets us express an unknown distribution \(p(\mathbf{x})\) through the probability density function of \(\mathbf{z}\). (In theory, almost any complex distribution can be transformed into a simple one.)

Jacobian of Invertible Functions

For \(\mathbf{x} = (x_1, \cdots, x_n) = f(\mathbf{z}) = (f_1(\mathbf{z}), \cdots, f_n(\mathbf{z})) \) as above, the Jacobian \(\mathbf{J}_{f}\) of \(f\) is as follows.

\( \mathbf{J}_{f} (\mathbf{z}) = \cfrac{\partial f}{\partial \mathbf{z}} = \begin{bmatrix} \cfrac{\partial f_1}{\partial z_1} & \cdots & \cfrac{\partial f_1}{\partial z_n} \\ \vdots & \ddots & \vdots \\ \cfrac{\partial f_n}{\partial z_1} & \cdots & \cfrac{\partial f_n}{\partial z_n} \end{bmatrix} \)

When \(x = f(z)\) and \(f\) is an invertible, differentiable function, the following holds (inverse function theorem).

\( \cfrac{d f^{-1}(x)}{dx} = \cfrac{d z}{d x} = \left( \cfrac{d x}{d z} \right)^{-1} = \left( \cfrac{df(z)}{dz} \right)^{-1} \)

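The same statement holds in the multivariable case with the matrix inverse in place of the reciprocal, so the Jacobian of the inverse function satisfies the following.

\( \mathbf{J}_{f^{-1}} (\mathbf{x}) = \mathbf{J}_f (\mathbf{z})^{-1}, \quad \mathbf{x} = f(\mathbf{z}) \)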
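
The relation between the two Jacobians can be checked numerically. The sketch below assumes a small two-dimensional map with a closed-form inverse (chosen only for illustration) and approximates both Jacobians with central finite differences; the helper `numerical_jacobian` is a hypothetical utility, not something from the original post.

```python
import numpy as np

def f(z):
    """Assumed invertible map from R^2 to (0, inf) x R."""
    z1, z2 = z
    return np.array([np.exp(z1), z2 + z1**2])

def f_inv(x):
    """Closed-form inverse of f."""
    x1, x2 = x
    z1 = np.log(x1)
    return np.array([z1, x2 - z1**2])

def numerical_jacobian(func, point, eps=1e-6):
    """Central finite-difference Jacobian of func at point."""
    point = np.asarray(point, dtype=float)
    n = point.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (func(point + step) - func(point - step)) / (2.0 * eps)
    return jac

z = np.array([0.4, -1.2])
x = f(z)

J_f = numerical_jacobian(f, z)          # evaluated at z
J_f_inv = numerical_jacobian(f_inv, x)  # evaluated at x = f(z)

# J_{f^{-1}}(x) J_f(z) should be the identity matrix.
product = J_f_inv @ J_f
print(np.round(product, 6))
```

Note that the two Jacobians are evaluated at different points, \(\mathbf{z}\) and \(\mathbf{x} = f(\mathbf{z})\), which is easy to overlook when reading the formula.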

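Now let us obtain the distribution of \(\mathbf{x}\) from \(\mathbf{z} \sim \pi(\mathbf{z}) \), whose distribution is known.

Combining the change of variables with the Jacobian of the inverse, and using \(\operatorname{det}(\mathbf{A}^{-1}) = (\operatorname{det} \mathbf{A})^{-1}\), gives the following. Here \(\left\vert \mathbf{J}_f \right\vert\) denotes the absolute value of the Jacobian determinant.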

\( p(\mathbf{x}) = \pi \left( \mathbf{z} = f^{-1} (\mathbf{x}) \right) \left\vert \operatorname{det} \cfrac{\partial f^{-1}}{\partial \mathbf{x}} \right\vert = \pi \left( \mathbf{z} \right) \left\vert \mathbf{J}_f (\mathbf{z}) \right\vert^{-1} \)

The following shows an invertible function \(f\) applied to a uniform distribution.

Fig 3. Invertible transformation and volume changes

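The top panel shows a volume-preserving bijection, where the Jacobian determinant is 1 and the volume is unchanged. The middle and bottom panels show the other two cases: when the determinant is smaller than 1 the volume shrinks and the density increases, and when it is larger than 1 the volume grows and the density decreases.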
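
The three cases can be reproduced with linear maps applied to a uniform density on the unit square. The matrices below are assumptions picked for illustration and are not the transformations drawn in the figure.

```python
import numpy as np

# Uniform density on the unit square [0, 1]^2 has value 1 everywhere inside.
prior_density = 1.0

# Hypothetical linear maps x = A z with different Jacobian determinants.
maps = {
    "volume-preserving": np.array([[1.0, 0.5], [0.0, 1.0]]),  # shear, det = 1
    "contracting": np.array([[0.5, 0.0], [0.0, 1.0]]),        # det = 0.5
    "expanding": np.array([[2.0, 0.0], [0.0, 1.0]]),          # det = 2
}

areas = {}
densities = {}
for name, A in maps.items():
    abs_det = abs(np.linalg.det(A))
    areas[name] = abs_det                       # area of the image of the square
    densities[name] = prior_density / abs_det   # p(x) = pi(z) |det J_f|^{-1}
    print(f"{name}: area = {areas[name]:.2f}, density = {densities[name]:.2f}")
```

In every case area times density equals 1, which is exactly what the determinant factor guarantees.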

Flow-based Models

Let us now look at flow-based models themselves.

Simple Prior to Complex Data Distributions

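As mentioned above, the core idea of a flow-based model is to express a complex data distribution \(p(\mathbf{x})\) using a simple prior distribution \(p_0(\mathbf{z}_0)\), written \(\pi\) below. Assume the latent variable \(\mathbf{z}_0\) follows a normal distribution \(\mathcal{N}(\mathbf{z}_0 | 0, \mathbf{I})\), and let \(f\) be a composition of \(K\) invertible transformations \(f_1, \cdots, f_K\) with \(\mathbf{z}_i = f_i(\mathbf{z}_{i-1})\). The data density can then be written as follows.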

\( p(\mathbf{x}) = \pi \left( \mathbf{z}_0 = f^{-1} (\mathbf{x}) \right) \prod\limits_{i=1}^K \left\vert \operatorname{det} \cfrac{\partial f_i (\mathbf{z}_{i-1})}{\partial \mathbf{z}_{i-1}} \right\vert^{-1} = \pi \left( \mathbf{z}_0 = f^{-1} (\mathbf{x}) \right) \prod\limits_{i=1}^K \left\vert \mathbf{J}_{f_i} (\mathbf{z}_{i-1}) \right\vert^{-1} \)

The following example transforms a single normal distribution (a unimodal distribution in latent space) into a multimodal distribution (in data space).

Fig 4. An example of transforming a unimodal distribution to a multimodal distribution

What Is a Flow-based Model?

In a flow-based model, sampling is done through the forward transformation \(\mathbf{z} \mapsto \mathbf{x}\).

\( \mathbf{z} \sim p_Z (\mathbf{z}), \quad \mathbf{x} = f_{\boldsymbol{\theta}} (\mathbf{z}) \)

In addition, no separate inference network is needed: the latent representation is inferred through the inverse transformation.

\( \mathbf{z} = f_{\boldsymbol{\theta}}^{-1} (\mathbf{x}) \)

A flow can be written as a composition of invertible transformations \(f_i\) parameterized by \(\boldsymbol{\theta}\), as follows.

\( \mathbf{x} := \mathbf{z}_K = (f_K \circ \cdots \circ f_1)(\mathbf{z}_0) = f_K \left( f_{K-1} \left( \cdots \left( f_1(\mathbf{z}_0) \right) \right) \right) \triangleq f(\mathbf{z}_0) \)

In other words, we start from a simple distribution such as a Gaussian and apply \(K\) invertible transformations.

Writing the final distribution as \(p(\mathbf{x})\) and the distribution of each intermediate variable \(\mathbf{z}_i\) as \(\pi_i\), and taking the log, gives the following.

\( \begin{align*} \log p(\mathbf{x}) = \log \pi_K(\mathbf{z}_K) &= \log \pi_{K-1}(\mathbf{z}_{K-1}) - \log \left\vert \operatorname{det} \cfrac{d f_{K}}{d \mathbf{z}_{K-1}} \right\vert \\ &= \log \pi_{K-2}(\mathbf{z}_{K-2}) - \log \left\vert \operatorname{det} \cfrac{d f_{K-1}}{d \mathbf{z}_{K-2}} \right\vert - \log \left\vert \operatorname{det} \cfrac{d f_K}{d \mathbf{z}_{K-1}} \right\vert \\ &= \cdots \\ &= \log \pi_{0}(\mathbf{z}_{0}) - \sum\limits_{i=1}^K \log \left\vert \operatorname{det} \cfrac{d f_{i}}{d \mathbf{z}_{i-1}} \right\vert \end{align*} \)

Setting the prior to a normal distribution and writing the determinants in terms of Jacobians gives the following.

\( \log p(\mathbf{x}) = \log \mathcal{N} \left( \mathbf{z}_0 = f^{-1}(\mathbf{x}) | 0, \mathbf{I} \right) - \sum\limits_{i=1}^K \log \left\vert \mathbf{J}_{f_i} (\mathbf{z}_{i-1}) \right\vert \)
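
The log-likelihood formula translates almost directly into code. The sketch below is a minimal illustration that assumes each \(f_i\) is an elementwise affine map \(f_i(\mathbf{z}) = e^{\mathbf{s}_i} \odot \mathbf{z} + \mathbf{t}_i\), so its Jacobian is diagonal and \(\log \vert \mathbf{J}_{f_i} \vert = \sum_d s_{i,d}\). The class `ElementwiseAffine`, the helper names, and the parameter values are all hypothetical; real flows use far more expressive layers.

```python
import numpy as np

class ElementwiseAffine:
    """Assumed toy layer: f(z) = exp(log_scale) * z + shift, elementwise."""

    def __init__(self, log_scale, shift):
        self.log_scale = np.asarray(log_scale, dtype=float)
        self.shift = np.asarray(shift, dtype=float)

    def forward(self, z):
        return np.exp(self.log_scale) * z + self.shift

    def inverse(self, x):
        return (x - self.shift) * np.exp(-self.log_scale)

    def log_abs_det_jacobian(self, z):
        # The Jacobian is diag(exp(log_scale)), independent of z for this layer.
        return float(np.sum(self.log_scale))

def log_standard_normal(z):
    n = z.shape[-1]
    return -0.5 * np.sum(z**2) - 0.5 * n * np.log(2.0 * np.pi)

def flow_log_prob(layers, x):
    """log p(x) = log N(z_0 | 0, I) - sum_i log |J_{f_i}(z_{i-1})|."""
    z = np.asarray(x, dtype=float)
    sum_log_det = 0.0
    for layer in reversed(layers):       # undo f_K first, f_1 last
        z = layer.inverse(z)             # z_{i-1} = f_i^{-1}(z_i)
        sum_log_det += layer.log_abs_det_jacobian(z)
    return log_standard_normal(z) - sum_log_det

def flow_forward(layers, z0):
    """Sampling direction: x = f_K(... f_1(z_0))."""
    z = np.asarray(z0, dtype=float)
    for layer in layers:
        z = layer.forward(z)
    return z

layers = [
    ElementwiseAffine([0.3, -0.2], [1.0, 0.0]),
    ElementwiseAffine([-0.5, 0.1], [0.0, 2.0]),
    ElementwiseAffine([0.2, 0.4], [-1.0, 0.5]),
]

rng = np.random.default_rng(0)
z0 = rng.standard_normal(2)
x = flow_forward(layers, z0)
print(flow_log_prob(layers, x))
```

Because a composition of affine maps is again affine, \(\mathbf{x}\) is Gaussian here and the result can be compared against a closed-form Gaussian log-density, which makes this toy flow convenient for testing the bookkeeping of the sum over layers.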

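Interestingly, the first term \( \log \mathcal{N} \left( \mathbf{z}_0 = f^{-1}(\mathbf{x}) | 0, \mathbf{I} \right) \) equals, up to a scale factor and an additive constant, the negative squared error between \(0\) and \(f^{-1}(\mathbf{x})\), so maximizing it resembles minimizing a Mean Squared Error (MSE).

Taking the log of the Gaussian probability density function \( f(x) \propto \operatorname{exp} \left( - \cfrac{(x - \mu)^2}{2 \sigma^2} \right) \) gives \( - \cfrac{(x - \mu)^2}{2 \sigma^2} \) plus a constant, which has the same form as a negated squared error \( - \cfrac{(y_i - \hat{y}_i)^2}{2 \sigma^2} \).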
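
The connection to the squared error can be made concrete for the standard normal case: \(\log \mathcal{N}(\mathbf{z} | 0, \mathbf{I}) = -\tfrac{n}{2} \operatorname{MSE}(\mathbf{z}, 0) - \tfrac{n}{2} \log 2\pi\). The short check below uses an arbitrary example vector, chosen only for illustration.

```python
import numpy as np

# Arbitrary latent vector standing in for z_0 = f^{-1}(x).
z = np.array([0.5, -1.0, 2.0])
n = z.size

# Log-density computed directly from the per-dimension Gaussian pdf.
pdf_values = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
log_density = float(np.sum(np.log(pdf_values)))

# The same quantity written as a scaled, negated MSE against 0 plus a constant.
mse_to_zero = float(np.mean((z - 0.0) ** 2))
log_density_from_mse = -0.5 * n * mse_to_zero - 0.5 * n * np.log(2.0 * np.pi)

print(log_density, log_density_from_mse)
```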

The second term \( \sum\limits_{i=1}^K \log \left\vert \mathbf{J}_{f_i} (\mathbf{z}_{i-1}) \right\vert \) captures the change of volume induced by the Jacobians, and it can be viewed as a regularizer on the invertible transformations \(\{f_i\}\). (This is also where the name 'normalizing flow' comes from.)

From a deep generative modeling perspective, the question is how to model these invertible transformations.

Neural networks are flexible and easy to train, but not every neural network can be used. The following conditions must be met.

The transformation must be invertible, so an invertible neural network is required.
The second term in the equation above, the logarithm of the Jacobian determinant, must be computable, and cheaply so.

A model that satisfies these two conditions is called a normalizing flow or flow-based model.
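
To make the two conditions concrete, here is a minimal sketch of one well-known construction that satisfies both: an affine coupling layer in the style of RealNVP. This post does not describe that layer, so treat the example purely as an illustration. The tiny "networks" use fixed random weights in place of trained parameters, and every name and size in the code is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HALF, HIDDEN = 4, 2, 8

# Fixed random weights stand in for trained parameters (assumption).
W_hidden = rng.normal(scale=0.5, size=(HALF, HIDDEN))
W_scale = rng.normal(scale=0.5, size=(HIDDEN, DIM - HALF))
W_shift = rng.normal(scale=0.5, size=(HIDDEN, DIM - HALF))

def scale_and_shift(z_a):
    """Arbitrary (non-invertible) network that only sees the first half."""
    hidden = np.tanh(z_a @ W_hidden)
    log_scale = np.tanh(hidden @ W_scale)   # bounded for numerical stability
    shift = hidden @ W_shift
    return log_scale, shift

def coupling_forward(z):
    """x_a = z_a, x_b = z_b * exp(s(z_a)) + t(z_a); returns x and log|det J|."""
    z_a, z_b = z[:HALF], z[HALF:]
    log_scale, shift = scale_and_shift(z_a)
    x_b = z_b * np.exp(log_scale) + shift
    log_det = float(np.sum(log_scale))      # Jacobian is lower triangular
    return np.concatenate([z_a, x_b]), log_det

def coupling_inverse(x):
    """Exact inverse: s and t are recomputed from the unchanged half."""
    x_a, x_b = x[:HALF], x[HALF:]
    log_scale, shift = scale_and_shift(x_a)
    z_b = (x_b - shift) * np.exp(-log_scale)
    return np.concatenate([x_a, z_b])

z = np.array([0.3, -0.7, 1.1, 0.2])
x, log_det = coupling_forward(z)
z_reconstructed = coupling_inverse(x)
print(x, log_det, z_reconstructed)
```

The inner network never has to be inverted, which is what allows it to be arbitrary, and the log-determinant is just a sum of its outputs.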

Considerations when Designing Flow Models

Beyond the two conditions above, a few related points matter when designing a flow model.

First, the simpler the prior distribution, the more efficient sampling and likelihood evaluation become. An isotropic Gaussian distribution is therefore a common choice.

Likelihood evaluation : Evaluation of \( \mathbf{x} \mapsto \mathbf{z} \) mapping
Sampling : Evaluation of \( \mathbf{z} \mapsto \mathbf{x} \) mapping

Second, evaluating the likelihood requires the determinant of an \(n \times n\) Jacobian matrix, which costs \(O(n^3)\) in general. That is too expensive to compute at every step of a training loop.

For this reason, invertible transformations whose Jacobian is a triangular matrix are widely used: the determinant of a triangular matrix is the product of its diagonal entries, which can be computed in \(O(n)\).

For example, if the transformation \(x_i = f_i(\mathbf{z}) \) depends only on \(\mathbf{z}_{\leq i}\), that is, on the components up to and including \(z_i\), the Jacobian is a lower triangular matrix as shown below. (If it depends only on \(\mathbf{z}_{\geq i}\), the Jacobian is upper triangular.)

\( \mathbf{J} = \cfrac{\partial f}{\partial \mathbf{z}} = \begin{bmatrix} \cfrac{\partial f_1 }{\partial z_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ \cfrac{\partial f_n}{\partial z_1} & \cdots & \cfrac{\partial f_n}{\partial z_n} \end{bmatrix} \)
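
The shortcut for triangular Jacobians is easy to confirm numerically. The sketch below builds a random lower triangular matrix as a stand-in for such a Jacobian (the size and entries are arbitrary assumptions) and compares the \(O(n)\) diagonal formula with a general-purpose determinant routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random lower triangular matrix standing in for a Jacobian (assumption).
J = np.tril(rng.normal(size=(n, n)))
# Keep the diagonal strictly positive so the matrix is invertible.
J[np.diag_indices(n)] = np.exp(rng.normal(size=n))

# O(n): log |det J| is the sum of log |diagonal entries|.
log_det_diagonal = float(np.sum(np.log(np.abs(np.diag(J)))))

# General-purpose routine (LU-based, O(n^3)) for comparison.
_, log_det_general = np.linalg.slogdet(J)

print(log_det_diagonal, log_det_general)
```

Working with the log of each diagonal entry also avoids the overflow or underflow that multiplying many entries together can cause for large \(n\).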

The next post looks at examples of flow-based models.

License: Attribution, NonCommercial, NoDerivatives