Nonlinear Optimization (2) - Gaussian-Newton Method, Levernberg-Marquardt Method

Xiang Gao, Tao Zhang 저자의 'Introduction to Visual SLAM from Theory to Practice' 책을 통해 SLAM에 대해 공부한 내용을 기록하고자 한다.

책의 내용은 주로 'SLAM의 기초 이론', 'system architecture', 'mainstream module'을 다룬다.

포스팅에는 기초 이론만 다룰 것이며, 실습을 해보고 싶다면 다음 github를 참조하자.

GitHub - gaoxiang12/slambook2: edition 2 of the slambook

edition 2 of the slambook. Contribute to gaoxiang12/slambook2 development by creating an account on GitHub.

github.com

이전 포스팅에서 State estimation(MAP, MLE problem)을 어떻게 least-square problem으로 바꿀 수 있는지 알아보았다. 그 내용은 다음 링크를 참조하자.

https://jjuke-brain.tistory.com/entry/Nonlinear-Optimization-Least-square-Problem-Gauss-Newton-method-Levenburg-Marquardt-Method?category=987160

Nonlinear Optimization (1) - Batch State Estimation into Least-square Problem

Xiang Gao, Tao Zhang 저자의 'Introduction to Visual SLAM from Theory to Practice' 책을 통해 SLAM에 대해 공부한 내용을 기록하고자 한다. 책의 내용은 주로 'SLAM의 기초 이론', 'system architecture', 'ma..

jjuke-brain.tistory.com

Nonlinear Least-square Problem

간단한 least-square problem을 살펴보자.

\( \underset{\mathbf{x}}{\min}F(\mathbf{x}) = \cfrac{1}{2} \lVert f(\mathbf{x}) \rVert_2^2 \)

여기서 \(f(\mathbf{x})\)는 residual(잔차), 즉 error의 합과 관련된 term으로, state variable \( \mathbf{x} \in \mathbb{R}^n \)에 대해 \( f(\mathbf{x}) : \mathbb{R}^n \mapsto \mathbb{R} \)을 만족하는 scalar nonlinear function이다.

\(\cfrac{1}{2}\)은 미분을 간단히 해주기 위한 coefficient로, 결과에 영향을 주지는 않는다. Objective function \(F\)의 도함수가 0벡터라면 최적의 \(\mathbf{x}\)값을 찾을 수 있을 것이다. (scala function에서 극값을 찾는 것과 같은 개념이다.)

\( \cfrac{ \mathrm{d} F}{ \mathrm{d} \mathbf{x}} = \boldsymbol{0} \)

위 식을 풀면 minimum, maximum, saddle point를 얻을 수 있다. 식은 매우 간단해 보이지만, \(f\)가 어떤 함수인가에 따라 푸는 난이도가 매우 어려워질 수 있다.

식을 푸는 데에는 objective function의 global properties를 알아야 하는데, 보통 이를 알 수 없는 경우가 많다. 이렇게 직접적으로 least-square problem을 풀기 힘든 경우, iterated methods를 사용한다.

Iterated methods란, 초기값에서 시작하여 현재 추정치를 업데이트하는 과정을 반복하여 objective function이 감소하도록 하는 것을 말한다.

구체적인 과정은 다음과 같다.

initial value \( \mathbf{x}_0 \)가 주어진다.
\(k\)번째 iteration에 대해, object function \( \lVert f (\mathbf{x}_k + \Delta \mathbf{x}_k) \rVert_2^2 \)이 더 작은 값을 갖도록 하는 incremental value(증가량) \( \Delta \mathbf{x}_k \)를 찾는다.
\( \Delta \mathbf{x}_k \) 값이 충분히 작아지면 알고리즘을 종료하고, 아니면 \( \mathbf{x}_{k+1} = \mathbf{x}_k + \Delta \mathbf{x}_k \)에 대해 2번 과정을 반복한다.

도함수를 구해서 해를 구해야하는 문제에서, 훨씬 간단하게 increments \(\Delta \mathbf{x}_k\)를 감소시키는 문제로 바꿔주었다.

이후에 objective function이 선형 근사(linearly approximated)될 수 있기 때문에 increment 계산 방식이 더 간단함을 보일 것이다.

증가량이 매우 작아질 때까지 함수가 감소하게 될 때, 알고리즘이 converge(수렴)한다고 본다. 이때 문제는 각 iteration에서 증가량을 어떻게 찾을 것인가(local problem)이다. 여기서는 global properties가 아닌, 해당 iteration에서 \(f\)의 local properties만 고려하면 된다.

The First and Second-order Method

\( \Delta \mathbf{x}_k \)를 어떻게 찾는지 알아보자.

\(k\)번째 iteration에서, \( \mathbf{x}_k \)에서의 increment \( \Delta \mathbf{x}_k \)를 찾으려 한다. 가장 직관적인 방법은 \( \mathbf{x}_k\)에 대한 object function의 Taylor expansion을 사용하는 것이다.

\( F (\mathbf{x}_k + \Delta \mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)^T \Delta \mathbf{x}_k + \cfrac{1}{2} \Delta \mathbf{x}_k^T \mathbf{H}(\mathbf{x}_k) \Delta \mathbf{x}_k \)

\(\mathbf{J}\)는 Jacobian( \(\mathbf{x}\)에 대한 \(F(\mathbf{x})\)의 1차 derivative)이고, \( \mathbf{H} \)는 Hessian(2차 derivative)이다. Taylor expansion 중 1차/2차까지만 취하여 그 해를 구한 것을 first-order/second-order method라 한다.

First-order Method

\( \F(\mathbf{x}_k + \Delta \mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)^T \Delta \mathbf{x}_k \)

가장 간단한 방법으로 1차항만 취하여 음의 gradient 방향으로 increment를 취하여 함수를 감소시킬 수 있다.

\( \Delta \mathbf{x}^* = - \mathbf{J}(\mathbf{x}_k) \)

물론 이는 방향을 나타낼 뿐이므로 parameter \( \lambda \)를 사용하여 step length를 설정해준다.

딥러닝에 등장하는 gradient descent라는 방법과 같고, 다음과 같은 식을 갖는다.

\( \mathbf{x}_{k+1} = \mathbf{x}_k + \Delta \mathbf{x}^* = \mathbf{x}_k - \lambda \mathbf{J}(\mathbf{x}_k) \)

책에서는 이 방법을 steepest descent method라 한다.

직관적인 의미를 살펴보자면, gradient 반대 방향으로 움직이면서 object function은 1차(선형) 근사에 따라 감소할 것이다.

이러한 방법은 \(k\)번째 iteration에서만 유효하고, \(k\)에 대한 정보를 포함하지 않는다는 점을 유의하자.

Newton's Method

이번에는 Taylor expansion의 2차항까지 고려해보자. 그러면 increment equation은 다음과 같을 것이다.

\( \Delta \mathbf{x}^* = \operatorname{argmin} \left( F(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} + \cfrac{1}{2} \Delta \mathbf{x}^T \mathbf{H} \Delta \mathbf{x} \right) \)

우변은 \( \Delta \mathbf{x} \) term에 대한 0차, 1차, 2차항을 포함한다는 것을 알 수 있다.

\( \Delta \mathbf{x} \)의 도함수를 찾아 \(\boldsymbol{0}\)으로 만들게 되면 다음과 같은 과정을 거친다.

\( \mathbf{J} + \mathbf{H} \Delta \mathbf{x} = \boldsymbol{0} \Rightarrow \mathbf{H} \Delta \mathbf{x} = - \mathbf{J} \)

위와 같은 linear equation을 풀고 increment 값을 얻는 것을 Newton's method라 한다.

위 두 가지 방법들은 Taylor expansion을 사용하여 계산을 진행하므로 아주 직관적이며, 쉬워 보인다. 실제로 원래의 object function이 지역적으로(locally) 1차 혹은 2차 함수 형태라면 두 방법이 항상 유효하다.

하지만, 다음과 같은 단점을 갖고 있다.

Steepest descent method는 너무 greedy하고, 발산하기 쉬우며, 반복이 많이 필요하다.
Newton's method는 \( \mathbf{H} \)를 계산해야 하는데, 목적함수가 복잡해지면 시간이 너무 오래걸린다.
- 이로 인해 보통 \( \mathbf{H} \)를 계산하는 것은 피하는 경향이 있다. 이에 따라 quasi-Newton method 등의 방법이 고안되기도 했다.

Least-square problem을 해결하는 데에는 좀 더 practical한 method를 사용한다. 대표적인 두 가지는 Gauss-Newton's method와 Levernburg-Marquardt's method이다.

The Gauss-Newton Method

Gauss-Newton method는 \( f(\mathbf{x}) \)의 first-order Taylor expansion을 사용하여 간단하면서도 objective function \(F(\mathbf{x})\) 대신 residual인 \(f(\mathbf{x})\)를 사용하여 좋은 성능을 나타낸다.

\( f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \)

여기서 \( \mathbf{J} (\mathbf{x})\)는 Jacobian matrix가 아닌 \(\mathbf{x}\)에 대한 \(f(\mathbf{x})\)의 1차 편도함수를 요소로 갖는 column vector이다. 최종 목표는 \( \lVert f (\mathbf{x} + \Delta \mathbf{x}) \rVert^2 \)가 minimum에 도달하도록 하는 increment \( \Delta \mathbf{x} \)를 찾는 것이다. 이러한 \( \Delta \mathbf{x} \)를 찾기 위해, 아래와 같은 linear least-square problem을 풀어야 한다.

\( \Delta \mathbf{x}^* = \underset{\Delta \mathbf{x}}{\operatorname{argmin}} \cfrac{1}{2} \lVert f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \rVert^2 \)

이전에는 objective function \(F(\mathbf{x})\)에 직접 Taylor expansion을 적용하여 계산했지만, Gauss-Newton method에서는 residual인 \(f(\mathbf{x})\)를 object function으로 설정하여 Taylor expansion을 적용한다는 점이 다르다.

따라서 아래와 같은 과정을 거친다.

\( \begin{align*} \cfrac{1}{2} \lVert f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \rVert^2 &= \cfrac{1}{2} \left( f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \right)^T \left( f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \right) \\ &= \cfrac{1}{2} \left( \lVert f(\mathbf{x}) \rVert_2^2 + 2 f(\mathbf{x}) \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} + \Delta \mathbf{x}^T \mathbf{J}(\mathbf{x}) \mathbf{J}(\mathbf{x})^T \Delta \mathbf{x} \right) \end{align*} \)

위 식에서 \( \Delta \mathbf{x} \)에 대한 도함수를 찾고, \(\boldsymbol{0}\)으로 두면 다음과 같은 식을 얻을 수 있다.

\( \mathbf{J}(\mathbf{x}) f (\mathbf{x}) + \mathbf{J}(\mathbf{x}) \mathbf{J}^T(\mathbf{x}) \Delta \mathbf{x} = \boldsymbol{0} \)

\( \mathbf{J}(\mathbf{x}) \mathbf{J}^T(\mathbf{x}) \Delta \mathbf{x} = - \mathbf{J} (\mathbf{x}) f (\mathbf{x}) \)

이 식은 variable \( \Delta \mathbf{x} \)에 대한 선형 방정식이고, normal equation 혹은 Gauss-Newton equation이라 한다. 이러한 normal equation을 푸는 것이 전반적인 optimization problem의 핵심이다.

\( \mathbf{J}(\mathbf{x}) \mathbf{J}^T(\mathbf{x}) \) term이 Newton's method의 Hessian matrix 개념을 대체하고, 우변은 \(g(\mathbf{x})\)로 둔다. 따라서 \( \mathbf{H} \)의 계산을 하지 않아도 된다.

Gauss-Newton method의 알고리즘은 다음과 같다.

\( \mathbf{x}_0 \)를 initial value로 설정한다.
\(k\)번째 iteration에 대해, jacobian \(\mathbf{J}(\mathbf{x}_k)\)와 residual \(f(\mathbf{x}_k)\)를 계산한다.
normal equation \( \mathbf{H} \Delta \mathbf{x}_k = \mathbf{g} \)를 푼다.
\( \Delta \mathbf{x}_k \)가 충분히 작다면, 알고리즘을 종료하고, 아니면 \( \mathbf{x}_{k+1} = \mathbf{x}_k + \Delta \mathbf{x}_k \)로 두고 2번 과정으로 돌아간다.

increment \(\Delta \mathbf{x}_k\)를 잘 구한다면, object function이 잘 감소할 것이다.

해를 구하기 위해서는 \( \mathbf{H}^{-1} \), 즉 \( \mathbf{J} \mathbf{J}^T \)의 inverse를 구해야 하는데, \( \mathbf{J} \mathbf{J}^T \)는 semi-positive definite이기 때문에 inverse가 존재하지 않는다.

알고리즘이 수렴하지 않는 경우는 다음과 같이 두 경우로 나뉜다.

\(\mathbf{J}\)가 null-space를 갖는다면, 즉 \( \mathbf{J} \Delta \mathbf{x} = \boldsymbol{0} \)을 만족하는 non-zero \( \Delta \mathbf{x}\)를 찾을 수 있다면, 해를 찾을 수 없다.
- 이 때에는 알고리즘이 수렴(converging)하지 않는다. 직관적으로 해당 점에서 원래 함수의 local approximation이 2차 함수가 아닐 것이다.
Step size \(\Delta \mathbf{x}\)가 매우 크다면 마찬가지로 해당 점에서의 선형 근사가 정확하지 않게 되어서 발산할 수 있다.

Gauss-Newton method는 위와 같은 단점을 갖고 있지만 간단하면서도 효과적인 nonlinear optimization 방법이어서 배울 가치가 있다. Nonlinear optimization의 알고리즘에 Gauss-Newton method의 변환이 사용된다.

예를 들어, line search method 중 하나는 step size \(\alpha\)를 추가하여 \( \lVert f( \mathbf{x} + \alpha \Delta \mathbf{x}) \rVert^2 \)를 최소화하도록 한다. 그리고 Levenberg-Marquardt method는 Gauss-Newton method를 확장하여 더 느리지만 강인한 알고리즘이다. (Damped Newton method라고도 한다.)

그렇다면, Levenberg-Marquardt method에 대해 알아보자.

The Levenberg-Marquardt Method

Gauss-Newton method에서는 Taylor expansion의 2차 근사를 사용했는데, 이는 expansion point 근처에서만 좋은 근사 결과를 보인다. 따라서 근사한 함수에 대한 trust region을 정의할 필요가 있다.

Trust region이란, 2차 근사가 유효한 범위를 말한다. 이렇게 trust-region 개념을 포함하는 알고리즘을 trust-region method라 한다.

이러한 trust region을 결정하는 방법은 실제 object function과 근사 함수의 결과의 차이를 고려하는 것이다. (당연히 차이가 적을수록 근사가 유효한 것이다.) 이를 나타내는 indicator \(\rho\)를 아래와 같이 정의한다.

\( \rho =\cfrac{f(\mathbf{x} + \Delta \mathbf{x}) + f(\mathbf{x})}{\mathbf{J}(\mathbf{x})^T \Delta \mathbf{x}} \)

분자는 실제 object function의 감소량을, 분모는 approximate model의 감소량을 나타낸다. \(\rho\)가 1에 가깝다면 근사가 잘된 것이다.

값이 작을수록 실제 감소량이 근사 모델보다 작으므로 trust-region을 줄여야 한다는 것이고, 반대로 값이 크면 실제 감소가 더 크다는 의미이므로 approximate range를 늘려야 한다.

따라서 알고리즘은 다음과 같다.

initial value \(\mathbf{x}_0\)과 initial trust-region radius \(\mu\)를 설정한다.
k번째 반복에서, trust-region이 추가된 Gauss-Newton method에 기반하여 linear problem을 푼다.
- \( \underset{\Delta \mathbf{x}_k}{\min{\cfrac{1}{2} \lVert f(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)^T \Delta \mathbf{x}_k \rVert^2}}, \quad \text{s.t.} \lVert \mathbf{D} \Delta \mathbf{x}_k \rVert^2 \leq \mu, \)
- 여기서, \(\mu\)는 radius, \( \mathbf{D} \)는 coefficient matrix이다.
위의 \(\rho\)를 정의한 식에 따라 \(\rho\)를 계산한다.
\(\rho > \cfrac{3}{4}\)이면 \( \mu = 2 \mu\)로, \(\rho < \cfrac{1}{4}\)이면 \(\mu = 0.5 \mu \)로 둔다.
\( \rho\)가 threshold보다 커지면 \(\mathbf{x}_{k+1} = \mathbf{x}+k + \Delta \mathbf{x}_k\)로 둔다.
수렴하지 않았으면 2번 과정을 반복하고, 아니면 결과를 반환한다.

2번 과정의 식에서 \(\mu\)를 반지름으로 갖는 구 내부로 increment를 제한한다. 즉, 구를 벗어나면 근사가 유효하지 않게 된다. 또한 non-negative diagonal matrix \(\mathbf{D}\)를 통해 타원체(ellipsoid) 내부로 제한한다.

보통 \( \mathbf{J}^T \mathbf{J} \)의 대각 요소들의 square root를 \( \mathbf{D}\)로 사용하여 graident가 작은 차원에서 제한 범위가 더 커지도록 하였다.

\( \mathbf{D} = \text{diag} \left( \mathbf{J}^T \mathbf{J} \right)^{1/2} \)

그런데, Levenberg-Marquardt optimization에서는 gradient를 구하기 위해 2번 식과 같은 sub-problem을 풀어야 하는데, 이 문제는 inequality constraints를 갖는다. 제한 조건이 있는 최적화문제를 풀기 위해서는 Lagrangian multiplier \(\lambda\)를 추가하여 Lagrangian function으로 만든 후에 푼다.

\( \mathcal{L}(\Delta \mathbf{x}_k, \lambda) =\cfrac{1}{2} \lVert f(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)^T \Delta \mathbf{x}_k \rVert^2 + \cfrac{\lambda}{2} \left( \lVert \mathbf{D} \Delta \mathbf{x}_k \rVert^2 - \mu \right) \)

Gauss-Newton method와 비슷하게, \(\Delta \mathbf{x}\)에 대한 Lagrangian function의 도함수를 0으로 두어 풀면 minimum value를 구할 수 있게 된다.

\( \left( \mathbf{H} + \lambda \mathbf{D}^T \mathbf{D} \right) \Delta \mathbf{x}_k = \mathbf{g} \)

여기서 \(\lambda\) parameter가 작으면 해당 범위에서 \(\mathbf{H}\)의 영향이 커지므로 quadratic approximation model이 더 좋은 결과를 보인다. (Gauss-Newton method와 비슷해진다.)

반면 값이 크면 \(\lambda \mathbf{D}^T \mathbf{D}\)가 dominant해진다. 즉, Gradient descent(steepest descent) 방법과 더 유사해진다.

이 방법을 통해 coefficient matrix에 대한 non-singular, ill-conditioned problem을 피할 수 있고, 더 정확하고 안정적인 increments \(\Delta \mathbf{x} \)를 얻을 수 있다

요약하자면, Levenberg method(\( \mathbf{D}^T \mathbf{D} = \mathbf{I} \))를 통해 singular matrix 문제를 해결하고, Levenberg-Marquardt 방법에서는 \( \mathbf{D}^T \mathbf{D} \) \( \mathbf{J}^T \mathbf{J}\)의 diagonal을 추가하여 증감을 결정할 수 있다.

저작자표시 비영리 변경금지 (새창열림)

Nonlinear Optimization (2) - Gaussian-Newton Method, Levernberg-Marquardt Method

Nonlinear Least-square Problem

The First and Second-order Method

First-order Method

Newton's Method

The Gauss-Newton Method

The Levenberg-Marquardt Method

전체 카테고리

블로그 인기글

티스토리툴바

Nonlinear Least-square Problem

The First and Second-order Method

First-order Method

Newton's Method

The Gauss-Newton Method

The Levenberg-Marquardt Method

전체 카테고리

최근 글

최근댓글

블로그 인기글

티스토리툴바