1 Gradient Descent

1.1 Motivation

In Linear Regression, we studied how to minimize the error function \[ E = \sum_{j=1}^{N} (y_j-\sum_{s=1}^{k+1} x_{js}m_{s})^2 \] and obtained an exact solution. In other cases we will encounter later, such an exact solution is not feasible and we will have to use a method to approximate an exact solution. One of the most common methods for this purpose is gradient descent.

1.2 Basic Algorithm

Consider a function \(E: \mathbb R^n \rightarrow \mathbb R\), \({\boldsymbol{w}}=(w_1, w_2, \dots , w_n) \rightarrow E({\boldsymbol{w}})\). The gradient \(\nabla E\) of \(E\) is defined by \[ \nabla E := \left ( \frac {\partial E}{\partial w_1}, \frac {\partial E}{\partial w_2}, \dots , \frac {\partial E}{\partial w_n} \right ). \]

Proposition : Assume that \(E({\boldsymbol{w}})\) is differentiable in a neighborhood of \({\boldsymbol{w}}\). Then the function \(E({\boldsymbol{w}})\) decreases fastest in the direction of \(-\nabla E ({\boldsymbol{w}})\).

Proof: For a unit vector \({\boldsymbol{u}}\), the directional derivative \(D_{\boldsymbol{u}}E\) is given by \[ D_{\boldsymbol{u}}E= \nabla E \cdot {\boldsymbol{u}}= | \nabla E | |{\boldsymbol{u}}| \cos \theta = | \nabla E | \cos \theta ,\] where \(\theta\) is the angle between \(\nabla E\) and \({\boldsymbol{u}}\). The minimum value of \(D_{\boldsymbol{u}}E\) occurs when \(\cos \theta\) is \(-1\). \(\square\)

Set \[ {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k - \eta \nabla E({\boldsymbol{w}}_k) \] where \(\eta >0\) is the step size or learning rate. Then \[ E({\boldsymbol{w}}_{k+1}) \le E({\boldsymbol{w}}_{k}).\] Under some moderate conditions, \[ E({\boldsymbol{w}}_k) \rightarrow \text{local minimum} \qquad \text{ as }\ k \rightarrow \infty .\]In particular, this is true when \(E\) is convex or when \(\nabla E\) is Lipschitz continuous.

1.2.1 Example

Consider \(E({\boldsymbol{w}})=E(w_1,w_2)= w_1^4+w_2^4-16w_1w_2\). Then \(\nabla E({\boldsymbol{w}})= [ 4w_1^3-16w_2, 4w_2^3-16w_1]\). Choose \({\boldsymbol{w}}_0=(1,1)\) and \(\eta =0.01\). Then \[ {\boldsymbol{w}}_{30}=(1.99995558586289, 1.99995558586289)\] and we get \[ E({\boldsymbol{w}}_{30})= -31.9999999368777. \]

We see that \({\boldsymbol{w}}_k \rightarrow (2,2)\) and \(E(2,2)=-32\). Indeed, using multi-variable calculus, one can verify that when \({\boldsymbol{w}}=(2,2)\), a local minimum of \(E({\boldsymbol{w}})\) is \(-32\).

Exercise: Using multi-variable calculus, find all the local minima of \(E({\boldsymbol{w}})\).

Figure 1: Graph of \(E({\boldsymbol{w}})\)

Figure 2: Values of \(E({\boldsymbol{w}}_k)\)

1.2.2 Example: Linear Regression revisited

We will apply gradient descent to linear regression in the lab session.

Exercise: Define \(\sigma(x) = \dfrac {e^x}{e^x+1}= \dfrac 1 {1+e^{-x}}\). In Logistic Regression we will minimize the following error function \[ E ({\boldsymbol{w}}) = - \sum_{n=1}^N \{ t_n \ln y_n + (1-t_n) \ln (1-y_n)\}, \] where we write \({\boldsymbol{w}}=(w_1, w_2, \dots , w_{k+1})\) and \(y_n=\sigma(w_1 x_{n1}+ w_2 x_{n2} + \cdots + w_k x_{nk}+w_{k+1} )\). Compute the gradient \(\nabla E({\boldsymbol{w}})\).

1.3 Newton’s Method

Let us first consider the single-variable case.

Let \(f: \mathbb R \longrightarrow \mathbb R\) be a single-variable (convex, differentiable) function.
To find a local minimum \(\Longleftrightarrow\) To find \(x^*\) such that \(f'(x^*)=0\)
Make a guess \(x_0\) for \(x^*\) and set \(x=x_0+h\).
Using the Taylor expansion, we have \[\begin{align*} f(x) &=f(x_0+h) \approx f(x_0) + f'(x_0) h+ \tfrac 1 2 f''(x_0) h^2 \\ f'(x) = \frac{df}{dh}\frac {dh}{dx} & \approx \frac d {dh} \left ( f(x_0) + f'(x_0) h+ \tfrac 1 2 f''(x_0) h^2 \right ) \\ &= f'(x_0) + f''(x_0)h \end{align*}\] If \(0 = f'(x_0) + f''(x_0)h\), we obtain \[ h = - f'(x_0)/f''(x_0). \]
We have shown that \[ x_1=x_0- f'(x_0)/f''(x_0) \] is an approximation of \(x^*\).
Repeat the process to obtain \[ \boxed{x_{k+1}=x_k- f'(x_k)/f''(x_k)}, \] and \(x_k \rightarrow x^*\) as \(k \rightarrow \infty\).

Now we consider the multi-variable case.

Let \(E({\boldsymbol{w}})\) be a multi-variable function. The Hessian matrix of \(E\) is defined by \[ \mathbf H E= \begin{bmatrix} \tfrac {\partial^2 E}{\partial w_1^2} & \tfrac {\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \tfrac {\partial^2 E}{\partial w_1 \partial w_n} \\ \tfrac {\partial^2 E}{\partial w_2 \partial w_1} & \tfrac {\partial^2 E}{\partial w_2^2} & \cdots & \tfrac {\partial^2 E}{\partial w_2 \partial w_n} \\ \vdots & \vdots & \ddots & \vdots \\ \tfrac {\partial^2 E}{\partial w_n \partial w_1} & \tfrac {\partial^2 E}{\partial w_n \partial w_2} & \cdots & \tfrac {\partial^2 E}{\partial w_n^2} \end{bmatrix}. \] That is, \(\mathbf H E= [ \tfrac {\partial^2 E}{\partial w_i \partial w_j}]\).
Generalizing the single-variable case, we obtain
\[ \boxed{ {\boldsymbol{w}}_{k+1}= {\boldsymbol{w}}_k - \mathbf H E ({\boldsymbol{w}}_{k})^{-1} \nabla E({\boldsymbol{w}}_k)} . \]
Using a step size \(\eta\), the formula may be modified to be \[ \boxed{ {\boldsymbol{w}}_{k+1}= {\boldsymbol{w}}_k - \eta \mathbf H E ({\boldsymbol{w}}_{k})^{-1} \nabla E({\boldsymbol{w}}_k)} . \]
Newton’s method is much faster than Gradient Descent. However, it may be expensive to compute the inverse of the Hessian matrix.

1.3.1 Example

Consider \(E({\boldsymbol{w}})=E(w_1,w_2)= w_1^4+w_2^4-16w_1w_2\). Then \(\nabla E({\boldsymbol{w}})= [ 4w_1^3-16w_2, 4w_2^3-16w_1]^\top\).

\[\begin{align*} \mathbf H E({\boldsymbol{w}}) &= \begin{bmatrix} 12 w_1^2 & -16 \\ -16 & 12 w_2^2 \end{bmatrix} \\ \\ \mathbf H E({\boldsymbol{w}})^{-1} &= \frac 1 {9w_1^2 w_2^2 -16} \begin{bmatrix} \tfrac 3 4 w_2^2 & 1 \\ 1& \frac 3 4 w_1^2 \end{bmatrix} \\ \\ \mathbf H E^{-1} \nabla E & = \frac 1 {9w_1^2 w_2^2 -16} \begin{bmatrix} 3 w_1^3 w_2^2 -8 w_2^3 -16w_1 \\ 3 w_1^2 w_2^3 -8 w_1^3 -16w_2 \end{bmatrix} \end{align*}\]

Choose \({\boldsymbol{w}}_0=(1.2,1.2)\) and \(\eta =1\). Then \({\boldsymbol{w}}_{9}=(2.00000004189571, 2.00000004189571)\), \(E({\boldsymbol{w}}_{9})= -31.9999999999999\).

Figure 3: Values of \(E({\boldsymbol{w}}_k)\)

1.3.2 Stochastic Gradient Descent (SGD)

Typically in Machine Learning, the function \(E({\boldsymbol{w}})\) is given by a sum of the form \[ E({\boldsymbol{w}}) = \frac 1 N \sum_{n=1}^N E_n({\boldsymbol{w}}), \] where \(N\) is the number of elements in the training set. When \(N\) is large, computation of the gradient \(\nabla E\) may be expensive.

The SGD selects a sample from the training set in each iteration step instead of using the whole batch of the training set, and use \[ \frac 1 M \sum_{i=1}^M \nabla E_{n_i}({\boldsymbol{w}}) ,\] where \(M\) is the size of the sample and \(\{ n_1, n_2 , \dots , n_M \} \subset \{ 1,2, \dots , N \}\). The SGD is commonly used in many Machine Learning algorithms.