1 Gradient Descent

1.1 Motivation

In Linear Regression, we studied how to minimize the error function \[ E = \sum_{j=1}^{N} (y_j-\sum_{s=1}^{k+1} x_{js}m_{s})^2 \] and obtained an exact solution. For many models we will encounter later, no such closed-form solution is available, and we must approximate a minimizer numerically. One of the most common methods for this purpose is gradient descent.

1.2 Basic Algorithm

Consider a function \(E: \mathbb R^n \rightarrow \mathbb R\), \({\boldsymbol{w}}=(w_1, w_2, \dots , w_n) \rightarrow E({\boldsymbol{w}})\). The gradient \(\nabla E\) of \(E\) is defined by \[ \nabla E := \left ( \frac {\partial E}{\partial w_1}, \frac {\partial E}{\partial w_2}, \dots , \frac {\partial E}{\partial w_n} \right ). \]

Proposition: Assume that \(E\) is differentiable in a neighborhood of \({\boldsymbol{w}}\). Then \(E\) decreases fastest at \({\boldsymbol{w}}\) in the direction of \(-\nabla E ({\boldsymbol{w}})\).

Proof: For a unit vector \({\boldsymbol{u}}\), the directional derivative \(D_{\boldsymbol{u}}E\) is given by \[ D_{\boldsymbol{u}}E= \nabla E \cdot {\boldsymbol{u}}= | \nabla E | |{\boldsymbol{u}}| \cos \theta = | \nabla E | \cos \theta ,\] where \(\theta\) is the angle between \(\nabla E\) and \({\boldsymbol{u}}\). The minimum value of \(D_{\boldsymbol{u}}E\) occurs when \(\cos \theta = -1\), i.e., when \({\boldsymbol{u}}\) points in the direction of \(-\nabla E\).     \(\square\)

Set \[ {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k - \eta \nabla E({\boldsymbol{w}}_k), \] where \(\eta >0\) is the step size or learning rate. Then, for sufficiently small \(\eta\), \[ E({\boldsymbol{w}}_{k+1}) \le E({\boldsymbol{w}}_{k}).\] Under some moderate conditions, \[ E({\boldsymbol{w}}_k) \rightarrow \text{local minimum} \qquad \text{ as }\ k \rightarrow \infty .\] In particular, this is true when \(E\) is convex or when \(\nabla E\) is Lipschitz continuous, provided \(\eta\) is chosen appropriately.
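The update rule above can be sketched in a few lines of code. The function names `grad_E` and `gradient_descent` are illustrative, not from the text, and the toy objective \(E(w_1,w_2)=(w_1-3)^2+(w_2+1)^2\) is chosen here only because its minimizer \((3,-1)\) is obvious.

```python
# Minimal sketch of the update w_{k+1} = w_k - eta * grad E(w_k).
# Toy example: E(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1).

def grad_E(w):
    """Gradient of E(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

def gradient_descent(grad, w0, eta=0.1, steps=100):
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w = gradient_descent(grad_E, [0.0, 0.0])
print(w)  # approaches the minimizer (3, -1)
```

Note that the loop never inspects \(E\) itself, only its gradient; choosing `eta` too large can make the iterates overshoot and diverge.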

1.2.1 Example

Consider \(E({\boldsymbol{w}})=E(w_1,w_2)= w_1^4+w_2^4-16w_1w_2\). Then \(\nabla E({\boldsymbol{w}})= [ 4w_1^3-16w_2, 4w_2^3-16w_1]\). Choose \({\boldsymbol{w}}_0=(1,1)\) and \(\eta =0.01\). Then \[ {\boldsymbol{w}}_{30}=(1.99995558586289, 1.99995558586289)\] and we get \[ E({\boldsymbol{w}}_{30})= -31.9999999368777. \]

We see that \({\boldsymbol{w}}_k \rightarrow (2,2)\) and \(E(2,2)=-32\). Indeed, using multi-variable calculus, one can verify that \(E\) attains a local minimum value of \(-32\) at \({\boldsymbol{w}}=(2,2)\).
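The iteration in this example is simple enough to reproduce directly; the following sketch runs the 30 steps with the stated \({\boldsymbol{w}}_0=(1,1)\) and \(\eta=0.01\).

```python
# Reproducing the example: E(w) = w1^4 + w2^4 - 16*w1*w2,
# starting at w0 = (1, 1) with learning rate eta = 0.01.

def E(w):
    return w[0] ** 4 + w[1] ** 4 - 16 * w[0] * w[1]

def grad_E(w):
    return [4 * w[0] ** 3 - 16 * w[1], 4 * w[1] ** 3 - 16 * w[0]]

eta = 0.01
w = [1.0, 1.0]
for _ in range(30):
    g = grad_E(w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]

print(w)     # close to (2, 2)
print(E(w))  # close to -32
```

By the symmetry of \(E\) and of the starting point, both coordinates remain equal throughout the iteration.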

Exercise: Using multi-variable calculus, find all the local minima of \(E({\boldsymbol{w}})\).

Figure 1: Graph of \(E({\boldsymbol{w}})\)
Figure 2: Values of \(E({\boldsymbol{w}}_k)\)

1.2.2 Example: Linear Regression revisited

We will apply gradient descent to linear regression in the lab session.

Exercise: Define \(\sigma(x) = \dfrac {e^x}{e^x+1}= \dfrac 1 {1+e^{-x}}\). In Logistic Regression we will minimize the following error function \[ E ({\boldsymbol{w}}) = - \sum_{n=1}^N \{ t_n \ln y_n + (1-t_n) \ln (1-y_n)\}, \] where we write \({\boldsymbol{w}}=(w_1, w_2, \dots , w_{k+1})\) and \(y_n=\sigma(w_1 x_{n1}+ w_2 x_{n2} + \cdots + w_k x_{nk}+w_{k+1} )\). Compute the gradient \(\nabla E({\boldsymbol{w}})\).
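After working out the exercise by hand, you can check your formula numerically with central finite differences, without the check revealing the answer. The data `X`, `t` and the weights `w` below are made-up toy values, not from the text.

```python
# Finite-difference check for a hand-computed gradient of the
# logistic-regression error E(w). Toy data: N = 3 points, k = 2 features.

import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

X = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]   # x_{n1}, x_{n2}
t = [1, 0, 1]                                 # targets t_n

def y(w, x):
    # y_n = sigma(w1*x_n1 + ... + wk*x_nk + w_{k+1}), bias weight last
    return sigma(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

def E(w):
    return -sum(tn * math.log(y(w, x)) + (1 - tn) * math.log(1 - y(w, x))
                for x, tn in zip(X, t))

def numerical_grad(E, w, h=1e-6):
    # dE/dw_i  ~  (E(w + h*e_i) - E(w - h*e_i)) / (2h)
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        g.append((E(wp) - E(wm)) / (2 * h))
    return g

w = [0.1, -0.2, 0.3]                          # (w1, w2, w_{k+1})
print(numerical_grad(E, w))                   # compare with your formula
```

If your hand-derived \(\nabla E({\boldsymbol{w}})\) agrees with the printed values to several decimal places, it is almost certainly correct.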

1.3 Newton’s Method

Let us first consider the single-variable case. To minimize \(E(w)\), we look for a critical point, i.e., a solution of \(E'(w)=0\), and apply Newton's root-finding method to \(E'\): \[ w_{k+1} = w_k - \frac{E'(w_k)}{E''(w_k)}. \]

Now we consider the multi-variable case. The second derivative is replaced by the Hessian matrix \(\mathbf H E({\boldsymbol{w}}) = \left[ \dfrac{\partial^2 E}{\partial w_i \partial w_j} \right]\), and the iteration becomes \[ {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k - \mathbf H E({\boldsymbol{w}}_k)^{-1} \nabla E({\boldsymbol{w}}_k). \]

1.3.1 Example

For the function \(E({\boldsymbol{w}})= w_1^4+w_2^4-16w_1w_2\) from the example above, we have \[\begin{align*} \mathbf H E({\boldsymbol{w}}) &= \begin{bmatrix} 12 w_1^2 & -16 \\ -16 & 12 w_2^2 \end{bmatrix} \\ \\ \mathbf H E({\boldsymbol{w}})^{-1} &= \frac 1 {9w_1^2 w_2^2 -16} \begin{bmatrix} \tfrac 3 4 w_2^2 & 1 \\ 1& \tfrac 3 4 w_1^2 \end{bmatrix} \\ \\ \mathbf H E^{-1} \nabla E & = \frac 1 {9w_1^2 w_2^2 -16} \begin{bmatrix} 3 w_1^3 w_2^2 -8 w_2^3 -16w_1 \\ 3 w_1^2 w_2^3 -8 w_1^3 -16w_2 \end{bmatrix} \end{align*}\]
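Newton's iteration for this example can be sketched using the closed-form product \(\mathbf H E^{-1} \nabla E\) computed above. The starting point \((1,1)\) and the stopping tolerance are assumptions, not from the text; note also that the step is undefined where \(9w_1^2w_2^2 = 16\), so a real implementation would guard against that.

```python
# Newton's iteration w_{k+1} = w_k - H E(w_k)^{-1} grad E(w_k)
# for E(w) = w1^4 + w2^4 - 16*w1*w2, via the closed-form product above.

def grad_E(w):
    return [4 * w[0] ** 3 - 16 * w[1], 4 * w[1] ** 3 - 16 * w[0]]

def newton_step(w):
    w1, w2 = w
    d = 9 * w1 ** 2 * w2 ** 2 - 16          # denominator 9 w1^2 w2^2 - 16
    s1 = (3 * w1 ** 3 * w2 ** 2 - 8 * w2 ** 3 - 16 * w1) / d
    s2 = (3 * w1 ** 2 * w2 ** 3 - 8 * w1 ** 3 - 16 * w2) / d
    return [w1 - s1, w2 - s2]

w = [1.0, 1.0]                              # assumed starting point
for _ in range(20):
    if max(abs(g) for g in grad_E(w)) < 1e-12:
        break
    w = newton_step(w)

print(w)  # a critical point of E with E(w) = -32
```

From \((1,1)\) the first step lands exactly on \((-2,-2)\), which is also a local minimizer of \(E\) with value \(-32\); Newton's method converges to whichever critical point is nearby, not necessarily the one gradient descent found.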

Figure 3: Values of \(E({\boldsymbol{w}}_k)\)

1.3.2 Stochastic Gradient Descent (SGD)

Typically in Machine Learning, the function \(E({\boldsymbol{w}})\) is given by a sum of the form \[ E({\boldsymbol{w}}) = \frac 1 N \sum_{n=1}^N E_n({\boldsymbol{w}}), \] where \(N\) is the number of elements in the training set. When \(N\) is large, computation of the gradient \(\nabla E\) may be expensive.

SGD selects a random sample (a mini-batch) from the training set at each iteration instead of using the whole training set, and uses the estimate \[ \frac 1 M \sum_{i=1}^M \nabla E_{n_i}({\boldsymbol{w}}) ,\] where \(M\) is the size of the sample and \(\{ n_1, n_2 , \dots , n_M \} \subset \{ 1,2, \dots , N \}\). SGD is commonly used in many Machine Learning algorithms.
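The mini-batch estimate can be sketched as follows. Everything here is an assumption for illustration: the per-example error is taken to be the squared error \(E_n = (y_n - (wx_n + b))^2\) of a one-dimensional linear model, and the data are synthetic points near \(y = 3x + 1\).

```python
# Minimal SGD sketch: at each iteration, average grad E_n over a random
# mini-batch of size M instead of over the whole training set.

import random

random.seed(0)

# Synthetic training set of N points roughly following y = 3x + 1.
N = 200
data = [(x, 3 * x + 1 + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(N))]

def grad_En(w, b, x, y):
    # Gradient of E_n(w, b) = (y - (w*x + b))^2 with respect to (w, b).
    r = (w * x + b) - y
    return 2 * r * x, 2 * r

w, b = 0.0, 0.0
eta, M = 0.1, 8                             # learning rate, mini-batch size
for _ in range(500):
    batch = random.sample(data, M)          # the sample {n_1, ..., n_M}
    gw = sum(grad_En(w, b, x, y)[0] for x, y in batch) / M
    gb = sum(grad_En(w, b, x, y)[1] for x, y in batch) / M
    w, b = w - eta * gw, b - eta * gb

print(w, b)  # near (3, 1)
```

Each iteration touches only \(M\) of the \(N\) training points, which is the whole appeal of SGD when \(N\) is large; the price is that the iterates fluctuate around the minimizer instead of settling exactly on it.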