from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import Range1d
output_notebook()

Gradient Descent¶

See Why Momentum Really Works for a beautiful explanation of gradient descent and momentum. These notes are just my interpretation of the first section of that paper. Later sections come later....

We are given a smooth function $f:\mathbf{R}^{N}\to \mathbf{R}$ and our objective is to find a minimum of this function. The strategy is to exploit this theorem from multivariate Calculus.

Theorem: The gradient $\nabla f$ is a vector field that, at each point $x\in\mathbf{R}^{N}$, points in the direction of most rapid increase in $f$; and $-\nabla f$ points in the direction of most rapid decrease.

Our strategy for gradient descent is:

Gradient Descent: Let $\alpha$ be a positive real number, and choose $w_{0}\in\mathbf{R}^{N}$ "at random" as a starting place. Iteratively define $$ w_{n+1} = w_{n} - \alpha\nabla f(w_{n}). $$ Then the sequence $w_{n}$ converges to a (local) minimum of $f$.

def f(x):
    return x**2
def Df(x):
    return 2*x
alpha=.1
N=10
x=np.zeros(N)
x[0]=3
for i in range(1,N):
    x[i] = x[i-1]-alpha*Df(x[i-1])

f = figure(title='Gradient Descent Illustration f(x)=x**2',width=300,height=300)
f.xaxis.axis_label='Step Number'
f.yaxis.axis_label='Value of x_n'
f.y_range=Range1d(0,3)
f.line(range(N),x)
show(f)

Consider $f(x)=x^2$. We can make this very explicit in this case:

the iteration is $x_{n+1}=x_{n}-2\alpha x_{n} = (1-2\alpha)x_{n}$.
so $x_{n}=(1-2\alpha)^{n}x_{0}$ As long as the learning rate satisfies $0\le \alpha\le .5$ this converges exponentially to zero.

Following the article cited above, let's look at the problem of finding the minimum of a quadratic function. This is a common situation in the first place -- think about least squares regression, for example -- but in the neighborhood of a critical point a function is approximately quadratic so the quadratic case sheds light on the general situation as well.

Let $A$ be a symmetric invertible $N\times N$ matrix and let $b$ be a vector in $\mathbf{R}^{N}$. Define $$ f(w) = \frac{1}{2} w^{\intercal}Aw -b^{\intercal}w. $$ This is exactly the type of function that appears in a least squares regression problem. More precisely, suppose we have data points $$ (y_1,x_{11},x_{12},\ldots, x_{1k}), (y_2, x_{21},\ldots, x_{2k}), \ldots, (y_{N},x_{N1},\ldots, x_{Nk}) $$ and we want to fit a (multi-)linear function to this data to estimate $y$ as a function of the $x_i$.

Assemble the data $x_{ij}$ into a matrix $X$ and the points $y_i$ into a vector $Y$. Our goal is to find a $k\times 1$ vector $w$ so that $$ E=\|Y-Xw\|^2 $$ is minimal. But $$\begin{aligned} E(w) &= (Y^{\intercal}-w^{\intercal}X^{\intercal})(Y-Xw) \\ &= Y^{\intercal}Y - Y^{\intercal}Xw - w^{\intercal}X^{\intercal}Y + w^{\intercal}X^{\intercal}Xw\\ \end{aligned} $$

Each term in this sum is a $1\times 1$ matrix and:

$Y^{\intercal}Y$ is a constant
$Y^{\intercal}Xw = w^{\intercal}X^{\intercal}Y$ since the transpose of a $1x1$ matrix is the matrix itself.

As a minimization problem, we can forget about the constant, set $b=X^{\intercal}Y$, and minimize the function $$ f(w) = -b^{\intercal}w + \frac{1}{2}w^{\intercal}Aw $$ where $A=X^{\intercal}X$.

Proposition: The gradient $\nabla f$ of the function $f(w)$ is $\nabla f = Aw-b$.

Proof: First, $$ \frac{\partial}{\partial w_{i}} b^{\intercal}w = b_{i} $$ so assembling these we see that the gradient of the first term is indeed $-b$.

We will skip the computation for the second part which is straightforward if you keep your summations well-organized.

Corollary: The minimum occurs at $w_*=A^{-1}b$.

The gradient descent iteration then becomes $$ w_{n+1} = w_{n} - \alpha( Aw_{n} -b). $$

From linear algebra, we can find $Q$ so that $A=Q\Lambda Q^{\intercal}$ where $\Lambda$ is the diagonal matrix of eigenvalues that we assume ordered from smallest $\lambda_1$ to largest $\lambda_n$. We set $x=Q^{\intercal}(w-w_*)$, then the coordinates $x_i$ satisfy:

the center of our coordinate system is at the minimum $A^{-1}b$, and
the coordinate axes are the eigenvectors of the matrix $A$.

Also, $Q$ is an orthogonal matrix with $QQ^{T}=I$.

With this change of coordinates, the gradient descent process "diagonalizes" and becomes $$ x_{n+1} = (1-\alpha\Lambda)x_{n} = (1-\alpha\Lambda)^{n}x_{0} $$ and $$ f(w_{n})-f(w_{*}) = \frac{1}{2}x_{n}^{\intercal}\Lambda x_{n}. $$

The thing to observe here is that the different components converge at different rates, corresponding to the different $(1-\alpha\lambda_{i}).$ and (dropping the factor of $1/2$) the error behaves like

$$ (1-\alpha\Lambda)^{2n}(x_{0}^{\intercal}\Lambda x_{0}). $$

The norm of the error is

$$ E_{k} = \sum_{i=1}^{N} (1-\alpha\lambda_{i})^{2k}\lambda_{i}(x_{0}^{(i)})^2 $$

where $x_{0}^{(i)}$ is the $i^{th}$ component of the initial guess in the $x$-coordinates.

An example¶

Let's look at an example. For simplicity, we will start with a diagonal matrix. The argument above shows that this captures the general case. For concreteness, let's have eigenvalues $[.2,2,5]$. The function we are trying to minimze by gradient descent is just the quadratic form $f(x_1,x_2,x_3) = .2x_1^2+2x_2^2+5x_3^2$.

If we choose a step size/learning rate of $\alpha$, then the iteration we are interested in is $$ x_i^{(n)} = (1-\alpha\lambda_i)^n x_{i}^{0} $$ and the error -- which in this case is just norm of the value of the function -- at the nth stage is $$ E_n = \sum_{i=1}^{3}(1-\alpha\lambda_i)^{2n}\lambda_i (x_i^{0})^2 $$ where $x_i^{(0)}$ is our initial guess -- let's use $(1,1,1)$.

def generate_data(alpha, steps):
    colors=['blue','green','black']
    f=figure(width=400,height=400)
    mark=False
    Nsave=np.inf
    f.y_range=Range1d(0,1)
    data = np.zeros(shape=(3,steps))
    data[:,0] = np.array([1,1,1])
    Lambda = np.diag([.2,2,5])
    E = np.zeros(shape=(1,steps))
    E[:,0] = np.dot(np.dot(data[:,0],Lambda),data[:,0])
    Z = np.identity(3) - alpha*Lambda
    for i in range(1,steps):
        data[:,i] = np.dot(Z, data[:,i-1])
        E[:,i] = np.dot(np.dot(data[:,i],Lambda),data[:,i])
        if np.abs(E[:,i])<.01 and mark==False:
            Nsave = i
            mark=True
    for j in range(3):
        f.line(range(steps),data[j,:],color=colors[j],legend_label='component {}'.format(j))
    f.line(range(steps),E[0,:],color='red',legend_label='Error',line_width=3,line_dash='dotted')
    f.title.text = 'Alpha = {}, Steps to convergence = {}'.format(alpha,Nsave)
    f.legend.background_fill_alpha=0
    f.legend.location='center_right'
    return f

show(generate_data(.01,1000))

Our best choice of $\alpha$ has to make $(1-\alpha\lambda_i)^n$ go to zero as fast as possible. So certainly we need $|(1-\alpha\lambda_i)|<1$ or $0<\alpha\lambda_i<2$.

The problem becomes: Choose $\alpha$ to simultaneously make all of $|(1-\alpha\lambda_{i})|$ as small as possible. Since $0<\alpha\lambda_n<2$ we have $0<\alpha<2/\lambda_n$.

x=np.linspace(0,2/5,100)
e1 = np.abs(1-.2*x)
e2 = np.abs(1 -2*x)
e3 = np.abs(1 - 5*x)
f=figure(width=400,height=400)
f.line(x,e1,legend_label='lambda=.2',color='blue')
f.line(x,e2,legend_label='lambda=2',color='green')
f.line(x,e3,legend_label='lambda=5',color='black')
f.legend.background_fill_alpha=0
f.legend.location='bottom_left'
f.scatter(x=[2/5.2],y=[np.abs(1-(2/5.2)*5)],size=10,color='red')
f.title.text="Red Dot indicates optimum"
f.xaxis.axis_label='alpha'
f.yaxis.axis_label='1-alpha*lambda'
show(f)

At the extremes, we have $1-\alpha\lambda_1$ and $1-\alpha\lambda_n$.

Our compromise is to make $1-\alpha\lambda_1 = -(1-\alpha\lambda_n)$. This makes the two extremes go to zero at the same rate, and everything in the middle go to zero at least as fast. This means that the optimum $\alpha$ is

$$ \alpha_* = \frac{2}{(\lambda_1+\lambda_n)}. $$

show(generate_data(2/(.2+5),100))

The ratio $\kappa=\lambda_n/\lambda_1$ is called the condition number of the matrix that we are studying. The optimal learning rate is $\alpha = 2/(\lambda_1+\lambda_n)$ and the associated (multiplier) rate is: $$ (1-\alpha\lambda_1) = (1-2\lambda_1/(\lambda_1+\lambda_n)) = (\lambda_n-\lambda_1)/(\lambda_n+\lambda_1) = (\kappa-1)/(\kappa+1) $$

If $\kappa$ is large, this is close to $1$ and convergence is slow. If $\kappa$ is just a bit larger than $1$, then this is small and convergence is fast.