Bayesian Methods

Bernoulli Variables (Coin tossing)

Bayesian methods have a rich philosophical underpinning that I am not qualified to delve into. As I understand it, Bayesian methods view all statements about probabilities as conditional, and probability itself as a kind of personal perspective on the likelihood of events that gets updated in the face of new evidence.

The theory rests on Bayes's theorem about conditional probability, which says that if $X$ and $Y$ are random variables, then

$$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}. $$

When applied to experiments, this formula gets interpreted as follows. Let's suppose that we are attempting to determine whether a particular coin that we've found is fair (equal probability of heads and tails). We decide to conduct an experiment to find out. The ingredients of our analysis are:

  • Our prior sense of the chance that the coin yields heads. For example, if we have no opinion whatsoever about the fairness of the coin, then our prior would be the uniform distribution on the interval $[0,1]$. However, if we have reason to think the coin is fair, then our prior distribution might be "bumped up" at $1/2$, but still allow for some chance of unfairness.
  • Some data obtained from an experiment. For example, we flip the coin 100 times and obtain 44 heads and 56 tails.

What we'd like to know is:

  • Given our prior hypothesis and the data, how should we update our sense of the distribution of the probability of a head?

The underlying parameter we are studying is $\theta$, the probability of getting heads when we flip the coin. Let's start by assuming that we have no opinion about $\theta$, so our prior distribution is uniform on $[0,1]$. For our experiment, we have $44$ heads and $56$ tails. Let $X$ denote the outcome of our experiment, so that

$$ P(X|\theta) = \binom{100}{44}\theta^{44}(1-\theta)^{56}. $$

Our prior $P(\theta)$ is the uniform distribution on $[0,1]$, so $P(\theta)=1$ and

$$ P(X=(44,56),\theta) = P(X|\theta)P(\theta) = \binom{100}{44}\theta^{44}(1-\theta)^{56} $$

Finally, the probability that $X=(44,56)$ is

$$ P(X=(44,56)) = \int_{\theta} \binom{100}{44}\theta^{44}(1-\theta)^{56} d\theta. $$

From this, we get $$ P(\theta|X) = \frac{\theta^{44}(1-\theta)^{56}}{\int_{\theta} \theta^{44}(1-\theta)^{56}\, d\theta}. $$

Remark: The Beta function $B(r,s)$ is the special function $$ B(r,s) = \int_{0}^{1} x^{r-1}(1-x)^{s-1}\, dx $$ so $$ P(\theta|X) = \frac{1}{B(45,57)}\theta^{44}(1-\theta)^{56}. $$

The probability distribution $P(\theta|X)$ is called the posterior distribution and it represents our opinion about the probability of heads in light of the evidence.
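As a quick sanity check, we can confirm numerically that the normalizing integral really is $B(45,57)$ (a minimal sketch using the trapezoidal rule; the grid size is arbitrary):

In [ ]:
# numerical check: the normalizing integral should equal B(45,57)
import numpy as np
from scipy.special import beta

t = np.linspace(0, 1, 100001)
print(np.trapz(t**44 * (1 - t)**56, t))   # numerical value of the integral
print(beta(45, 57))                        # Beta function value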

In [1]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()   # direct bokeh output to the notebook
In [2]:
from scipy.special import beta

# posterior density theta^44 (1-theta)^56 / B(45,57) on a grid
x = np.linspace(.001, .999, 100)
y = x**44 * (1 - x)**56 / beta(45, 57)

f = figure(title='Posterior distribution')
f.line(x=x, y=y)
show(f)

As this plot shows, the posterior distribution has a strong peak. A little calculus shows that the maximum of the function $$\theta^{44}(1-\theta)^{56}$$ occurs at $\theta_{0}=44/100$. This is called the maximum a posteriori estimate or MAP. The other interesting quantity is the mean of the posterior distribution, which is $B(46,57)/B(45,57)=45/102\approx .4412$.
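Both of these values are easy to check numerically; here is a minimal sketch (the grid resolution is an arbitrary choice):

In [ ]:
# numerical check of the MAP and the posterior mean
import numpy as np
from scipy.special import beta

theta = np.linspace(.001, .999, 99901)
post = theta**44 * (1 - theta)**56 / beta(45, 57)
print(theta[np.argmax(post)])         # MAP: approximately 44/100
print(beta(46, 57) / beta(45, 57))    # posterior mean: 45/102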

One more quantity we can look at is to ask: based on what we now know, what do we expect is the probability of getting heads from our coin? This is the expected value of $\theta$ under the posterior distribution: $$ P(H|X) = \int_{\theta}\theta\, P(\theta|X)\, d\theta \approx .4412, $$ which happens to be the posterior mean in this case.

Bayesian Linear Regression

Now we imagine that we have a collection of data points $D=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ and we propose that the underlying relationship is linear: $$ y = mx+b+\epsilon $$ where $\epsilon$ is a normally distributed random variable with mean $0$ and variance $\sigma^2$. For simplicity, let's assume that $\sigma^2$ is known and fixed in advance, and our problem is to estimate $m$ and $b$.

To apply Bayes's theorem, we need to compute the probability of the data given the parameters. We assume that the $y_{i}$ are drawn independently so that the overall probability is the product of the individual ones. Essentially $y_{i}$ is drawn from a normal distribution with mean $mx_{i}+b$ and variance $\sigma^{2}$ $$ P(y_{i}|m,b) = N(y_{i}|mx_{i}+b,\sigma^2) $$ and $$ P(D|m,b) = \prod_{i} N(y_{i}|mx_{i}+b,\sigma^{2}) $$

Note: $N(y|x,\sigma^{2})$ is the value of the normal distribution with mean $x$ and variance $\sigma^{2}$ at the point $y$.
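In scipy terms this is just the normal pdf evaluated at a point; a one-line illustration (the numbers here are arbitrary):

In [ ]:
from scipy.stats import norm

# N(y | x, sigma^2) is norm.pdf(y, loc=x, scale=sigma);
# note that scale is the standard deviation sigma, not the variance
print(norm.pdf(1.2, loc=1.0, scale=0.3))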

We need to propose prior distributions on $m$ and $b$. Following Bishop page 153, we put normal priors of mean zero and variance $\alpha^{2}$ on both $m$ and $b$.

Then

$$ P(m,b|D) = \frac{P(D|m,b)N(m|0,\alpha^2)N(b|0,\alpha^2)}{P(D)} $$

The denominator of this fraction is an integral over $m,b$ space with the data fixed, and is necessary to normalize the right side of this equation. As it happens it is possible, with work, to compute this all explicitly and Bishop explains how. In general, however, the denominator on the right side of a Bayes expression like the one above is intractable and so another approach is needed. This is where Monte Carlo methods come in.

Monte Carlo

The idea behind MC methods is that rather than explicitly computing the posterior distribution $P(m,b|D)$ as in the formula above, we will use random techniques to generate a large number of samples from it. A histogram of those samples will give us an approximation to the probability distribution and enable us to estimate the MAP and the posterior mean, as well as other parameters.

One way to do this is via Markov Chain Monte Carlo. Suppose we have a probability distribution $P$ and we want to draw samples from it. We construct a Markov chain $x_0,x_1,\ldots$ in such a way that the stationary distribution of the Markov chain is $P$. Then we run the Markov chain for a long time until we think the dynamics are close to the stationary distribution, and we consider the steps of the chain to be samples from the distribution. Note that these are NOT really independent samples, but they are (hopefully) good enough.

One big advantage of this approach is that it turns out that all we need to construct this Markov chain is the numerator of the fraction on the right hand side of the expression for $P(m,b|D)$. That's because the denominator is just a normalizing constant.

A great deal of work has been done on this problem, but the classic algorithm for Monte Carlo sampling is called Metropolis-Hastings. It was originally developed as part of the work on the atomic bomb.

Metropolis Hastings

Suppose we have a function $f(x)$ that is proportional to the probability distribution we are interested in. For example, in our coin-flipping example, $f(x)=x^{44}(1-x)^{56}$. Carry out the following sequence:

  1. Pick an arbitrary starting point $x[0]$ and set $i=0$.
  2. Choose a proposal $y$ from a gaussian distribution centered at $x[i]$. (Other choices are possible; this is an example.)
  3. Compute $\alpha=f(y)/f(x[i])$. This is called the acceptance ratio.
  4. Pick a random number $u$ uniformly between $0$ and $1$. If $u\le \alpha$, set $x[i+1]=y$. Otherwise set $x[i+1]=x[i]$.
  5. Increment $i$ and go back to step 2.

Claim: This Markov chain has stationary distribution equal to the normalized version of $f(x)$.

In [3]:
from scipy.stats import beta,norm
import holoviews as hv
hv.extension('bokeh')
In [4]:
# Metropolis-Hastings test: sample from the Beta(3,4) distribution
x = np.linspace(0,1,100)
rv = beta(3,4)
ys = rv.pdf(x)          # true density, for comparison
d = 50                  # number of histogram bins
N = 100000              # length of the chain

# unnormalized Beta(3,4) density: x^2 (1-x)^3
def f(x,a=2,b=3):
    if x<0 or x>1:
        return 0
    else:
        return x**a*(1-x)**b

def walker(N,f):
    walk = np.zeros(N)
    walk[0] = .1
    for i in range(N-1):
        y = np.random.normal(walk[i],.1)    # gaussian proposal centered at the current point
        alpha = f(y)/f(walk[i])             # acceptance ratio
        u = np.random.uniform(0,1)
        if u < alpha:
            walk[i+1]=y                     # accept the proposal
        else:
            walk[i+1]=walk[i]               # reject: stay where we are
    return walk[N//2:]                      # discard the first half as burn-in

walk = walker(N,f)
frequencies, edges = np.histogram(walk,d)
# scale the counts to approximate a density (bin width is roughly 1/d on [0,1])
a=hv.Histogram((edges, d*frequencies/frequencies.sum()))
b=hv.Curve((x,ys)).opts(color='red',title='Sampled vs True Beta(3,4)',width=400,height=400)

a*b
Out[4]:

In our regression example, Bayes's theorem gives the posterior: $$ P(m,b|D) = \frac{P(D|m,b)N(m|0,\alpha^2)N(b|0,\alpha^2)}{P(D)} $$ Rather than multiplying together a large number of numbers less than one and getting underflow, we will work with the logarithm. Here $L$ denotes the numerator of this expression (the unnormalized posterior):

$$ \log L(m,b|D) = \sum_{i=1}^{n} \log N(y_{i}|mx_{i}+b,\sigma^2) + \log N(b|0,\alpha^2) +\log N(m|0,\alpha^{2}) $$
In [5]:
X = np.random.uniform(0,1,200)
Y = 2*X+1+np.random.normal(0,.3,size=200)   # true line y = 2x + 1 plus gaussian noise
hv.Scatter((X,Y)).options(title='Some sample data for regression',width=400,height=400,xlim=(0,1),ylim=(0,3))
Out[5]:
In [6]:
# The log likelihood function
# normal, 0 centered priors for m and b; variance fixed at 1 for simplicity
def L(theta):
    m=theta[0]
    b=theta[1]
    return norm.logpdf(Y,m*X+b,1).sum() + norm.logpdf(b,0,1)+norm.logpdf(m,0,1)
In [7]:
# Metropolis-Hastings sampler for the two-dimensional posterior in (m,b)
def walker(N,f):
    walk = np.zeros((2,N))      # row 0 holds m, row 1 holds b
    walk[0,0]=.1
    walk[1,0]=.1
    for i in range(N-1):
        y = np.random.normal(walk[:,i],.3)      # propose both coordinates at once
        alpha = np.exp(f(y)-f(walk[:,i]))       # acceptance ratio, from the log densities
        u = np.random.uniform(0,1)
        if u < alpha:
            walk[:,i+1]=y
        else:
            walk[:,i+1]=walk[:,i]
    return walk[:,N//2:]                        # discard the first half as burn-in
In [8]:
walk=walker(10000,L)
In [9]:
hv.Scatter((walk[0,:],walk[1,:])).options(xlabel = 'm',ylabel='b',show_grid=True,xlim=(0,3),ylim=(0,3),width=400,height=400,alpha=.1,title='Scatterplot of (m,b) distribution')
Out[9]:
In [10]:
freq,edge = np.histogram(walk[1,:],20)
freq1, edge1 = np.histogram(walk[0,:],20)
hv.Histogram((edge,freq)).options(show_grid=True,xlabel='b',title='Mean ={:.2f}'.format(walk[1,:].mean()))+hv.Histogram((edge1,freq1)).options(show_grid=True,xlabel='m',title='Mean={:.2f}'.format(walk[0,:].mean()))
Out[10]:

Another look at Metropolis Hastings

Let's look at the Metropolis Hastings algorithm in a discrete situation to see why it works. Suppose we have a finite set $E=\{1,2,3,\ldots,n\}$ and a probability $p(i)$ for $1\le i\le n$. How does M-H look in this case?

For the "proposal distribution", choose the uniform distribution (instead of the gaussian we used above). So the algorithm looks like this:

  1. Start with $x[0]=1$ and $i=0$.
  2. Pick a number $y$ from $1$ to $n$ uniformly at random.
  3. Let $\alpha=p(y)/p(x[i])$.
  4. Choose a number $u$ between $0$ and $1$ uniformly. If $u<\alpha$, then $x[i+1]=y$. Otherwise $x[i+1]=x[i]$. Increment $i$.
  5. Go to step 2.

Claim: This Markov Chain has $p(i)$ as a stationary distribution.

Proof: Let $M$ be the $n\times n$ transition probability matrix, so that $M_{ij}$ is the probability of moving from point $i$ to point $j$. We show that $p(i)$ satisfies the detailed balance equations:

$$ p(i)M_{ij} =p(j) M_{ji}. $$

If this holds, we have $$ \sum_{i=1}^{n} p(i)M_{ij} = \sum_{i=1}^{n} p(j)M_{ji} = p(j)\sum_{i=1}^{n} M_{ji} = p(j) $$ since $\sum_{i=1}^{n}M_{ji}$ is the probability of moving from $j$ to somewhere, which is $1$. So the transition matrix $M$ has $p$ as a left eigenvector with eigenvalue $1$, and $p$ is a stationary distribution.

Now we compute that

$$ p(i)M_{ij} = p(i)\frac{1}{n} \min\{1,p(j)/p(i)\} = \left\{ \begin{matrix}\frac{1}{n} p(j) & p(j)\le p(i) \cr \frac{1}{n} p(i) & p(i)\le p(j)\end{matrix}\right. $$

and

$$ p(j)M_{ji} = p(j)\frac{1}{n}\min\{1,p(i)/p(j)\} = \left\{ \begin{matrix}\frac{1}{n} p(i) & p(i)\le p(j) \cr \frac{1}{n} p(j) & p(j)\le p(i)\end{matrix}\right. $$

and these are equal.

Remark: Instead of the uniform distribution, all we need is a "proposal" distribution that has the property that the chance of moving from $i$ to $j$ is the same as that of moving from $j$ to $i$. Then the calculation of the detailed balance equations still works.
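To see the claim in action, here is a small numerical sketch of the discrete chain (the target distribution $p$ and the chain length are arbitrary choices):

In [ ]:
# discrete Metropolis-Hastings with a uniform proposal; the visit
# frequencies of the chain should approach the target distribution p
import numpy as np

p = np.array([.1, .2, .3, .4])    # target distribution on {0,1,2,3}
n = len(p)
N = 200000
x = 0
counts = np.zeros(n)
for _ in range(N):
    y = np.random.randint(n)                  # uniform proposal
    if np.random.uniform() < p[y] / p[x]:     # accept with probability min(1, p(y)/p(x))
        x = y
    counts[x] += 1
print(counts / N)   # compare with p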

More on MCMC

Metropolis-Hastings is an old-fashioned Monte Carlo method and there has been a lot of work on improving it. There are also packages which automate the MCMC solution of this type of problem; the two I have tried are stan and pymc3.
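For comparison, here is a minimal pymc3 sketch of the regression model above. It assumes the X and Y arrays generated earlier; argument names such as sd have changed across pymc3/pymc releases, so treat this as a sketch rather than tested code.

In [ ]:
import pymc3 as pm

# the same model: normal priors on m and b, normal likelihood around m*X + b
with pm.Model():
    m = pm.Normal('m', mu=0, sd=1)
    b = pm.Normal('b', mu=0, sd=1)
    y = pm.Normal('y', mu=m * X + b, sd=1, observed=Y)
    trace = pm.sample(2000, tune=1000)

print(trace['m'].mean(), trace['b'].mean())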
