<p><em>Cleverly Titled Blog: notes on mathematics, bioinformatics, and other stuff (Jeremy Teitelbaum).</em></p>
<h2>Updates (2020-02-20)</h2>
<h2 id="where-am-i">Where am I?</h2>
<p>This blog has been quiet for a while now, but there are still things happening;
they’re just reported in other places.</p>
<h3 id="bokeh">bokeh</h3>
<p>I’ve done some work on <a href="http://bokeh.pydata.org">bokeh</a>. In particular:</p>
<ul>
<li>a simulation of <a href="http://polyas-urn.herokuapp.com">Polya’s Urn</a></li>
<li>code that creates the structure graph of a bokeh model and <a href="https://gist.github.com/jeremy9959/984e2c09182a8ab0da692d04ab6c2e8a">draws it using bokeh</a></li>
<li>a <a href="http://bokehthemebuilder.herokuapp.com">theme builder</a> (work in progress) for bokeh</li>
</ul>
<h3 id="notes-for-ml">Notes for ML</h3>
<p>As part of my Spring project course on machine learning I’ve written some notes on fundamental things:</p>
<ul>
<li><a href="http://jeremy9959.net/Math-5800-Spring-2020/notebooks/BiasVariance.html">Bias and Variance</a></li>
<li><a href="http://jeremy9959.net/Math-5800-Spring-2020/notebooks/PolynomialRegression.html">Polynomial Regression</a></li>
<li><a href="http://jeremy9959.net/Math-5800-Spring-2020/notebooks/CurseOfDimensionality.html">Curse of Dimensionality</a></li>
<li><a href="http://jeremy9959.net/Math-5800-Spring-2020/notebooks/PrecisionRecall.html">Precision, Recall, and ROC</a></li>
<li><a href="http://jeremy9959.net/Math-5800-Spring-2020/notebooks/PCA.html">Principal Components</a></li>
</ul>
<h2 id="polyas-urn">Polya’s Urn (2020-01-07)</h2>
<p>Following up on the previous discussion of deFinetti’s Theorem, <a href="https://polyas-urn.herokuapp.com">this app</a> simulates
the convergence of the Polya’s urn process to the Beta distribution.</p>
<h2 id="some-further-references-on-definettis-theorem">deFinetti’s Theorem Part II (2019-12-09): some further references</h2>
<ul>
<li>P. Diaconis, <a href="http://statweb.stanford.edu/~sabatti/Stat370/synthese.pdf">Finite Forms of deFinetti’s Theorem on Exchangeability</a></li>
</ul>
<p>This paper shows how the problem of exchangeability can be interpreted geometrically and proves some “approximate” versions of the
theorem for finite sequences.</p>
<ul>
<li>G. R. Wood, <a href="https://projecteuclid.org/download/pdf_1/euclid.aop/1176989684">Binomial Mixtures and Finite Exchangeability</a>.</li>
</ul>
<p>This paper considers the problem of extending a finite exchangeable sequence to an infinite one; and also the problem of
determining how likely it is that a randomly chosen distribution on $\{0,\ldots, n\}$ will be a mixture of binomial distributions.
The questions turn out to be closely related. The paper gives a formula computing the fraction of binomial mixtures
among all distributions (which turns out to go to zero VERY quickly with increasing $n$) and then relates that to
the probability that an $n$ exchangeable sequence is infinitely extendible (which also goes to zero very quickly with $n$).</p>
<p>The results amount to some (non-trivial!) calculations of volumes of regions in simplices.</p>
<p>Apparently, as of the time of this paper, there was an unsolved conjecture due to Crisma giving a formula for the probability that an
exchangeable sequence of $n$ random variables can be extended to one of length $r$. What is its current status?</p>
<ul>
<li>P. Diaconis and D. Freedman, <a href="https://projecteuclid.org/download/pdf_1/euclid.aop/1176994663">Finite Exchangeable Sequences</a></li>
</ul>
<p>This paper computes the distance between the distribution of $k$ exchangeable random variables taking values in a (finite) set $S$
and the closest mixture of IID random variables. To explain (one of) their theorems, represent a
probability distribution on $S$ as a point of the $|S|$-simplex (the probability simplex in $\mathbb{R}^{|S|}$). Suppose $\mu$ is a probability distribution on this simplex.
Given such a $\mu$, let $P(k,\mu)$ be the distribution on $S^{k}$ obtained by first drawing a distribution from $\mu$ and then making $k$ iid choices from that distribution.</p>
<p><strong>Theorem:</strong>
Let $S$ be a finite set with $|S|$ elements. Let $P$ be an exchangeable probability on $S^{n}$. Then there is a probability $\mu$
on the $|S|$-simplex so that $| P_{k}-P(k,\mu ) | \le 2|S| k/n$ for all $k\le n$. Here $P_{k}$ is the marginal probability of $P$ on sequences of length $k\le n$.</p>
<p>In other words, if $P$ is a distribution on sequences of length $k$ that can be extended to an exchangeable sequence of length $n$, then
$P$ is within distance $2|S|k/n$ of a “mixture.”</p>
<p>The distance here is “variation distance” $|P-Q|=2\sup_A|P(A)-Q(A)|$ over Borel sets $A$.</p>Jeremy Teitelbaumsome links to follow up results on deFinetti's TheoremdeFinetti’s Theorem2019-12-05T00:00:00+00:002019-12-05T00:00:00+00:00https://jeremy9959.github.io/Blog/deFinetti<h2 id="de-finettis-theorem">De Finetti’s Theorem</h2>
<p>De Finetti’s theorem is a fundamental result in Bayesian probability and is closely related to the theory of the
Dirichlet Distribution and the Dirichlet Process which arise in clustering.</p>
<p>For the first part of this post we follow the lovely paper <a href="https://arxiv.org/pdf/1809.00882.pdf">An elementary proof of de Finetti’s Theorem</a> by <a href="https://www.fernuni-hagen.de/stochastik/team/werner.kirsch.shtml">Werner Kirsch</a>.</p>
<p><strong>Theorem:</strong> (de Finetti) Let $S=(X_1,X_2,\ldots)$ be an infinite sequence of $0,1$-valued random variables
that form an “exchangeable sequence,” meaning that $P(X_1=a_1,X_2=a_2,\ldots, X_N=a_N)$ is invariant under permutation
of the $X_i$ for any $N$. Then there exists a probability measure $\mu$ on $[0,1]$ such that, for any $N$ and any sequence
of zeros and ones $a_1,\ldots, a_N$, we have</p>
<script type="math/tex; mode=display">P(X_1=a_1,\ldots, X_N=a_N)=\int y^e(1-y)^s d\mu(y)</script>
<p>where $e$ is the number of $a$’s equal to $1$ and $s$ is the number of $a$’s equal to zero (so $e+s=N$).</p>
<p>This result is usually interpreted by saying that the infinite exchangeable sequence is a mixture of iid Bernoulli random
variables with mixture measure $\mu$ – so that to sample from the distribution
one first “picks” a probability $p$ from the distribution $\mu$ and then does $N$
flips of a Bernoulli coin with heads probability $p$.</p>
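<p>To see the theorem in action, here is a minimal simulation sketch (the mixing measure $\mu=\mathrm{Beta}(2,3)$ and the use of numpy are illustrative choices, not part of the theorem): sampling by first picking $p$ from $\mu$ and then flipping Bernoulli coins gives $P(X_1=1,X_2=1)=\int y^2\,d\mu(y)$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
a, b, trials, N = 2.0, 3.0, 200_000, 5

# first "pick" p from mu = Beta(a, b), then flip N Bernoulli(p) coins
p = rng.beta(a, b, size=trials)
X = rng.binomial(1, p[:, None], size=(trials, N))

# P(X_1 = 1, X_2 = 1) should match the second moment of mu
empirical = np.mean((X[:, 0] == 1) & (X[:, 1] == 1))
second_moment = (a / (a + b)) * ((a + 1) / (a + b + 1))  # E[p^2] for Beta(a, b)
print(empirical, second_moment)  # equal up to Monte Carlo error
</code></pre></div></div>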
<p>The fact that we have an <em>infinite</em> sequence of random variables is crucial for this result. See for example
Persi Diaconis’s early, and elementary, paper
<a href="http://statweb.stanford.edu/~sabatti/Stat370/synthese.pdf">Finite forms of de Finetti’s theorem on exchangeability</a> for the beginning
of a long story on the relationship between finite and infinite sets of exchangeable random variables.</p>
<p>To get a sense of how this all works, notice first of all that the “exchangeability condition” means that all of the
information in the sequence $S$ can be encoded in numbers $P(k,m)$ for $m\in\mathbb{N}$ and $0\le k\le m$ where</p>
<p>$P(k,m)$ is the probability of getting $k$ ones from a choice of $m$ variables $X_i$ from $S$.</p>
<p>In fact, suppose we have a collection $(a_1,\ldots, a_m)$ of zeros and ones, and $k=\sum a_i$ is the number of ones. Then</p>
<script type="math/tex; mode=display">P(X_{i_1}=a_1,X_{i_2}=a_2,\ldots,X_{i_m}=a_m) = \frac{P(k,m)}{\binom{m}{k}}.</script>
<p>since exchangeability means that the location of the $k$ $1$’s among the $m$ slots is uniformly distributed among the
$\binom{m}{k}$ possible sites.</p>
<p><strong>Lemma:</strong> The family of probabilities $P(k,m)$ described above (and hence the measure it determines on the product space)
is uniquely determined by the diagonal values $P(n,n)$.</p>
<p><strong>Proof:</strong> The fact that the $P(k,m)$ determine a measure on the full product space means that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}
P(X_1=a_1,\ldots,X_n=a_n)& = & P(X_1=a_1,\ldots,X_n=a_n,X_{n+1}=0) \cr
& & +P(X_1=a_1,\ldots,X_n=a_n,X_{n+1}=1).
\end{array} %]]></script>
<p>Therefore</p>
<script type="math/tex; mode=display">\frac{1}{\binom{n}{k}}P(k,n)=\frac{1}{\binom{n+1}{k}}P(k,n+1) + \frac{1}{\binom{n+1}{k+1}}P(k+1,n+1).</script>
<p>Rearranging and simplifying we obtain the relation</p>
<script type="math/tex; mode=display">P(k,n+1) = \frac{n+1}{n+1-k}P(k,n) - \frac{k+1}{n+1-k}P(k+1,n+1).</script>
<p>From this relation, if we know $P(k,n)$ for all $n<N$, together with the diagonal value $P(N,N)$, we can recursively compute
$P(k,N)$ for $k=N-1,N-2,\ldots,0$, and so obtain all of the $P(k,N)$.</p>
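<p>This recursion is easy to carry out in code. Here is a plain-Python sketch; the check uses the uniform mixing measure, for which $P(n,n)=1/(n+1)$ and, in fact, $P(k,n)=1/(n+1)$ for every $0\le k\le n$ (an illustrative choice).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def extend_diagonal(diag):
    """Given diag[n] = P(n,n) for n = 0,...,N, recover all P(k,n) from
    P(k,n+1) = (n+1)/(n+1-k) P(k,n) - (k+1)/(n+1-k) P(k+1,n+1)."""
    N = len(diag) - 1
    P = {(0, 0): diag[0]}
    for n in range(1, N + 1):
        P[(n, n)] = diag[n]
        for k in range(n - 1, -1, -1):  # fill column n from k = n-1 down to 0
            m = n - 1                   # the recurrence above, with its "n" equal to m
            P[(k, n)] = (m + 1) / (m + 1 - k) * P[(k, m)] - (k + 1) / (m + 1 - k) * P[(k + 1, n)]
    return P

# uniform mixing measure: the diagonal 1/(n+1) should extend to P(k,n) = 1/(n+1) for all k
P = extend_diagonal([1 / (n + 1) for n in range(9)])
assert all(abs(P[k, n] - 1 / (n + 1)) < 1e-12 for n in range(9) for k in range(n + 1))
</code></pre></div></div>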
<p><strong>Remark:</strong> You can’t specify the $P(k,k)$ at will; they have to yield a probability measure where
all the $P(k,n)$ are in $[0,1]$ and this imposes complicated and hidden conditions.</p>
<p>Notice also that the random variable</p>
<script type="math/tex; mode=display">S_N = \frac{1}{N}\sum_{i=1}^{N} X_i</script>
<p>gives the proportion of $1$’s among the first $N$ of the $X_i$, so its distribution is governed by the $P(k,m)$:
$S_N$ is supported on the fractions $k/N$ for $0\le k\le N$, and</p>
<script type="math/tex; mode=display">P(S_N=k/N) = P(k,N).</script>
<h2 id="kirschs-proof">Kirsch’s Proof</h2>
<p>The strategy of the proof is to exploit some theoretical results relating convergence of moments to weak convergence of measures.
This is a somewhat subtle point, since the polynomials are not bounded continuous functions
and so can’t be used directly as “test functions” for weak convergence.</p>
<p><strong>1.</strong> If $\nu$ is any probability measure on $\mathbb{R}$, and $\mu$ is a probability measure with support contained in $[a,b]$,
and if $\nu$ and $\mu$ have the same moments, then $\nu=\mu$.</p>
<p><strong>Proof:</strong> This is <a href="https://www.fernuni-hagen.de/stochastik/downloads/momente.pdf">Theorem 2.55</a> on page 28 of Kirsch’s book
on moments. That theorem asserts that if $\mu$ is a bounded measure with “moderately growing moments” and $\nu$ is a bounded
measure such that $\mu$ and $\nu$ have the same moments, then they are equal. Moderate growth means that the $k^{th}$
moment is bounded by $AC^{k}k!$ for constants $A$ and $C$ and all $k$. If $\mu$ is compactly supported, as in our case,
then this condition is automatic. The proof compares the characteristic functions (Fourier transforms)
of $\mu$ and $\nu$. The proof also relies on Prohorov’s theorem.</p>
<p><strong>2.</strong> If $\mu_n$ form a sequence of probability measures on $[0,1]$
all of whose moments $m_k(\mu_n)$ converge to some $m_k$ as $n\to\infty$, then the $\mu_n$ converge weakly to a unique probability
measure on $[0,1]$ with moments $m_k$.</p>
<p><strong>Proof:</strong> This is a consequence of <a href="https://www.fernuni-hagen.de/stochastik/downloads/momente.pdf">Theorem 2.56</a> on page 28 of Kirsch’s book (which is quite a bit more general). The sequence of measures $\mu_n$ has bounded moments (even bounded second moment is enough),
so it is <em>tight</em>. Since all are probability measures, we know that the sequence $\mu_n$ has a weakly convergent subsequence by
Prohorov’s theorem. Take any such convergent subsequence and let $\mu$ be the limit of that subsequence. The moments of $\mu$ are the limits $m_k$. Any other convergent subsequence has a limit $\mu'$ which has the same moments $m_k$, so $\mu=\mu'$. In other words,
every convergent subsequence of the $\mu_n$ has the same limit, so the sequence $\mu_n$ converges (weakly) to $\mu$.</p>
<p><strong>Proposition:</strong> The random variables $S_N$ converge in distribution to a probability measure $\mu$ on $[0,1]$ as $N\to\infty$
with moments
<script type="math/tex">\int y^{k}d\mu(y) = P(k,k).</script></p>
<p><strong>Proof:</strong> Given <strong>1</strong> and <strong>2</strong> above, we need to check that the moments of the distribution $\mu_N$ determined by $S_N$ converge
to $P(k,k)$ as claimed. In other words, we need to compute
<script type="math/tex">\lim_{N\to\infty} E((\frac{1}{N}\sum_{i=1}^{N} X_i)^{k}) = \lim_{N\to\infty}\frac{1}{N^{k}}\sum_{i_1,\ldots,i_k} E(X_{i_1}X_{i_2}\cdots X_{i_k})</script>
Since the individual random variables $X_i$ are idempotent ($X_i^{2}=X_i$, as they take only the values $0$ and $1$), and because of exchangeability, to work this out we need to count
how many times each monomial equivalent to $X_1 X_2\cdots X_r$ for $1\le r\le k$ occurs in the sum. For the case where all
$X_i$ are distinct, we have the number of ways to choose a subset of size $k$ from $N$ elements, or $\binom{N}{k}$; and then
each subset contributes $k!$ terms. Thus the number of terms of this form grows with leading term $N^{k}$.</p>
<p>For all of the other situations, we need to count lists of $k$ indices from $\{1,\ldots,N\}$ with <em>at least one repetition.</em>
To estimate the number of these, we choose at most $k-1$ indices between $1$ and $N$ for a total of choices on the order
of $N^{k-1}$; and then we make a list of length $k$ of choices from this set with $k-1$ elements, for a total of at most
$(k-1)^kN^{k-1}$ choices. (In fact this is a massive overcount but we only care about the leading order of $N$).</p>
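<p>As a quick sanity check on this counting (a throwaway snippet using only the standard library):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from math import perm

k = 3
for N in (10, 100, 1000, 10000):
    # fraction of k-tuples of indices from {1,...,N} with all entries distinct
    print(N, perm(N, k) / N**k)  # tends to 1, so the repeated-index terms vanish in the limit
</code></pre></div></div>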
<p>It follows that, as $N\to\infty$, only the terms with $k$ distinct $X_i$ survive; and in the limit we obtain</p>
<script type="math/tex; mode=display">\lim_{N\to\infty} E((\frac{1}{N}\sum_{i=1}^{N} X_i)^{k}) = E(X_1 X_2\cdots X_k).</script>
<p>Finally, the product $X_1 X_2\cdots X_k$ is zero unless all of the $X_i$ are $1$, and thus this expectation is exactly $P(k,k)$.</p>
<p>This proof, and the general results above, tell us that the De Finetti measure $\mu$ is characterized by the $P(k,k)$:</p>
<script type="math/tex; mode=display">\int y^{k} d\mu = P(k,k).</script>
<p>To complete the proof, we must show that</p>
<script type="math/tex; mode=display">\frac{1}{\binom{m}{k}}P(k,m) = \int y^{k}(1-y)^{m-k} d\mu ,</script>
<p>where the binomial coefficient appears because there are $\binom{m}{k}$ ways of placing $k$ ones among $m$ tries. For the right-hand sides
we have the identity</p>
<script type="math/tex; mode=display">\int y^{k}(1-y)^{m+1-k}d\mu = \int y^{k}(1-y)^{m-k} d\mu - \int y^{k+1}(1-y)^{m-k}d\mu</script>
<p>and so the candidate probabilities $\binom{m}{k}\int y^{k}(1-y)^{m-k}d\mu$ satisfy the same recurrence as the $P(k,m)$. Since they also agree with the $P(k,k)$ on the diagonal, the Lemma shows the two families coincide, which completes the proof.</p>
<h3 id="bernoulli-mixture-following-bishops-pattern-recognition-and-machine-learning-section-933">Bernoulli Mixture (2019-11-05), following Bishop’s <a href="https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf">Pattern Recognition and Machine Learning</a>, Section 9.3.3</h3>
<p>The code below gives a basic implementation of the Bernoulli mixture model fit with the EM algorithm. The essential equations are 9.57 and 9.58 in Bishop. We apply it to the fashion-mnist dataset, which has 10 classes, but we look for 20 clusters and so pick up variations among the purses, etc. The resulting figure is basically a version of Figure 9.10 in Bishop.
Different runs of this code produce different results!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># %load bernoulli.py
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">log</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">logsumexp</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span>
<span class="k">class</span> <span class="nc">BernoulliMixture</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_clusters</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span> <span class="n">alpha1</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">,</span> <span class="n">alpha2</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">):</span>
<span class="s">'''sets things up'''</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span> <span class="o">=</span> <span class="n">n_clusters</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_n_iter</span> <span class="o">=</span> <span class="n">n_iter</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_tolerance</span> <span class="o">=</span> <span class="n">tolerance</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_alpha1</span> <span class="o">=</span> <span class="n">alpha1</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_alpha2</span> <span class="o">=</span> <span class="n">alpha2</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_P</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">Mu</span><span class="p">,</span> <span class="n">Theta</span><span class="p">):</span>
<span class="s">'''computes the log of the conditional probability of the latent variable given the data and Mu, Theta'''</span>
<span class="n">ll</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">log</span><span class="p">(</span><span class="n">Mu</span><span class="p">))</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">x</span><span class="p">,</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">Mu</span><span class="p">))</span>
<span class="n">Z</span> <span class="o">=</span> <span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="n">Theta</span><span class="p">)</span><span class="o">+</span> <span class="n">ll</span> <span class="o">-</span> <span class="n">logsumexp</span><span class="p">(</span><span class="n">ll</span><span class="o">+</span><span class="n">log</span><span class="p">(</span><span class="n">Theta</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="k">return</span> <span class="n">Z</span>
<span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="s">'''carries out the EM iteration'''</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_n_samples</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_n_features</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">shape</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">.25</span><span class="p">,</span> <span class="mf">.75</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_features</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_features</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span><span class="p">)</span>
<span class="n">N</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_n_samples</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_n_iter</span><span class="p">):</span>
<span class="n">V</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_P</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Mu</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="p">))</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">V</span><span class="o">/</span><span class="n">V</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">R</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">W</span><span class="p">)</span>
<span class="n">Q</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_old</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Mu</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="p">(</span><span class="n">R</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_alpha1</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">Q</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_n_features</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">_alpha1</span><span class="p">),</span> <span class="p">(</span><span class="n">Q</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_alpha2</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">N</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_n_clusters</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">_alpha2</span><span class="p">)</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">allclose</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_old</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="p">):</span>
<span class="k">return</span>
<span class="k">def</span> <span class="nf">predict_proba</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="s">'''computes the conditional probability giving cluster membership'''</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_P</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Mu</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">'''generates data from the distribution'''</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">multinomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="o">.</span><span class="n">ravel</span><span class="p">())</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">Mu</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">Theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
<span class="k">return</span> <span class="n">z</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">data_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/fashion-mnist_train.csv'</span><span class="p">,</span><span class="n">nrows</span><span class="o">=</span><span class="mi">5000</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data_df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:]</span><span class="o">.</span><span class="n">values</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span> <span class="o">//</span> <span class="mi">128</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">BernoulliMixture</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">M</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">set_size_inches</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">j</span> <span class="ow">in</span> <span class="n">product</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)):</span>
<span class="k">if</span> <span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="n">j</span><span class="o"><</span><span class="mi">20</span><span class="p">:</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">M</span><span class="o">.</span><span class="n">Mu</span><span class="p">[:,</span> <span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'trial.png'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/BernoulliDemo_2_0.png" /></p>Jeremy TeitelbaumBernoulli Mixture following Bishop, Section 9.3.3Another look at EM2019-10-31T00:00:00+00:002019-10-31T00:00:00+00:00https://jeremy9959.github.io/Blog/EMagain<h2 id="another-look-at-expectation-maximization-and-gaussian-mixtures--long-winded">Another look at expectation maximization and gaussian mixtures – long winded!</h2>
<p>We will consider the one dimensional case. To make things concrete we will work with the famous “old faithful” dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'seaborn'</span><span class="p">)</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">'figure.figsize'</span><span class="p">]</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span>
</code></pre></div></div>
<h2 id="a-look-at-the-data">A look at the data</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'old_faithful.csv'</span><span class="p">,</span><span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">ax</span><span class="o">=</span><span class="n">data</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'eruptions'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'waiting'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_4_0.png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ax</span><span class="o">=</span><span class="n">data</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_5_0.png" /></p>
<p>The goal is to fit a mixture of two gaussians to (say) the eruptions data. Such a model is given by five parameters:</p>
<ul>
<li>$\theta$ is the weight attached to one of the gaussians, $1-\theta$ is the other weight</li>
<li>$\mu_0,\sigma_0$ are the mean and standard deviation of the first gaussian</li>
<li>$\mu_1,\sigma_1$ are the mean and standard deviation of the other.</li>
</ul>
<p>We will use $\mu,\sigma$ to denote the pairs of means and deviations for simplicity.</p>
<p>For future reference, recall that the normal distribution is</p>
<script type="math/tex; mode=display">N(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{(x-\mu)^2}{2\sigma^2}}</script>
<p>The distribution function for the mixture is</p>
<script type="math/tex; mode=display">P(x|\theta,\mu,\sigma) = \theta N(x|\mu_0,\sigma_0) + (1-\theta) N(x|\mu_1,\sigma_1)</script>
<p>The first step of expectation maximization is to introduce a “latent” or hidden variable $z$ which can take values $0$ or $1$. The events in the expanded model with explicit variables are pairs $(x,0)$ or $(x,1)$ corresponding to whether $x$ arose from one or the other underlying gaussians. Let $p(z)=\theta$ if $z=0$ or $1-\theta$ if $z=1$. Then
the joint probability distribution for $(x,z)$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
P((x,z)|\theta,\mu,\sigma)=\begin{cases}
\theta N(x|\mu_0,\sigma_0) & z=0 \cr
(1-\theta) N(x|\mu_1,\sigma_1) & z=1\cr
\end{cases} %]]></script>
<p>The marginal probability of $x$ is the mixture:</p>
<script type="math/tex; mode=display">P(x|\theta,\mu,\sigma) = \sum_z P((x,z)|\theta,\mu,\sigma) = \theta N(x|\mu_0,\sigma_0) + (1-\theta) N(x|\mu_1,\sigma_1)</script>
<p>The conditional probabilities are going to be relevant, so let’s look at them.
\begin{eqnarray}
P(z|x,\theta,\mu,\sigma) &=& \frac{P((x,z)|\theta,\mu,\sigma)}{P(x|\theta,\mu,\sigma)} \cr
&=& \frac{p(z)N(x|\mu_z,\sigma_z)}{\sum_z P((x,z)|\theta,\mu,\sigma)}
\end{eqnarray}</p>
<p>For the other conditional probability:</p>
<script type="math/tex; mode=display">P(x|z,\theta,\mu,\sigma) = \frac{P((x,z)|\theta,\mu,\sigma)}{P(z|\theta,\mu,\sigma)}</script>
<p>which boils down to</p>
<script type="math/tex; mode=display">% <![CDATA[
P(x|z,\theta,\mu,\sigma) = \begin{cases} N(x|\mu_0,\sigma_0) & z=0 \cr N(x|\mu_1,\sigma_1) & z=1 \cr\end{cases} %]]></script>
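<p>As a sketch, the first of these conditional probabilities can be packaged as a small helper (a hypothetical function, written with scipy’s <code>norm</code>; the notebook cells below compute the same quantity inline):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import norm

def responsibility(x, theta, mu, sigma):
    """P(z=0 | x, theta, mu, sigma) for the two-component mixture."""
    p0 = theta * norm(mu[0], sigma[0]).pdf(x)
    p1 = (1 - theta) * norm(mu[1], sigma[1]).pdf(x)
    return p0 / (p0 + p1)
</code></pre></div></div>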
<p>Now let’s go back to the actual data. Expectation maximization is an iterative algorithm that begins with more or less arbitrary parameters and then successively improves them. The goal is to find parameters $\theta,\mu,\sigma$ that
make the data most likely. For clarity, we will write $\mathbf{x}$ and $\mathbf{z}$ to be the vector of data points
$(x_1,\ldots, x_n)$ and a vector of “assignments” $(z_1,\ldots, z_n)$.</p>
<p>The trick is to consider a probability distribution $q(\mathbf{z})$ on the vector of assignments of each data point to a cluster. For any such distribution we have a tautological result:</p>
<p>\begin{eqnarray}
\log P(\mathbf{x}|\theta,\mu,\sigma) &=& \sum_{\mathbf{z}}q(\mathbf{z})\log P(\mathbf{x}|\theta,\mu,\sigma) \cr
&=& \sum_{\mathbf{z}}q(\mathbf{z})\log\frac{P(\mathbf{x},\mathbf{z}|\theta,\mu,\sigma)}{P(\mathbf{z}|\mathbf{x},\theta,\mu,\sigma)} \cr
\end{eqnarray}</p>
<p>and this can be rearranged to yield</p>
<script type="math/tex; mode=display">P(\mathbf{x}|\theta,\mu,\sigma) = \sum_{\mathbf{z}}q(\mathbf{z})\log\frac{P((\mathbf{x},\mathbf{z})|\theta,\sigma,\mu)}{q(\mathbf{z})} -\sum_{\mathbf{z}}q(\mathbf{z})\log\frac{P(\mathbf{z}|\mathbf{x},\sigma,\mu,\theta)}{q(\mathbf{z})}</script>
<p>Each of the two terms on the right has a particular role in the EM algorithm. Let’s give them names.</p>
<script type="math/tex; mode=display">\mathcal{L}(\theta,\mu,\sigma) = \sum_{\mathbf{z}}q(\mathbf{z})\log\frac{P((\mathbf{x},\mathbf{z})|\theta,\mu,\sigma)}{q(\mathbf{z})}</script>
<script type="math/tex; mode=display">KL(q||P) = -\sum_{\mathbf{z}}q(\mathbf{z})\log\frac{P(\mathbf{z}|\mathbf{x},\sigma,\mu,\theta)}{q(\mathbf{z})}</script>
<p>The $KL$ term is the “Kullback-Leibler divergence” between the conditional probability on the vector of cluster assignments
and the <em>a priori</em> chosen distribution $q(\mathbf{z})$. This divergence is always greater than or equal to zero, and it is zero only when $q(\mathbf{z})=P(\mathbf{z}|\mathbf{x},\sigma,\mu,\theta)$. <em>This is where Jensen’s inequality is applied, because the non-negativity of KL is essentially Jensen’s inequality.</em></p>
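<p>A throwaway numerical illustration of these two facts about the divergence (the example distributions are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def kl(q, p):
    """Discrete Kullback-Leibler divergence KL(q || p)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

print(kl([0.5, 0.5], [0.9, 0.1]))  # strictly positive since q != p
print(kl([0.3, 0.7], [0.3, 0.7]))  # exactly 0 when q == p
</code></pre></div></div>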
<p>The EM strategy works like this. First, we choose $q(\mathbf{z})$ to be $P(\mathbf{z}|\mathbf{x},\sigma,\mu,\theta)$. That forces
the KL divergence term to zero, so the log-likelihood of the data equals $\mathcal{L}$. We then find $\theta',\mu',\sigma'$ which maximize $\mathcal{L}$. For those parameters, we have</p>
<script type="math/tex; mode=display">\log P(\mathbf{x}|\theta',\sigma',\mu')=\mathcal{L}(\theta',\mu',\sigma')+KL'>\log P(\mathbf{x}|\theta,\mu,\sigma).</script>
<p>The conditional probabilities having changed, the $KL$ term is in general again non-zero, so we can repeat the argument with a new $q$ set to the new conditional probabilities and again drive the log-likelihood up.</p>
<p>So the key element here is how to maximize $\mathcal{L}(\theta,\mu,\sigma)$ when</p>
<script type="math/tex; mode=display">q(\mathbf{z})=P(\mathbf{z}|\mathbf{x},\theta,\mu,\sigma).</script>
<p>To look at this, first notice that</p>
<script type="math/tex; mode=display">\mathcal{L}(\theta,\mu,\sigma)=\sum_{\mathbf{z}}q(\mathbf{z})\log P((\mathbf{x},\mathbf{z})|\theta,\mu,\sigma)-\sum_{\mathbf{z}}q(\mathbf{z})\log q(\mathbf{z})</script>
<p>For the purposes of maximizing $\mathcal{L}$ in this process, $q(\mathbf{z})$ is a constant, so we only need to look at the first term. Using our formulae above, the sum over $\mathbf{z}$ is the sum over all vectors of $1$’s and $0$’s of length $n$, with the probability of a vector given by
independent choices with the probability of a zero in the $i^{th}$ position being $P(z=0|x=x_i,\theta,\mu,\sigma)$ which is computed above. In other words, we are summing over all possible assignments of points $x_i$ to one cluster or the other.</p>
<p>In fact we are looking at an expectation over $\mathbf{z}$ of a sum of contributions from the independent pairs $(x_i,z_i)$, so the result
is the sum of the expectations.</p>
<p>Since for a single pair $(x,z)$ we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\log P((x,z)|\theta,\mu,\sigma)=\begin{cases} \log(\theta)-\log(\sigma_0)-(1/2)\log(2\pi)-\frac{(x-\mu_{0})^2}{2\sigma_0^{2}} & z=0 \cr
\log(1-\theta)-\log(\sigma_1)-(1/2)\log(2\pi)-\frac{(x-\mu_1)^2}{2\sigma_1^2} & z=1\cr\end{cases} %]]></script>
<p>we have (writing $p_z$ for the conditional probability $P(z=0|x,\theta,\mu,\sigma)$):</p>
<p>\begin{eqnarray}
E_{z}\log P((x,z)|\theta,\mu,\sigma) &=& [p_z\log(\theta)+(1-p_z)\log(1-\theta)] - [p_z\log(\sigma_0) +(1-p_z)\log(\sigma_1)]\cr
&& -p_z((x-\mu_0)^2/2\sigma_0^2)-(1-p_z)((x-\mu_1)^2/2\sigma_1^2)+C
\end{eqnarray}</p>
<p>Summing this over the coordinates, and taking the relevant derivatives with respect to $\theta$, $\sigma_0$, $\sigma_1$, $\mu_0$ and $\mu_1$ we get the following equations.</p>
<p>Write $q(i)$ for the $i^{th}$ component of the conditional probabilities that were used to construct the distribution $q(\mathbf{z})$:</p>
<script type="math/tex; mode=display">q(i) = P(z=0|x_i,\theta,\mu,\sigma)</script>
<script type="math/tex; mode=display">\sum_{i=1}^{n} \frac{q(i)}{\theta}+\frac{1-q(i)}{1-\theta} = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^{n} \frac{q(i)(x_i-\mu_0)}{\sigma_{0}^2}=0</script>
<script type="math/tex; mode=display">\sum_{i=1}^{n} -\frac{q(i)}{\sigma_{0}}+\frac{q(i)((x_i-\mu_0)^2)}{\sigma_0^3}=0</script>
<p>with two more equations for $\mu_1$ and $\sigma_1$ replacing $q(i)$ with $1-q(i)$.</p>
<p>Now let $n_0=\sum_{i=1}^{n} q(i)$ and $n_1=n-n_0$. Then we get the following formulae for the “new”
$\theta,\mu,\sigma$:</p>
<script type="math/tex; mode=display">\theta = n_0/n</script>
<script type="math/tex; mode=display">\mu_0 = \frac{\sum_{i=1}^{n} q(i)x_i}{n_0}</script>
<script type="math/tex; mode=display">\mu_1 = \frac{\sum_{i=1}^{n} (1-q(i))x_i}{n_1}</script>
<script type="math/tex; mode=display">\sigma_0^2 = \frac{\sum q(i)(x-\mu_0)^2}{n_0}</script>
<script type="math/tex; mode=display">\sigma_1^2 = \frac{\sum (1-q(i))(x-\mu_1)^2}{n_1}</script>
<p>Now let’s try this with the data, starting from some rough initial parameters.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'eruptions'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">theta</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">mu_0</span><span class="p">,</span> <span class="n">sigma_0</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span>
<span class="n">mu_1</span><span class="p">,</span> <span class="n">sigma_1</span> <span class="o">=</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>This mixture is not that great; let’s look.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[<matplotlib.lines.Line2D at 0x7ff77b472d10>]
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_26_1.png" /></p>
<p>With these parameters, we compute the conditional probabilities that give the distribution $q(\mathbf{z})$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">P</span> <span class="o">=</span> <span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div></div>
<p>P is an estimate of the chance that a point belongs to the first of the two clusters; a quick look shows that P does seem to split the data into two groups.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ax</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">P</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_30_0.png" /></p>
<p>Now we can update the parameters according to our formulae.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">N0</span> <span class="o">=</span> <span class="n">P</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">N1</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">N0</span><span class="o">/</span><span class="n">N</span>
<span class="n">mu0</span> <span class="o">=</span> <span class="p">(</span><span class="n">P</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">N0</span>
<span class="n">mu1</span> <span class="o">=</span> <span class="p">((</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">N1</span>
<span class="n">sigma0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">P</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu0</span><span class="p">))</span><span class="o">/</span><span class="n">N0</span><span class="p">)</span>
<span class="n">sigma1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">((</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu1</span><span class="p">))</span><span class="o">/</span><span class="n">N1</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s look at the new mixture.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">mu0</span><span class="p">,</span><span class="n">sigma0</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">mu1</span><span class="p">,</span> <span class="n">sigma1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">),</span><span class="n">linewidth</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[<matplotlib.lines.Line2D at 0x7ff78143b090>]
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_34_1.png" /></p>
<p>Now we can try to watch the whole process.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">.4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">P</span> <span class="o">=</span> <span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">N0</span> <span class="o">=</span> <span class="n">P</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">N1</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">N0</span><span class="o">/</span><span class="n">N</span>
<span class="n">mu0</span> <span class="o">=</span> <span class="p">(</span><span class="n">P</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">N0</span>
<span class="n">mu1</span> <span class="o">=</span> <span class="p">((</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">N1</span>
<span class="n">sigma0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">P</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu0</span><span class="p">))</span><span class="o">/</span><span class="n">N0</span><span class="p">)</span>
<span class="n">sigma1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">((</span><span class="mi">1</span><span class="o">-</span><span class="n">P</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu1</span><span class="p">))</span><span class="o">/</span><span class="n">N1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">),</span><span class="n">linewidth</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">mu0</span><span class="p">,</span> <span class="n">sigma0</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">mu1</span><span class="p">,</span> <span class="n">sigma1</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_36_0.png" /></p>
<p>And the final picture looks like this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span><span class="n">theta</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">u</span><span class="p">),</span><span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[<matplotlib.lines.Line2D at 0x7ff77b1b65d0>]
</code></pre></div></div>
<p><img src="/Blog/assets/images/emagain_38_1.png" /></p>
<p>Cool!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
</code></pre></div></div>Jeremy TeitelbaumAnother, more informed look at EM and gaussian mixturesGaussian Mixture in Stan2019-10-28T00:00:00+00:002019-10-28T00:00:00+00:00https://jeremy9959.github.io/Blog/StanMixture<h2 id="gaussian-mixture-model-via-stan-mcmc">Gaussian Mixture Model via Stan (MCMC)</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pystan</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">'figure.figsize'</span><span class="p">]</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'ggplot'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'old_faithful.csv'</span><span class="p">,</span><span class="n">header</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">scatter</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'eruptions'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'waiting'</span><span class="p">,</span><span class="n">title</span><span class="o">=</span><span class="s">'Old Faithful Dataset'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/stan_mixture_2_0.png" /></p>
<h4 id="the-stan-code-and-fit">The stan code and fit</h4>
<p>The following <a href="https://mc-stan.org">stan</a> code was taken from <a href="https://betanalpha.github.io/assets/case_studies/identifying_mixture_models.html">Michael Betancourt’s discussion of degeneracy in mixture models.</a> The thrust of his article is that the symmetry among the components of the mixture makes the model a problem for MCMC sampling. In this code, the means mu are given an “ordered” type, which breaks that symmetry by artificially distinguishing between the components.</p>
<p><strong>Note:</strong> This handles only one dimension of the data (eruptions).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">beta_code</span><span class="o">=</span><span class="s">"""
data {
int<lower = 0> N;
vector[N] y;
}
parameters {
ordered[2] mu;
real<lower=0> sigma[2];
real<lower=0, upper=1> theta;
}
model {
sigma ~ normal(0, 2);
mu ~ normal(0, 2);
theta ~ beta(5, 5);
for (n in 1:N)
target += log_mix(theta,
normal_lpdf(y[n] | mu[1], sigma[1]),
normal_lpdf(y[n] | mu[2], sigma[2]));
}
"""</span>
</code></pre></div></div>
<p>We standardize the data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Eruptions (standardized)'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">dfs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/stan_mixture_6_0.png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">pystan</span><span class="o">.</span><span class="n">StanModel</span><span class="p">(</span><span class="n">model_code</span><span class="o">=</span><span class="n">beta_code</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_4a1a8842066ad04ed5bb50ec34a6de05 NOW.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">faithful_data</span><span class="o">=</span><span class="p">{</span><span class="s">'N'</span><span class="p">:</span><span class="n">dfs</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="s">'y'</span><span class="p">:</span><span class="n">dfs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]}</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">sampling</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">faithful_data</span><span class="p">,</span><span class="nb">iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span><span class="n">warmup</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="the-result-of-the-sampling">The result of the sampling</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Inference for Stan model: anon_model_4a1a8842066ad04ed5bb50ec34a6de05.
4 chains, each with iter=10000; warmup=1000; thin=1;
post-warmup draws per chain=9000, total post-warmup draws=36000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu[1] -1.29 1.3e-4 0.02 -1.33 -1.3 -1.29 -1.27 -1.24 31364 1.0
mu[2] 0.69 1.4e-4 0.03 0.63 0.67 0.69 0.71 0.75 44695 1.0
sigma[1] 0.21 1.1e-4 0.02 0.18 0.2 0.21 0.23 0.26 33690 1.0
sigma[2] 0.38 1.3e-4 0.02 0.34 0.37 0.38 0.4 0.43 37065 1.0
theta 0.35 1.4e-4 0.03 0.3 0.34 0.35 0.37 0.41 44342 1.0
lp__ -252.9 0.01 1.58 -256.8 -253.7 -252.5 -251.7 -250.8 16745 1.0
Samples were drawn using NUTS at Tue Oct 29 10:22:55 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
</code></pre></div></div>
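<p>Rather than transcribing the posterior means from the printed table into the plotting code below (which hardcodes them), one can pull them out of the fit object. A minimal sketch, assuming the pystan 2 <code class="language-plaintext highlighter-rouge">extract</code> API:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># posterior draws, as a dict mapping parameter names to arrays of samples
samples = fit.extract()
mu_hat = samples['mu'].mean(axis=0)        # posterior means of mu[1], mu[2]
sigma_hat = samples['sigma'].mean(axis=0)  # posterior means of sigma[1], sigma[2]
theta_hat = samples['theta'].mean()        # posterior mean of theta
print(mu_hat, sigma_hat, theta_hat)
</code></pre></div></div>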
<h4 id="plot-of-the-distribution-obtained-from-the-posterior-mean-parameters">Plot of the distribution obtained from the posterior mean parameters.</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="n">fig</span><span class="p">,</span><span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">axes</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">dfs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span><span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">x</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="mf">.65</span><span class="o">*</span><span class="n">norm</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">loc</span><span class="o">=</span><span class="mf">.69</span><span class="p">,</span><span class="n">scale</span><span class="o">=</span><span class="mf">.38</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mf">.35</span><span class="p">)</span><span class="o">*</span><span class="n">norm</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">loc</span><span class="o">=-</span><span class="mf">1.29</span><span class="p">,</span><span class="n">scale</span><span class="o">=</span><span class="mf">0.21</span><span class="p">),</span><span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/Blog/assets/images/stan_mixture_12_0.png" /></p>Jeremy TeitelbaumFitting a gaussian mixture in StantSNE on 5000 fashion-MNIST images (bokeh visualization)2019-09-27T00:00:00+00:002019-09-27T00:00:00+00:00https://jeremy9959.github.io/Blog/FashionMNISTtSNE<iframe src="http://tsne-fashion.herokuapp.com" width="900" height="800" frameborder="0"></iframe>Jeremy Teitelbaumbrowse the tSNE clustersCarlsson-Memoli on Hierarchical Clustering2019-09-17T00:00:00+00:002019-09-17T00:00:00+00:00https://jeremy9959.github.io/Blog/CarlssonMemoli<p>The paper <a href="http://www.jmlr.org/papers/volume11/carlsson10a/carlsson10a.pdf">Characterization, Stability, and Convergence of Hierarchical Clustering Methods</a>
by Carlsson and Memoli applies topological ideas to study hierarchical clustering. They obtain a result complementary to that of Kleinberg,
in the sense that they show single-linkage hierarchical clustering does have good categorical properties as a functor from finite metric spaces to ultrametric trees.</p>
<p>By a partition of a finite set $X$ we mean a disjoint covering of $X$ by subsets.</p>
<p><strong>Definition:</strong> Let $X$ be a finite set and $\theta:[0,\infty]\to \mathcal{P}(X)$ where $\mathcal{P}(X)$ is the set of partitions of $X$. The pair $(X,\theta)$
is a dendrogram if:</p>
<ul>
<li>$\theta(0)$ is the maximal partition of $X$ into one-element subsets.</li>
<li>For sufficiently large $t$, $\theta(t)$ is the trivial partition of $X$ into a single subset.</li>
<li>If $r<s$, then $\theta(r)$ is a refinement of $\theta(s)$.</li>
<li>For all $r$ there exists $\epsilon>0$ so that $\theta(r)=\theta(s)$ for all $s\in [r,r+\epsilon]$.</li>
</ul>
<p>A dendrogram on $X$ is equivalent to an ultrametric on $X$; that is, a metric that satisfies the stronger triangle inequality $\rho(x,y)\le\max(\rho(x,z),\rho(z,y))$.
Given an ultrametric, the associated dendrogram comes from the partitions whose sets are the equivalence classes of the relation $\rho(x,y)\le r$; given a dendrogram,
the associated ultrametric is obtained by setting $\rho(x,y)$ to be the smallest value of $r$ for which the two points $x$ and $y$ are in the same element of the
partition $\theta(r)$.</p>
<p>A hierarchical clustering algorithm associates a dendrogram to a finite metric space; indeed, Carlsson-Memoli define a hierarchical
clustering method to be a map from finite metric spaces to ultrametric spaces on the same underlying set. The simplest (and in some sense, as C-M show, the most canonical)
such algorithm is “single linkage clustering”, which has several equivalent definitions.</p>
<p><strong>Definition 1:</strong> For each $r$, define an equivalence relation on $X$ so that $x\sim_{r}y$ when there is a sequence of points $x_0=x,x_1,\ldots,x_k=y$
so that all of the distances $\rho(x_i,x_{i+1})$ are less than or equal to $r$. Equivalently, construct an ultrametric by setting
<script type="math/tex">\mu(x,y)=\inf\max\{\rho(x_i,x_{i+1})\}</script>
where the maximum is taken over all the steps in a sequence of points running from $x$ to $y$, and the infimum is taken over all such sequences;
then take the associated dendrogram. This is called the <strong>maximum subdominant ultrametric.</strong> It is the largest among all ultrametrics that are
less than or equal to the original metric $\rho$.</p>
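<p>To make this concrete, here is a minimal sketch (my own illustrative code, not C-M’s) that computes the maximum subdominant ultrametric from a matrix of pairwise distances, using the minimax variant of the Floyd-Warshall recursion:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def subdominant_ultrametric(D):
    """Maximum subdominant ultrametric of a finite metric space.

    D is a symmetric (n, n) array of pairwise distances; the result U
    satisfies U[i, j] = min over chains from i to j of the largest step.
    """
    U = np.array(D, dtype=float)
    n = U.shape[0]
    for k in range(n):
        # allow chains through point k: the value at (i, j) can drop
        # to max(U[i, k], U[k, j])
        U = np.minimum(U, np.maximum(U[:, [k]], U[[k], :]))
    return U
</code></pre></div></div>
<p>For three collinear points with consecutive gaps of length $1$, so that the outer pair is at distance $2$, this returns $1$ for every pair: the two short steps chain the far pair together.</p>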
<p>One feature of this construction which may be non-standard is that it allows multiple points to coalesce at the same time. For example,
consider a finite graph where all the edge lengths are one and the distance is the length of the shortest path between vertices. The associated
ultrametric puts all points at distance one from each other, so the associated dendrogram collapses everything at time one. <em>This is different from
the single linkage algorithm in the <code class="language-plaintext highlighter-rouge">sklearn</code> library, for example, which constructs a binary tree by making arbitrary choices of what to merge
when there is a group of equidistant points.</em></p>
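<p>The difference is easy to see in code. Here is a small check using <code class="language-plaintext highlighter-rouge">scipy</code>’s single linkage (an illustrative example, not from the paper):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# three mutually equidistant points, all at distance 1
D = np.ones((3, 3)) - np.eye(3)
print(linkage(squareform(D), method='single'))
# scipy reports two merges, both at height 1.0: an arbitrary pair merges
# first and the third point then joins.  The subdominant ultrametric
# instead collapses all three points simultaneously at scale 1.
</code></pre></div></div>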
<p><strong>Theorem: (C-M)</strong> Let $\mathcal{T}$ be a hierarchical clustering method – that is, a map which associates to a finite metric space $X$
an ultrametric space with the same underlying points. Then the following properties characterize single linkage clustering:</p>
<ul>
<li>On the two point set, where $x$ and $y$ are at distance $\delta$, one obtains the two point set with the ultrametric $u$ such that $u(x,y)=\delta$.</li>
<li>Whenever $\phi: X\to Y$ is a distance non-increasing map, the induced map on the associated ultrametric spaces is also distance non-increasing.</li>
<li>The minimum non-trivial distance between points of $X$ measured by the ultrametric is at least the minimum non-trivial distance measured by the metric.</li>
</ul>
<p>Implicit in this paper by C-M is a functorial notion of hierarchical clustering that is made more explicit in their
paper <a href="https://link.springer.com/article/10.1007/s10208-012-9141-9">Classifying Clustering Schemes</a>.<br />
Let $\mathcal{M}$ be the category of finite metric spaces with distance non-increasing maps, and let $\mathcal{P}$ be the category
of “persistent sets” – where a persistent set is defined exactly like a dendrogram but without the condition that there is a large enough $t$
such that $\theta(t)$ is the trivial partition of $X$ into a single subset. This means that one has an ultrametric in which some of the distances may be infinite.</p>
<p>Then the rule that sends a finite metric space $X$ to the same set with the maximum subdominant ultrametric is a functor from the
category of finite metric spaces with distance non-increasing maps to the category of “persistent sets”, where a morphism $f:X\to Y$
has the property that $\theta_X(r)$ is a refinement of $f^{*}(\theta_{Y}(r))$ for all $r$. Essentially this means that the map is distance
non-increasing with respect to the associated ultrametrics.</p>
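<p>Functoriality can be checked numerically in a toy case: an inclusion of metric spaces is distance non-increasing, and adding points only creates new chains, so the subdominant ultrametric can only get smaller on the original points. Reusing the <code class="language-plaintext highlighter-rouge">subdominant_ultrametric</code> sketch from above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# X: two points at distance 3
X = np.array([[0., 3.],
              [3., 0.]])

# Y: the same two points plus a third point at distance 1.6 from each;
# the triangle inequality holds since 3 <= 1.6 + 1.6
Y = np.array([[0. , 3. , 1.6],
              [3. , 0. , 1.6],
              [1.6, 1.6, 0. ]])

print(subdominant_ultrametric(X)[0, 1])  # 3.0
print(subdominant_ultrametric(Y)[0, 1])  # 1.6: the induced map on
# ultrametric spaces is again distance non-increasing
</code></pre></div></div>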
<p>What is particularly interesting about this is that other methods are NOT functorial. For example, consider complete linkage,
in which the distance between clusters is given by the <em>largest</em> distance between a pair of points, one in each cluster, or average linkage,
in which it is the average distance between points in the two clusters. Carlsson-Memoli
show that these rules are NOT functorial, which in practice means that slight perturbations of the metric can yield very different dendrograms. Thus
single linkage has good technical properties EVEN THOUGH in practice it can take a long string of points in a line and put them in a single
cluster.</p>
<p>The last part of the C-M paper considers maps between different metric spaces and how they relate to the clustering. They use the <em>Gromov-Hausdorff</em>
distance to compare the original metric spaces and the associated ultrametric spaces, and prove that the single-linkage clustering functor does not increase
the GH distance. The most interesting part of this is the following beautiful result.</p>
<p><strong>Theorem:</strong> (C-M, Theorem 28) Let $Z$ be a compact metric space. Let $X$ and $X’$ be any two finite subsets of $Z$ and let $Y$ and $Y’$ be
the associated ultrametric spaces obtained by the single-linkage clustering functor. Then:</p>
<ul>
<li>The GH distance between $Y$ and $Y’$ is bounded above by $d_{H}^{Z}(X,Z)+d_{H}^{Z}(X’,Z)$ where $d_{H}^{Z}$ is the Hausdorff distance in $Z$.</li>
<li>Suppose that $Z$ decomposes into compact, disjoint, path-connected components $Z_{1},\ldots, Z_{m}$. Let $A$ be the finite metric space whose points
are the components $Z_{i}$ of $Z$ and whose distances are given by the Hausdorff distance between compact sets. Then if $d_{H}^{Z}(X,Z)<\mathrm{sep}(A,d_{A})/2$,
the GH-distance between $Y$ and $A$ with its ultrametric is at most $d_{H}^{Z}(X,Z)$.</li>
<li>If $X_{n}$ is a sequence of finite subsets of $Z$, with the induced metric, such that $d_{H}^{Z}(X_n,Z)\to 0$ as $n\to\infty$, then the GH distance
between $X_n$ with the single linkage ultrametric and $A$ with its ultrametric goes to zero.</li>
</ul>
<p>In other words, the clustering structure of finite subsets of a compact space captures, in the limit, the clustering structure of the components.</p>
<p>Question: does the fact that this all takes place in a compact space avoid the chaining phenomenon?</p>Jeremy TeitelbaumCarlsson and Memoli show that single-linkage hierarchical clustering has good properties (and other types do not)Kleinberg’s Impossibility Theorem for Clustering2019-09-05T00:00:00+00:002019-09-05T00:00:00+00:00https://jeremy9959.github.io/Blog/KleinbergsClusteringTheorem<p>In the paper <a href="https://www.cs.cornell.edu/home/kleinber/nips15.pdf">An Impossibility Theorem for Clustering</a>,
Jon Kleinberg introduces three simple properties that one might hope a clustering algorithm would satisfy,
and then proves that no algorithm can satisfy all three.</p>
<p>Suppose we are given a set $S$ with $n\ge 2$ points and a distance function $d: S\times S\to \mathbf{R}$ such that $d(i,j)=0$ only if $i=j$
and such that $d(i,j)=d(j,i)$ for all $(i,j)\in S\times S$. Note that we don’t assume, for example, that $d$ is a metric;
the theorem remains true even if we restrict to that class of distance functions.</p>
<p>Let $\mathcal{D}(S)$ denote the set of distance functions on $S$ and let $\Pi(S)$ be the set of partitions of $S$ into disjoint subsets.</p>
<p><strong>Definition:</strong> A clustering function is a function $f:\mathcal{D}(S)\to \Pi(S)$; given a distance function, $f$ returns
a partition of $S$ into disjoint clusters.</p>
<p>Kleinberg considers three properties one might expect of a clustering function.</p>
<ul>
<li><em>Scale Invariance</em>: This asserts that $f(d)=f(\alpha d)$ for all distance functions $d\in\mathcal{D}(S)$ and all real $\alpha>0$.</li>
<li><em>Richness</em>: Given a partition $\Gamma\in \Pi(S)$, there is a $d\in \mathcal{D}(S)$ so that $f(d)=\Gamma$. In other words, $f$ is surjective.</li>
<li><em>Consistency</em>: Suppose $d$ and $d’$ are two distance functions and let $\Gamma$ be a partition of $S$. We say that $d’$ is a $\Gamma$-transformation of $d$ if $d’(i,j)\le d(i,j)$ for all pairs $(i,j)$ belonging to the same cluster in $\Gamma$, while $d’(i,j)\ge d(i,j)$
for all pairs belonging to different clusters. Then $f$ is consistent if, whenever $d’$ is an $f(d)$-transformation of $d$,
we have $f(d’)=f(d)$.</li>
</ul>
<p>There are clustering functions that satisfy any two of the three
conditions. For concreteness, assume that the points of $S$ are the nodes of a
graph, connected by edges of weight $d(i,j)$. Each clustering
function below finds subgraphs of this graph by choosing a subset of the
edges according to a rule and takes the connected components as clusters; a sketch of all three rules in code follows the case analysis below.</p>
<ol>
<li>Fix $1<k<n$. Put the edges in order by non-decreasing weight and add edges to the subgraph until it has exactly $k$ connected
components, using lexicographic order to break ties. These components are the clusters. (This is single-linkage agglomerative clustering, stopped at $k$ clusters.)</li>
<li>Fix a distance $r$ and add all edges of weight at most $r$. The connected components are the clusters.</li>
<li>Fix $0<\alpha<1$ and let $R$ be the maximum value of $d$. Add all edges of weight at most $\alpha R$. The connected components are the clusters.</li>
</ol>
<p><strong>Proposition:</strong> Method 1 satisfies Scale-invariance and Consistency; Method 2 satisfies Scale-Invariance and Richness;
Method 3 satisfies Richness and Consistency.</p>
<p>In case <em>1</em>, scaling the lengths doesn’t affect their order, so the same components are constructed. To see consistency,
assume $\Gamma$ is the partition arising from $d$. If $d’$ is a $\Gamma$-transformation of $d$, it means that $d’(i,j)\le d(i,j)$
for all edges $i,j$ added to the subgraph, while $d’(i,j)\ge d(i,j)$ for all edges not yet in the subgraph. Since we used
lexicographic order to break ties, and that doesn’t depend on $d$ or $d’$, the two distance functions yield the same ordered list of edges and thus the same subgraph and the same clusters. Since the output always has exactly $k<n$ clusters, richness fails.</p>
<p>In case <em>2</em>, scale invariance fails because scaling all of the distances while keeping the threshold $r$ fixed changes which edges are added, and hence the clusters. For richness, given a target partition,
choose $d$ so that all edges within a cluster have weight smaller than $r$ and all edges that cross clusters have weight greater than $r$.
For consistency, suppose that $d$ gives rise to a particular partition and that $d’(i,j)\le d(i,j)$ within clusters and
$d’(i,j)\ge d(i,j)$ between clusters. Then every edge added for $d’$ still joins points in the same cluster as before (within-cluster distances only shrink, while cross-cluster distances stay above $r$), so the connected components, and hence the partition, are unchanged.</p>
<p>In case <em>3</em>, the rule is clearly scale invariant, and for a particular set $S$ you can choose a $d$ that is $1$ within the desired
clusters and equal to a fixed constant $c>1/\alpha$ between them; then the threshold $\alpha c$ exceeds $1$ but is less than $c$, so you get
exactly the desired clusters, and richness holds. To see that consistency fails, we need at least three points. Suppose that $d$ is the constant function
taking the value $1$. Then $R=1$, no edge has weight at most $\alpha<1$, and the associated clusters are the distinct points. Now choose just one pair of points and construct $d’$ with
the value $1/\alpha$ there while $d’=1$ everywhere else; since all pairs lie in different clusters, $d’\ge d$ is a legitimate transformation. However, the maximum is now $1/\alpha$, so the new threshold is $\alpha\cdot(1/\alpha)=1$ and the algorithm joins all points at distance at most $1$ – every pair except the chosen one. With at least three points this graph is connected, so all the points end up in a single cluster and $f(d’)\ne f(d)$.</p>
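<p>As promised, here is a minimal sketch of all three methods (my own illustrative code; <code class="language-plaintext highlighter-rouge">D</code> is a symmetric array of pairwise distances, and connected components are computed with a simple union-find):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def components(n, edges):
    """Connected components of range(n) given a list of edges (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def sorted_edges(D):
    """All edges in non-decreasing weight order; ties break lexicographically."""
    n = D.shape[0]
    return sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))

def method1(D, k):
    """Add edges in weight order until there are exactly k components."""
    n, chosen = D.shape[0], []
    for w, i, j in sorted_edges(D):
        if len(components(n, chosen)) == k:
            break
        chosen.append((i, j))
    return components(n, chosen)

def method2(D, r):
    """Add all edges of weight at most the fixed threshold r."""
    n = D.shape[0]
    return components(n, [(i, j) for w, i, j in sorted_edges(D) if w <= r])

def method3(D, alpha):
    """Add all edges of weight at most alpha times the maximum distance."""
    return method2(D, alpha * D.max())

D = np.array([[0., 1., 5.],
              [1., 0., 5.],
              [5., 5., 0.]])
print(method2(D, 2))      # [[0, 1], [2]]
print(method2(3 * D, 2))  # [[0], [1], [2]]: scale invariance fails for method 2
</code></pre></div></div>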
<p><strong>Theorem: (Kleinberg)</strong> For each $n\ge 2$ there is no clustering function $f$ satisfying Scale-Invariance, Richness, and Consistency.</p>
<p>The proof follows from this theorem.</p>
<p><strong>Theorem:</strong> If a clustering function $f$ satisfies scale-invariance and consistency, then the range of $f$ is an antichain – meaning
that no partition in the range is a proper refinement of another. Put another way, if $f(d)$ is
a refinement of $f(d’)$, then $f(d)=f(d’)$.</p>
<p>To prove this second theorem, suppose $f$ is a clustering function that satisfies consistency and scale invariance. Let $d’$
be a distance function, let $\Gamma’=f(d’)$, let $d$ be another distance function, and suppose that $\Gamma=f(d)$ is a refinement of $\Gamma’$.</p>
<p>Let $a’$ be the minimum distance among points in the same cluster of $\Gamma’$, and let $b’$ be the maximum distance
among points in different clusters of $\Gamma’$. Choose $a$ and $b$ similarly for $\Gamma$. Consistency tells us
that if $d^{\dagger}$ is a distance function that is less than $a’$ within clusters of $\Gamma’$, and greater than $b’$ between them,
then $f(d^{\dagger})=f(d’)=\Gamma’$; and similarly for $a$, $b$, $\Gamma$, and $d$.</p>
<p>Since $\Gamma$ is a refinement of $\Gamma’$, if $i,j$ are in different clusters of $\Gamma’$ they must be in different clusters of
$\Gamma$.</p>
<p>Choose $0<\epsilon<aa’/b$. Define a new distance function $d^{\dagger}$ with the following properties:</p>
<ul>
<li>$d^{\dagger}(i,j)=\epsilon$ if $i,j$ are in the same cluster of $\Gamma$</li>
<li>$d^{\dagger}(i,j) = a’$ if $i,j$ are in the same cluster of $\Gamma’$, but not of $\Gamma$.</li>
<li>$d^{\dagger}(i,j) = b’$ if $i,j$ are in different clusters of $\Gamma’$.</li>
</ul>
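<p>For concreteness, here is the construction in code (an illustrative sketch; partitions are represented as lists of blocks, and $\Gamma$ is assumed to refine $\Gamma’$):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def d_dagger(n, gamma, gamma_prime, eps, a_prime, b_prime):
    """The distance from the proof: eps within Gamma, a' within Gamma'
    but not within Gamma, b' across clusters of Gamma'."""
    def together(partition, i, j):
        return any(i in block and j in block for block in partition)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if together(gamma, i, j):
                D[i, j] = eps
            elif together(gamma_prime, i, j):
                D[i, j] = a_prime
            else:
                D[i, j] = b_prime
            D[j, i] = D[i, j]
    return D
</code></pre></div></div>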
<p>These properties make $d^{\dagger}$ an $f(d’)$-transformation of $d’$, so $f(d^{\dagger})=\Gamma’$ by consistency. On the other hand, let $\alpha = b/a’$.
Then $\alpha d^{\dagger}$ satisfies</p>
<ul>
<li>$\alpha d^{\dagger}(i,j) = (b/a’)\epsilon < a$ if $i,j$ are in the same cluster of $\Gamma$</li>
<li>$\alpha d^{\dagger}(i,j) = (b/a’)a’ = b$ if $i,j$ are in the same cluster of $\Gamma’$ but not of $\Gamma$.</li>
<li>$\alpha d^{\dagger}(i,j) = (b/a’)b’ \ge b$ if $i,j$ are in different clusters of $\Gamma’$ (since $b’\ge a’$).</li>
</ul>
<p>The upshot of this is that $\alpha d^{\dagger}$ gives the same partition as $d$ by consistency (the three displayed inequalities make it an $f(d)$-transformation of $d$), and the same partition as $d’$
by scale invariance (since $f(d^{\dagger})=\Gamma’$) – in other words, $\Gamma=\Gamma’$.</p>
<p>This proof does not try to preserve any additional properties of $d$, such as the triangle inequality; Kleinberg shows that
one can improve the inequalities and preserve such additional conditions without affecting the result.</p>
<p>To fully characterize possible partitions, Kleinberg proves that any antichain can arise via a clustering algorithm
satisfying all three conditions.</p>
<p><strong>Theorem:</strong> Given any antichain of partitions (that is, any set of partitions, none of which are refinements of another one),
there is a clustering function whose range is that antichain.</p>
<p>Fix an antichain $\mathcal{A}$ and consider the objective function
<script type="math/tex">\Phi_{d}(\Gamma) = \sum_{(i,j)\sim\Gamma} d(i,j)</script>
where $(i,j)\sim\Gamma$ means $i$ and $j$ are in the same subset of $\Gamma$. Let $\Gamma(d)$ be the partition <strong>in</strong> $\mathcal{A}$
that minimizes this objective function. Kleinberg shows that this clustering function is scale-invariant and consistent,
and has range $\mathcal{A}$. The proof shows how to define $d$ for a given partition $\Gamma$, and then shows that
the result is consistent. Note (as does Kleinberg) that the minimization must consider only partitions in $\mathcal{A}$ or
one will obtain the trivial partition.</p>
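<p>A toy version of this minimization (illustrative code; partitions are represented as lists of blocks):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations
import numpy as np

def antichain_clustering(D, antichain):
    """Return the partition in the antichain that minimizes
    Phi_d(Gamma) = sum of D[i, j] over pairs i, j in the same block."""
    def phi(partition):
        return sum(D[i, j]
                   for block in partition
                   for i, j in combinations(block, 2))
    return min(antichain, key=phi)

# the partitions of {0, 1, 2} into exactly two blocks form an antichain
antichain = [[[0, 1], [2]], [[0, 2], [1]], [[1, 2], [0]]]
D = np.array([[0., 1., 5.],
              [1., 0., 5.],
              [5., 5., 0.]])
print(antichain_clustering(D, antichain))  # [[0, 1], [2]]
</code></pre></div></div>Jeremy TeitelbaumKleinberg shows that it's impossible to find a clustering algorithm that satisfies three simple properties.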