Mathematical Aspects of Machine Learning
Machine learning problems fall into two general categories: supervised and unsupervised learning.
In supervised learning one starts with a large set of “known data” and then tries to fit new data into the pattern determined by what’s known. For example, classifying an image based on a large dataset of labelled images, or predicting a home price based on a database of homes and their selling prices, are supervised problems.
In unsupervised learning one seeks to uncover patterns in data ab initio. For one example, given RNA expression data for a large group of cells, one tries to group similar cells into types without knowing in advance how to characterize those types. For another, one might try to extract different “personality types” from the results of a questionnaire by identifying groups of people whose answers are similar to each other. One rather spectacular related example is teaching a system to play a game by having it develop strategies from scratch; this is usually classified separately, as reinforcement learning.
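To make the distinction concrete, here is a minimal sketch in Python with NumPy. The data is synthetic, and the function names `nearest_neighbor_predict` and `two_means` are made up for this illustration; neither is a recommended method for a project. The supervised function uses the known labels to classify a new point, while the unsupervised one sees only the points and has to find the two groups by itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two well-separated
# groups of 2-D points, 50 points each.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])
labels = np.array([0] * 50 + [1] * 50)  # "known data" for the supervised case


def nearest_neighbor_predict(train_x, train_y, query):
    """Supervised: copy the label of the closest labelled training point."""
    distances = np.linalg.norm(train_x - query, axis=1)
    return int(train_y[np.argmin(distances)])


def two_means(data, iterations=10):
    """Unsupervised: split the data into two clusters without any labels."""
    # Initialise the centers with the first point and the point farthest from it.
    first = data[0]
    far = data[np.argmax(np.linalg.norm(data - first, axis=1))]
    centers = np.array([first, far])
    for _ in range(iterations):
        # Assign every point to its nearest center, then recompute the centers.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assignment = np.argmin(dists, axis=1)
        centers = np.array([data[assignment == k].mean(axis=0) for k in range(2)])
    return assignment


predicted = nearest_neighbor_predict(points, labels, np.array([4.8, 5.1]))
clusters = two_means(points)
```

On data this cleanly separated, both approaches recover the same two groups. The point of the sketch is only that `two_means` never looks at `labels`.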
Data science and machine learning are a “hot topic,” and the web is filled not only with a very deep scientific literature and powerful open source software packages, but also with tutorials, blog posts, and entire courses on different approaches to machine learning.
The goal of this course is for each student (or, better, each of several small groups of students) to choose a machine learning problem and dataset and to complete a project that includes the following elements:
Depending on your team’s strengths and interests, you may emphasize different aspects of this project. For example:
if you are more interested in the problem of harvesting data from the web (such as images, text, or social relationships) then you might want to emphasize step 2 and use a simpler algorithm in steps 3-4 to show that the data you’ve harvested is interesting.
if you are more interested in the mathematical foundations of an algorithm, you could start with an algorithm and emphasize step 4, working carefully through the mathematics. Some of the algorithms worth investigating are:
Fundamentally, machine learning involves minimizing the value of a “loss function” by adjusting parameters. The most common approach to this is gradient descent. Another approach to optimization is via Monte Carlo methods. You could focus your project on optimization.
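As a rough illustration of what “minimizing a loss function by adjusting parameters” means, here is a small sketch in Python with NumPy. Everything in it is an assumption made for the example: the loss is mean squared error for a linear model, the data is synthetic, and `random_search` is only a crude stand-in for the much richer family of Monte Carlo methods.

```python
import numpy as np


def loss(w, X, y):
    """Mean squared error of the linear model X @ w against the targets y."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))


def gradient(w, X, y):
    """Gradient of the mean squared error with respect to the parameters w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def gradient_descent(X, y, learning_rate=0.1, steps=500):
    """Repeatedly step against the gradient, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - learning_rate * gradient(w, X, y)
    return w


def random_search(X, y, steps=2000, scale=0.1, seed=0):
    """Crude Monte Carlo alternative: propose random perturbations of w
    and keep a proposal only if it lowers the loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best = loss(w, X, y)
    for _ in range(steps):
        candidate = w + scale * rng.normal(size=w.shape)
        value = loss(candidate, X, y)
        if value < best:
            w, best = candidate, value
    return w


# Synthetic data (an assumption): y is roughly 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1.0, 1.0, size=100)])
true_w = np.array([1.0, 2.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

w_gd = gradient_descent(X, y)
w_mc = random_search(X, y)
print("gradient descent:", w_gd, "loss:", loss(w_gd, X, y))
print("random search:   ", w_mc, "loss:", loss(w_mc, X, y))
```

Gradient descent uses the derivative of the loss to choose each step, while the random search only evaluates the loss itself. That trade-off (gradient information versus generality) is one possible starting point for a project focused on optimization.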
I expect that the final project will be presented as a git repository with associated gh-pages documentation for the reports, but we can discuss alternatives to this.
I expect regular attendance at class meetings.
The course meetings will be devoted to progress reports and the sharing of suggestions between groups. Because we are trying to learn a huge subject in a short time, and because the participants have varying backgrounds, we all have something to contribute.
Each team should be prepared to show some element of their project at each meeting, whether to illustrate progress or to raise a question.
The Kaggle site hosts data science competitions with cash (and other) prizes. Many of these competitions are very sophisticated, and doing well in them requires both expertise and specialized techniques. Nevertheless, the Kaggle website is a great source of both datasets and ideas for projects, even if you aren’t actually competing.
If you are interested in doing one of these competitions, we can talk.
UCI Machine Learning Repository contains about 500 datasets on all sorts of things for use “by the machine learning community.”
The CIFAR datasets are large sets of labelled images. CIFAR-10 has 10 classes (like airplane, automobile, bird,…) and 6000 images per class. CIFAR-100 has 100 classes with 600 images each.
The 80 Million Tiny Images dataset has been used in a variety of interesting projects.
One can download Wikipedia. Here is a description of how to access Wikipedia programmatically.
Twitter has an API from which one can extract lots of information.
There is a lot of data on sports; for example, NFL Savant has free tables listing every play in the season.
The National Institutes of Health has a huge archive of data for those with the expertise to read it; see The Cancer Genome Atlas for example.
The Network Repository has a wide range of graphs/networks to study.
The following books are standard references. A project can be built on just one chapter (or section of a chapter!) of these books.
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
Pattern Recognition and Machine Learning by Christopher Bishop.
Probabilistic Graphical Models by Koller and Friedman.
It’s also worth knowing that one can access the entire O’Reilly Library through the UConn Library.