Math-5800-Spring-2020

Course Information

What is this course about?

Machine Learning problems fall into two general categories: supervised and unsupervised learning.

In supervised learning one starts with a large set of “known data” and then tries to fit new data into the pattern determined by what’s known. For example, classifying an image based on a large dataset of labelled images, or predicting a home price based on a database of homes and their selling prices, are supervised problems.
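As a rough illustration of the supervised setting, the sketch below classifies a new point by majority vote among its nearest labelled neighbors. The data, the labels, and the function name `predict` are all made up for this example; nearest-neighbor voting is only one of many supervised methods.

```python
import math

# Made-up labelled training data: (feature vector, label).
# The two features and the labels are purely illustrative.
TRAINING = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((1.1, 0.9), "cat"),
    ((3.0, 3.2), "dog"),
    ((3.1, 2.9), "dog"),
    ((2.8, 3.0), "dog"),
]


def predict(point, training=TRAINING, k=3):
    """Label `point` by majority vote among its k nearest labelled neighbors."""
    # Sort the known examples by Euclidean distance to the new point.
    by_distance = sorted(training, key=lambda ex: math.dist(ex[0], point))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Return the most common label among the k nearest examples.
    return max(set(nearest_labels), key=nearest_labels.count)


if __name__ == "__main__":
    print(predict((1.0, 1.0)))
    print(predict((3.0, 3.0)))
```

The "known data" here is the list `TRAINING`; the new point is fit into the pattern by looking at which labelled examples it most resembles.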

In unsupervised learning one seeks to uncover patterns in data ab initio. For one example, given RNA expression data for a large group of cells, one tries to identify different types of cells that are similar to each other without knowing in advance how to characterize cell types. For another, one might try to extract different “personality types” from the results of a questionnaire by identifying groups of people whose answers are similar to each other. One rather spectacular related area, usually classified separately as reinforcement learning, is teaching a system to play a game by having it develop strategies from scratch.
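To make the idea of finding groups without labels concrete, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. The points, the choice of two clusters, and the starting centroids are assumptions made for illustration; real projects would typically use a library implementation and a more careful initialization.

```python
import math

# Made-up 2-D points with no labels; two loose groups by construction.
POINTS = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (4.0, 4.1), (4.2, 3.9), (3.9, 4.0)]


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assigning points and recomputing centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters


if __name__ == "__main__":
    # Starting centroids are an arbitrary choice: one point from each end of the list.
    final_centroids, final_clusters = k_means(POINTS, [POINTS[0], POINTS[3]])
    print(final_centroids)
    print(final_clusters)
```

Nothing in the input says which points belong together; the grouping emerges only from the distances between them.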

Data science and machine learning are a “hot topic,” and the web is filled not only with a very deep scientific literature and powerful open source software packages, but also with tutorials, blog posts, and entire courses on different approaches to machine learning.

Course Overview

The goal of this course is for each student (or, better, each of several small groups of students) to choose a machine learning problem and dataset and to complete a project that includes the following elements:

  1. An overview of the problem and how it fits into the general machine learning landscape.
  2. Identification of a relevant dataset and a presentation of the notable features of the selected dataset.
  3. A choice of approach with justification.
  4. An overview of the mathematical foundations of the approach. This overview should include a more detailed mathematical explanation of at least one aspect of the technique being employed.
  5. A fully documented implementation of the approach using available software tools.
  6. A report on the conclusions of the project.

Depending on your team’s strengths and interests, you may emphasize different aspects of this project. For example:

Fundamentally, machine learning involves minimizing the value of a “loss function” by adjusting parameters. The most common approach to this is gradient descent. Another approach to optimization is via Monte Carlo methods. You could focus your project on optimization.
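The following sketch shows both ideas on a toy problem: fitting a line y = w*x + b by minimizing a mean squared error loss. The data, step counts, learning rate, and function names are assumptions chosen for illustration. The second routine is a basic random search, which is only one simple member of the Monte Carlo family, not a definitive implementation of such methods.

```python
import random

# Made-up data lying near the line y = 2x + 1 (illustrative only).
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [1.1, 2.9, 5.2, 7.1, 8.8]


def loss(w, b):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def gradient_descent(steps=2000, learning_rate=0.05):
    """Repeatedly step (w, b) against the gradient of the loss."""
    w, b = 0.0, 0.0
    n = len(XS)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(XS, YS)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(XS, YS)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def random_search(steps=5000, scale=0.1, seed=0):
    """Propose random perturbations of (w, b) and keep those that lower the loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best = loss(w, b)
    for _ in range(steps):
        cand_w = w + rng.gauss(0.0, scale)
        cand_b = b + rng.gauss(0.0, scale)
        cand = loss(cand_w, cand_b)
        if cand < best:
            w, b, best = cand_w, cand_b, cand
    return w, b


if __name__ == "__main__":
    print("gradient descent:", gradient_descent())
    print("random search:   ", random_search())
```

Gradient descent uses derivative information and converges quickly on a smooth loss like this one; the random search needs no derivatives, which can matter when the loss is not differentiable or is expensive to differentiate.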

I expect that the final project will be presented as a git repository with associated gh-pages documentation for the reports, but we can discuss alternatives to this.

Course meetings

I expect regular attendance at class meetings.

The course meetings will be devoted to progress reports and the sharing of suggestions between groups. Because we are trying to learn a huge subject in a short time, and because the participants have varying backgrounds, we all have something to contribute.

Each team should be prepared to show some element of their project at each meeting, whether to illustrate progress or to raise a question.

Where to look for data

Kaggle

The Kaggle site hosts data science competitions with cash (and other) prizes. Many of these competitions are very sophisticated, and doing well in them requires expertise and specialized techniques. Nevertheless, the Kaggle website is a great source of both datasets and ideas for projects, even if you aren’t actually competing.

If you are interested in doing one of these competitions, we can talk.

Other Data Sources

References

The following books are standard references. A project can be built on just one chapter (or section of a chapter!) of these books.

It’s also worth knowing that one can access the entire O’Reilly Library through the UConn Library.