Goals for January 27 - February 3 | Math-5800-Spring-2020

Goals for January 27 - February 3

To summarize, the goals for this week are:

to form some preliminary working groups
for each participant to create a github site
to create an initial list of resources and references and document them on that site

There is also a reading assignment:

What is data science? Chapter 1 of Doing Data Science by O’Neil and Schutt. UConn NetID required – Available through the UConn Library.

Form some preliminary working groups

As we learned last week, people in the class have diverse interests. Reading over my notes I can identify a few topic area themes that people are interested in. These included:

using satellite image data to identify world hot spots
using medical imaging to diagnose cancer
using machine learning to identify art forgeries
teaching computers to win at games
finding signals in the genome that control transcription of different proteins
using machine learning to translate EEG data to motion control of devices such as wheelchairs
integrating information from the news with financial data to make predictions

As the skills and interests whiteboards show, we have a mixture of expertise, with some relative newcomers to machine learning in the course and some people with a lot of math background.

The first goal for this week is to form preliminary working groups. We’ve all had some opportunity to hear from other class members about their interests. I suggest that these initial groups be organized around broad application areas – so that, for instance, several people interested in working with images can form a group, even if they have different particular applications in mind.

These groups are preliminary; you’re not committing yourself to a joint project for the semester. I recommend shooting for 3-4 people. I’m not much for social engineering, so if you’d prefer to work alone, that’s ok. It is a fact of life, however, that industry and academic work in data science/machine learning is a team sport.

Bottom line is that by Wednesday, January 28 I’d like to know who is working together to start.

Create GitHub sites

GitHub is a website (really a cloud services provider) that was created to support large open-source software development projects. Recently acquired by Microsoft, it is a key source for making and sharing software and documentation.

My github profile is here. It consists of some profile information plus a bunch of repositories, each of which contains code or documents related to a project I am (or was) working on (in widely varying states of completion).

One of the repositories is the source code for the website for this course and I use GitHub to generate the jeremy9959.net page automatically. Making web pages using GitHub is pretty easy but we’ll save that for later.

The backend of the GitHub site is the software tool called git. Git is a beautiful piece of software that allows you to track versions of your work, undo changes, make experiments without messing up stuff that works, and collaborate with others. It is just as useful for working on shared documents (like joint papers) as it is for software.

To elaborate a little, you would use git to manage the files in a directory on your computer, to keep track of changes and make checkpoints so you can get back to a working state if you mess up your project. With GitHub you can reproduce your directory in the cloud, share it with others, and also have a backup in case something happens to your local machine. There are other cloud providers that do things like GitHub, such as GitLab, but GitHub is the biggest and best known.
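To make the idea of checkpoints concrete, here is a minimal sketch that drives git from Python in a throwaway directory. The file name, commit messages, and demo identity are made up for illustration; the git subcommands themselves (init, add, commit, checkout, log) are the standard ones you would type at the command line. Sending the result to GitHub (git remote add, then git push) is left out because it needs an account and an existing repository.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return what it printed."""
    # The -c options supply a throwaway identity (demo values only) so that
    # committing works even where git has not been configured yet.
    cmd = ["git", "-c", "user.name=Demo Student",
           "-c", "user.email=student@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def checkpoint_demo():
    """Make two checkpoints in a scratch repository, then undo a bad edit.

    Returns (commit subjects newest first, restored file text),
    or None if git is not installed on this machine.
    """
    if shutil.which("git") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        readme = repo / "README.md"  # hypothetical project file

        git(repo, "init")  # start tracking this directory

        # First checkpoint.
        readme.write_text("# My course project\n")
        git(repo, "add", "README.md")
        git(repo, "commit", "-m", "Start the README")

        # Second checkpoint.
        readme.write_text("# My course project\n\n## References\n")
        git(repo, "add", "README.md")
        git(repo, "commit", "-m", "Add a references section")

        # Mess the file up, then go back to the last checkpoint.
        readme.write_text("oops, I broke it\n")
        git(repo, "checkout", "--", "README.md")

        subjects = git(repo, "log", "--format=%s").splitlines()
        return subjects, readme.read_text()


if __name__ == "__main__":
    print(checkpoint_demo())
```

In day-to-day use you would type the same commands directly in a terminal (git-bash on Windows) rather than going through Python; the wrapper is only there to keep the example self-contained.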

Knowing the basics of how to use git and GitHub is a baseline skill for any scientist – or anyone wanting to work in a technical field in industry. Therefore:

The second goal for this week is for everyone in the class to have a GitHub site and a repository associated with their project for this course.

To install Git on a Mac or Linux machine, you can use the downloads from the git site. For Windows I recommend using the gitforwindows version, which also installs a command line shell (git-bash) that works well with git.
Alternatively, you can install GitHub Desktop, which gives you a git-shell that also works well on Windows.

For some of you this may take 15 minutes, but if you are new to Git and GitHub, here are some references:

Git-it is an app that walks you through the basics. It’s fun and worth a try.
Git Hello World walks you through making a repository on the GitHub site without using your local computer.
Git Handbook is an overview of Git with some discussion of how it relates to GitHub.
This Udacity walkthrough looks like a good introduction, if you can stand the upselling.
Ask for help from someone who knows what they are doing (a classmate, a friend, or, if all else fails, me!)

Collect Examples and References

The final goal for this week is to begin a library of examples and references related to the general area your group is interested in, and to document that library in your github site in the README file.

You can see my example of this for my (hypothetical) twitter project in my demo github repo.

This is essentially a google/library research project. A reasonable goal is 6-10 references of at least three different types. You should provide a brief indication of what the reference contains. These links are supposed to be useful to you so don’t just blindly copy them.

The types I have in mind include:

Books. These may be old-fashioned but they are often definitive, detailed accounts of exactly what you need to know. Here are three particularly useful sources:
- Pattern Recognition and Machine Learning, by Christopher Bishop, while 10 years old or so, has a detailed account of many different approaches to machine learning problems with lots of mathematical background, as well as problems to try.
- [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) by Hastie, Tibshirani, and Friedman is an authoritative introduction to machine learning from the perspective of applied statistics.
- The O’Reilly Library contains many detailed and practical introductions to implementations of machine learning algorithms and pretty much anything else in the computer world. Its entire content is freely available through the UConn library. Go to the O’Reilly site and log in using “SSO” and your UConn netid.
Published scholarly articles from journals or conference proceedings. You can identify them because they have references to the journal or conference. They are typically quite condensed and rather technical, but they sometimes give details not available elsewhere. It’s very important to consider the place the article was published or the conference where the work was presented in deciding whether the work is likely to be useful - you can get anything published somewhere, especially in this field.

A premier conference in neural networks is NIPS (Neural Information Processing Systems) and a lot of breakthroughs appear in its proceedings. One such is the 2012 paper by Krizhevsky, Sutskever, and Hinton called ImageNet Classification with Deep Convolutional Neural Networks, which demonstrated the feasibility of training huge neural networks to do image classification on a massive scale. This paper has 55500 citations on google scholar, or roughly 7000 per year since publication. The Journal of Machine Learning Research is a premier journal in this field; it published, for example, the breakthrough paper Visualizing Data using t-SNE, which describes the now widely used t-SNE algorithm for visualizing high-dimensional data. It ‘only’ has 11500 citations since 2008.
Arxiv preprints. The Arxiv is a preprint server in the sciences – it’s the same resource used by mathematicians. It’s unrefereed, so it has the advantage that you get quick access to results and the disadvantage that it has no barrier to entry. A search for neural networks revealed something like 75 hits published in a two-day period last week, so it’s a flood of stuff.
Blog posts, tutorials, and self-published articles. Two big aggregators that come up in these searches are Medium and TowardsDataScience. The articles that appear here are sometimes very useful, and include very practical, step by step instructions on how to do things. For example, I found this article on the visualization tool bokeh very helpful and clearly written. But beware, because anyone can publish in these settings.
Tutorials from software packages. These will walk you through how to do basic things in a particular language so you can learn to use that package. Useful for the intended purpose.
Coursera, Udemy, Udacity, and other MOOC providers. Can be very well done and systematic, but can be VERY expensive and sometimes rather general.
Course notes from other university courses. These can be very useful and give you all the detailed stuff you wish I was giving you. For example, the famous Stanford CS229 course website has examples of projects done by students in that class going back years.
Software, usually on GitHub. For example, I mentioned that I was interested in scraping data from twitter. I googled “scrape twitter data python” and found this blog post: How to Scrape Tweets from Twitter. That blog post referred me to the software package tweepy. From there I could read the tweepy documentation AND find the tweepy github site where I can look at all the source code.
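As a rough illustration of the goal stated above (at least 6 references, of at least three different types, each with a brief description), here is a small sketch that checks a collection against that goal and renders it as a section you could paste into a README. The class and function names, the type labels, and the Markdown layout are my own assumptions, not a required format; the sample entries are sources mentioned on this page.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    title: str
    kind: str  # type of source, e.g. "book", "article", "software"
    note: str  # brief indication of what the reference contains


def check_goal(refs):
    """Return a list of ways the collection falls short of this week's goal."""
    problems = []
    if len(refs) < 6:
        problems.append(f"have {len(refs)} references, aiming for at least 6")
    kinds = {r.kind for r in refs}
    if len(kinds) < 3:
        problems.append(f"only {len(kinds)} type(s) of reference, want at least 3")
    for r in refs:
        if not r.note.strip():
            problems.append(f"no description for: {r.title}")
    return problems


def to_readme(refs):
    """Render the references as a Markdown section, grouped by type."""
    lines = ["## References", ""]
    for kind in sorted({r.kind for r in refs}):
        lines.append(f"### {kind.capitalize()}")
        for r in refs:
            if r.kind == kind:
                lines.append(f"- {r.title}: {r.note}")
        lines.append("")
    return "\n".join(lines)


# Sample entries taken from the sources mentioned on this page.
my_refs = [
    Reference("Pattern Recognition and Machine Learning", "book",
              "many approaches to ML with mathematical background and problems"),
    Reference("The Elements of Statistical Learning", "book",
              "ML from the perspective of applied statistics"),
    Reference("ImageNet Classification with Deep Convolutional Neural Networks",
              "article", "training huge neural networks for image classification"),
    Reference("How to Scrape Tweets from Twitter", "blog post",
              "walkthrough that pointed me to tweepy"),
    Reference("Stanford CS229 course website", "course notes",
              "examples of student projects going back years"),
    Reference("tweepy", "software",
              "Python package for Twitter data, with documentation and source"),
]

if __name__ == "__main__":
    print(check_goal(my_refs) or "goal met")
    print(to_readme(my_refs))
```

Writing the descriptions yourself is the point of the exercise; a script like this only helps keep the list organized.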

Where should I end up?

Here is my demo github repository for my hypothetical twitter project.