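To summarize, the goals for this week are to form preliminary working groups, to set up a GitHub account with a repository for your project, and to begin a library of examples and references for your group's area of interest.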
There is also a reading assignment:
What is data science? Chapter 1 of Doing Data Science by O’Neil and Schutt. UConn NetID required – available through the UConn Library.
As we learned last week, people in the class have diverse interests. Reading over my notes I can identify a few topic area themes that people are interested in. These included:
As the skills and interests whiteboards show, we have a mixture of expertise in the course: some people have a lot of math background, and some are relative newcomers to machine learning.
The first goal for this week is to form preliminary working groups. We’ve all had some opportunity to hear from other class members about their interests. I suggest that these initial groups be organized around broad application areas – so that, for instance, several people interested in working with images can form a group, even if they have different particular applications in mind.
These groups are preliminary; you’re not committing yourself to a joint project for the semester. I recommend shooting for 3-4 people. I’m not much for social engineering, so if you’d prefer to work alone, that’s ok. It is a fact of life, however, that industry and academic work in data science/machine learning is a team sport.
Bottom line is that by Wednesday, January 28 I’d like to know who is working together to start.
GitHub is a website (really a cloud services provider) that was created to support large open-source software development projects. Recently acquired by Microsoft, it is a key source for making and sharing software and documentation.
My GitHub profile is here. It consists of some profile information plus a bunch of repositories, each of which contains code or documents related to a project I am (or was) working on (in widely varying states of completion).
One of the repositories is the source code for the website for this course and I use GitHub to generate the jeremy9959.net page automatically. Making web pages using GitHub is pretty easy but we’ll save that for later.
The backend of the GitHub site is the software tool called git. Git is a beautiful piece of software that allows you to track versions of your work, undo changes, make experiments without messing up stuff that works, and collaborate with others. It is just as useful for working on shared documents (like joint papers) as it is for software.
To elaborate a little, you would use git to manage the files in a directory on your computer, to keep track of changes, and to make checkpoints so you can get back to a working state if you mess up your project. With GitHub you can reproduce your directory in the cloud, share it with others, and also have a backup in case something happens to your local machine. There are other cloud providers that do things like GitHub, such as GitLab, but GitHub is the biggest and best known.
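As a minimal sketch of the checkpoint workflow described above, the following commands create a scratch directory, record one checkpoint, break a file, and then restore it. The scratch directory, the file name notes.txt, and the name and email address are placeholders chosen for illustration; they are not part of the course setup.

```shell
# Work in a throwaway directory so nothing else on your machine is touched.
demo="$(mktemp -d)"
cd "$demo"

git init -q                                   # start tracking this directory
git config user.name "Demo Student"           # placeholder identity, stored only in this repo
git config user.email "student@example.com"

echo "first draft" > notes.txt
git add notes.txt                             # stage the file for the next checkpoint
git commit -q -m "Add first draft of notes"   # record the checkpoint (a "commit")

echo "an experiment that went wrong" > notes.txt   # mess up the file
git status --short                            # lists notes.txt as modified
git checkout -- notes.txt                     # discard the change, back to the checkpoint

cat notes.txt                                 # prints: first draft
git log --oneline                             # one line per checkpoint
```

The same add/commit cycle works for any kind of file, which is why git is just as handy for a joint paper as for code.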
Knowing the basics of how to use git and GitHub is a baseline skill for any scientist – or anyone wanting to work in a technical field in industry. Therefore:
The second goal for this week is for everyone in the class to have a GitHub site and a repository associated with their project for this course.
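For those new to the tools, here is a rough sketch of what connecting a local repository to a remote copy looks like. To keep the example runnable offline, a local "bare" repository stands in for GitHub; with a real GitHub repository you would use the URL shown on the repository's page in place of that path. The directory names, the README contents, and the identity are all made up for the example.

```shell
work="$(mktemp -d)"
remote="$work/remote.git"
git init -q --bare "$remote"          # local stand-in for a repository hosted on GitHub

mkdir "$work/project"
cd "$work/project"
git init -q
git config user.name "Demo Student"   # placeholder identity, stored only in this repo
git config user.email "student@example.com"

echo "# Course project" > README.md
git add README.md
git commit -q -m "Add README"

git branch -M main                    # make sure the branch is called main
git remote add origin "$remote"       # with GitHub, this would be the repository URL
git push -q -u origin main            # upload the commits to the remote

# A collaborator (or you, on another machine) gets a copy by cloning.
git clone -q --branch main "$remote" "$work/copy"
cat "$work/copy/README.md"            # prints: # Course project
```

After the first push, later checkpoints are sent with a plain git push, and collaborators pick them up with git pull.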
To install Git on a Mac or Linux machine, you can use the downloads from the git site.
For Windows I recommend using the gitforwindows version, which also installs a command line shell (git-bash) that works well with git.
Alternatively, you can install GitHub Desktop, which gives you a git shell that also works well on Windows.
For some of you this may take 15 minutes, but if you are new to Git and GitHub, here are some references:
The final goal for this week is to begin a library of examples and references related to the general area your group is interested in, and to document that library in the README file of your GitHub repository.
You can see my example of this for my (hypothetical) Twitter project in my demo GitHub repo.
This is essentially a Google/library research project. A reasonable goal is 6-10 references of at least three different types. For each reference, provide a brief indication of what it contains. These links are supposed to be useful to you, so don’t just blindly copy them.
The types I have in mind include:
Published scholarly articles from journals or conference proceedings. You can identify them because they have references to the journal or conference. They are typically quite condensed and rather technical, but they sometimes give details not available elsewhere. It’s very important to consider the place the article was published or the conference where the work was presented in deciding whether the work is likely to be useful – you can get anything published somewhere, especially in this field.
A premier conference in neural networks is NIPS (Neural Information Processing Systems), and a lot of breakthroughs appear in its proceedings. One such is the 2012 paper by Krizhevsky, Sutskever, and Hinton called ImageNet Classification with Deep Convolutional Neural Networks, which demonstrated the feasibility of training huge neural networks to do image classification on a massive scale. This paper has 55500 citations on Google Scholar, or roughly 7000 per year since publication. The Journal of Machine Learning Research is a premier journal in this field; it published, for example, the breakthrough paper Visualizing Data using t-SNE, which describes the now widely used t-SNE algorithm for visualizing high-dimensional data. It ‘only’ has 11500 citations since 2008.
arXiv preprints. The arXiv is a preprint server in the sciences – it’s the same resource used by mathematicians. It’s unrefereed, so it has the advantage that you get quick access to results and the disadvantage that it has no barrier to entry. A search for neural networks revealed something like 75 hits published in a two-day period last week, so it’s a flood of stuff.
Blog posts, tutorials, and self-published articles. Two big aggregators that come up in these searches are Medium and TowardsDataScience. The articles that appear here are sometimes very useful, and include very practical, step-by-step instructions on how to do things. For example, I found this article on the visualization tool bokeh very helpful and clearly written. But beware, because anyone can publish in these settings.
Tutorials from software packages. These will walk you through how to do basic things in a particular language so you can learn to use that package. Useful for the intended purpose.
Coursera, Udemy, Udacity, and other MOOC providers. Can be very well done and systematic, but can be VERY expensive and sometimes rather general.
Course notes from other university courses. These can be very useful and give you all the detailed stuff you wish I was giving you. For example, the famous Stanford CS229 course website has examples of projects done by students in that class going back years.