What You Don't Need to Do Deep Learning

Myth (don't need) → Truth
  • Lots of math → Just high school math is sufficient
  • Lots of data → We've seen record-breaking results with <50 items of data
  • Lots of expensive computers → You can get what you need for state-of-the-art work for free

Many world-class research projects have come out of fastai students working with a single GPU, small datasets, or no traditional academic background.

Where Is Deep Learning the Best-Known Approach?

Neural Networks: A Brief History

The Start

In 1943, Warren McCulloch and Walter Pitts developed a mathematical model of an artificial neuron.

In 1957, Frank Rosenblatt built the first device that actually used these principles, the Mark I Perceptron at Cornell.

We are now about to witness the birth of such a machine -- a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control.

-- Frank Rosenblatt

The First AI Winter

In 1969, Marvin Minsky and Seymour Papert wrote a book called Perceptrons, in which they pointed out that a single layer of these devices cannot learn some simple but critical functions (such as XOR), and that using multiple layers would address these limitations. Unfortunately, only the first of these insights was widely recognized.

The Second Winter

In theory, adding just one extra layer of neurons was enough to allow any mathematical function to be approximated with these neural networks, but in practice such networks were often too big and too slow to be useful. Although researchers showed 30 years ago that to get good practical performance you need to use even more layers of neurons, it is only in the last decade that this principle has been more widely appreciated and applied.

The fast.ai Learning Philosophy

Projects and Mindset

It helps to focus on your hobbies and passions -- setting yourself four or five little projects rather than striving to solve a big, grand problem tends to work better when you're getting started.

Questionnaire

After every chapter there is a questionnaire instead of a summary of key points. It doesn't matter how many you get right; it just confirms that you haven't missed anything important. If you still don't understand something after spending some time on it, just continue and come back after a few chapters.

The Software: PyTorch, fastai, and Jupyter

fastai is built on top of PyTorch; both are written in Python, which is the language we will use throughout this course. Many people think that fastai is just for beginners and teachers, but it actually uses a layered API, which makes it highly customizable and practical for almost any purpose.

Jupyter Notebook is a coding environment widely used by deep learning practitioners. It's easier to experiment in a Jupyter notebook than by running Python code in a terminal. Linux is highly recommended.

from fastai.vision.all import *

# Download and extract the Oxford-IIIT Pet dataset, and point at its images folder
path = untar_data(URLs.PETS)/'images'

# In this dataset, cat images have filenames that start with an uppercase letter
def is_cat(x): return x[0].isupper()

# Build the DataLoaders: label each file with is_cat, hold out 20% for validation,
# and resize every image to a 224-pixel square
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

# Create a learner from a pretrained ResNet-34 and fine-tune it for one epoch
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
epoch  train_loss  valid_loss  error_rate  time
0      0.160716    0.019510    0.006089    00:50

epoch  train_loss  valid_loss  error_rate  time
0      0.071681    0.011917    0.004060    00:55

The code above trains a deep learning model that classifies cat and dog images. Notice that the whole thing took less than two minutes.
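To actually use the trained model on a new image, you can call learn.predict (a sketch using the fastai API; the filename here is hypothetical):

img = PILImage.create('my_cat_photo.jpg')   # hypothetical image file
pred, _, probs = learn.predict(img)         # returns label, label index, and probabilities
print(f"Is this a cat?: {pred}")
print(f"Probability it's a cat: {probs[1].item():.6f}")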

What Is Machine Learning?

The training of programs developed by allowing a computer to learn from its experience, rather than through manually coding the individual steps.

In 1949, an IBM researcher named Arthur Samuel started working on machine learning.

By 1961 his checkers-playing program beat the Connecticut state champion.

In his classic 1962 essay "Artificial Intelligence: A Frontier of Automation", he wrote:

Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance.

We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would "learn" from its experience.

Using a model after it's trained looks like running an ordinary program: inputs go in, and the model (the architecture plus its learned weights) produces predictions.

Limitations Inherent To Machine Learning
  • A model cannot be created without data.
  • A model can only learn to operate on the patterns seen in the input data used to train it.
  • This learning approach only creates predictions, not recommended actions.
  • It's not enough to just have examples of input data; we need labels for that data too.
Neural Network
  • The universal approximation theorem shows that, in theory, a neural network can approximate any function to any level of accuracy (see the sketch below).
  • In practice, they are often a suitable kind of model, and you can focus your effort on the process of training them—that is, of finding good weight assignments.
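To make that first bullet concrete, here is the kind of function the theorem talks about: a single hidden layer of simple nonlinear units, written in plain Python with hand-picked (hypothetical) weights. With enough hidden units and the right weights, sums like this can approximate any well-behaved function; training is the process of finding those weights.

def relu(x): return max(0.0, x)

def tiny_net(x, w1, b1, w2):
    # hidden layer: each unit applies a weight, a bias, and the ReLU nonlinearity
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    # output layer: a weighted sum of the hidden activations
    return sum(wo * h for wo, h in zip(w2, hidden))

# Two hidden units are enough to represent abs(x): relu(x) + relu(-x)
print(tiny_net(-3.0, [1.0, -1.0], [0.0, 0.0], [1.0, 1.0]))   # prints 3.0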
Stochastic Gradient Descent (SGD)
  • One could imagine that you might need to find a new "mechanism" for automatically updating weights for every problem.
  • In fact, stochastic gradient descent (SGD) is a completely general way to update the weights of a neural network so that it improves at any given task (a toy sketch follows the mapping below).

  • "the effectiveness of any current weight assignment" → loss
  • "weight assignment" → the parameters (weights) of the neural network
  • "performance" → predictions
  • "mechanism for altering the weight assignment" → stochastic gradient descent (SGD)
  • "maximize the performance" → minimize the loss
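To make Samuel's idea concrete, here is a toy version in plain Python (a minimal sketch with a made-up one-parameter loss, not fastai code): the loss is (w - 3)^2, its gradient is 2*(w - 3), and repeatedly stepping against the gradient is the "entirely automatic" mechanism.

def loss(w): return (w - 3) ** 2      # effectiveness of the current weight assignment
def grad(w): return 2 * (w - 3)       # direction in which the loss increases

w = 0.0     # initial weight assignment
lr = 0.1    # learning rate: size of each automatic adjustment
for step in range(50):
    w -= lr * grad(w)                 # SGD update: alter the weight to reduce the loss
print(w, loss(w))                     # w is now close to 3, and the loss close to 0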

How Our Image Recognizer Works

from fastai.vision.all import *

Normally people import only the parts they need, but this statement imports everything. The fastai library is designed so that import * gives you only the pieces you actually need.
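For comparison, the names this section actually relies on could be imported explicitly (a sketch; this list just mirrors the code used in this walkthrough):

from fastai.vision.all import (
    untar_data, URLs, get_image_files, ImageDataLoaders,
    Resize, cnn_learner, resnet34, error_rate)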

path = untar_data(URLs.PETS)/'images'

This downloads the dataset to your computer (skipping the download if it's already there) and returns the path to the extracted images.
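You can check what arrived (a small aside; fastai patches pathlib.Path with an .ls() method):

print(path)           # where the data was extracted (location varies by machine)
print(path.ls()[:3])  # a few of the downloaded image files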

def is_cat(x): return x[0].isupper()

This checks whether the first character of the filename is uppercase, because that is how cats are labeled in this dataset (cat breeds are capitalized, dog breeds are not).
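For example (hypothetical filenames that follow the dataset's convention):

print(is_cat('Birman_123.jpg'))              # True: cat breeds are capitalized
print(is_cat('german_shorthaired_45.jpg'))   # False: dog breeds are lowercase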

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

This creates a DataLoaders object, which packages the data into the training and validation batches the model will consume:

  • from_name_func: labels can be extracted using a function applied to the filename
  • valid_pct=0.2: randomly hold out 20% of the data as validation set; the remaining 80% is called the training set
  • seed=42: sets the random seed so the same validation split is produced on every run; any difference in results is then due to changes to the model, not to a different split
  • item_tfms=Resize(224): each item is resized to a 224-pixel square

Why 224 pixels? This is the standard size for historical reasons (old pretrained models require this size exactly), but you can pass pretty much anything. If you increase the size, you'll often get a model with better results (since it will be able to focus on more details), but at the price of speed and memory consumption; the opposite is true if you decrease the size.
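If you wanted to trade speed for potentially better accuracy, you could pass a larger size (a hypothetical variant of the call above):

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(448))   # larger images: slower, more memory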

learn = cnn_learner(dls, resnet34, metrics=error_rate)

The Learner is the object that does the learning.

  • It takes the data loaders, the architecture (in this case resnet34), and the metric a human uses to evaluate the model.
  • cnn_learner also has a parameter pretrained, which defaults to True.

Using pretrained models is the most important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money.

The 34 in resnet34 refers to the number of layers in this variant of the architecture (other options are 18, 50, 101, and 152). Models using architectures with more layers take longer to train, and are more prone to overfitting (i.e. you can't train them for as many epochs before the accuracy on the validation set starts getting worse). On the other hand, when using more data, they can be quite a bit more accurate.
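For instance, a quick experiment might start with the shallowest variant instead (a hypothetical swap; resnet18 comes in with the same import):

learn = cnn_learner(dls, resnet18, metrics=error_rate)   # fewer layers: faster, less prone to overfitting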

learn.fine_tune(1)
  1. Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset.
  2. Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we'll see, generally don't require many changes from the pretrained weights). A simplified sketch of these two steps follows.
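Roughly, those two steps correspond to something like this (a simplified sketch, not fastai's exact implementation; the real method also chooses per-layer learning rates):

learn.freeze()           # step 1: only the new random head's weights can change
learn.fit_one_cycle(1)   # one epoch to fit the head
learn.unfreeze()         # step 2: allow all layers to be updated
learn.fit_one_cycle(1)   # the requested epochs (fastai gives earlier layers smaller learning rates)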

Jargon Recap

Label: The data that we're trying to predict, such as "dog" or "cat"
Architecture: The template of the model that we're trying to fit; the actual mathematical function that we're passing the input data and parameters to
Model: The combination of the architecture with a particular set of parameters
Parameters: The values in the model that change what task it can do, and are updated through model training
Fit: Update the parameters of the model such that the predictions of the model using the input data match the target labels
Train: A synonym for fit
Pretrained model: A model that has already been trained, generally using a large dataset, and will be fine-tuned
Fine-tune: Update a pretrained model for a different task
Transfer learning: Using a pretrained model for a task different to what it was originally trained for
Epoch: One complete pass through the input data
Loss: A measure of how good the model is, chosen to drive training via SGD
Metric: A measurement of how good the model is, using the validation set, chosen for human consumption
Validation set: A set of data held out from training, used only for measuring how good the model is
Training set: The data used for fitting the model; does not include any data from the validation set
Overfitting: Training a model in such a way that it remembers specific features of the input data, rather than generalizing well to data not seen during training
CNN: Convolutional neural network; a type of neural network that works particularly well for computer vision tasks
Classification vs Regression
  • A classification model is one which attempts to predict a class, or category
  • A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location.
Metric vs Loss
  • Loss: a "measure of performance" easy for stochastic gradient descent to use
  • Metric: easy for you to understand and close to what you want the model to do
Overfitting

Overfitting is the single most important and challenging issue when training models.

The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images.

Tip: Only Avoid Overfitting When Needed: We often see practitioners using overfitting-avoidance techniques even when they have enough data that they don't need them, ending up with a model that may be less accurate than what they could have achieved.

Use Judgment in Defining Test Sets

Suppose you want to split time series data into training and validation sets.

A random subset is a poor choice: it makes the problem too easy (the model can interpolate between surrounding time points) and doesn't reflect the real task of predicting the future from the past.

Instead, use the earlier data as your training set (and the later data for the validation set), as sketched below.
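A minimal sketch with pandas (the DataFrame and column names here are hypothetical):

import pandas as pd

# Toy time series: 100 consecutive days of data
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=100),
                   'value': range(100)})

df = df.sort_values('date')
cutoff = int(len(df) * 0.8)
train_df = df.iloc[:cutoff]   # the earlier 80% of the timeline for training
valid_df = df.iloc[cutoff:]   # the most recent 20% held out for validation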

In the Kaggle distracted driver competition, the independent variables are pictures of drivers at the wheel of a car, and the dependent variables are categories such as texting, eating, or safely looking ahead. If you put one of the images in your training set and one in the validation set, your model will have an easy time making a prediction for the one in the validation set, so it will seem to be performing better than it would on new people. Another perspective is that if you used all the people in training your model, your model might be overfitting to particularities of those specific people, and not just learning the states (texting, eating, etc.).
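One way to get such a split is to hold out whole people rather than individual images (a sketch with hypothetical driver IDs):

import random

# (filename, driver id) pairs; in the real competition each driver appears in many images
images = [('img_001.jpg', 'p01'), ('img_002.jpg', 'p01'),
          ('img_003.jpg', 'p02'), ('img_004.jpg', 'p02'),
          ('img_005.jpg', 'p03'), ('img_006.jpg', 'p03')]

random.seed(42)
driver_ids = sorted({d for _, d in images})
valid_drivers = set(random.sample(driver_ids, k=1))   # hold out entire drivers

train = [(f, d) for f, d in images if d not in valid_drivers]
valid = [(f, d) for f, d in images if d in valid_drivers]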

Note: Validation Sets and Test Sets: Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves.

Before the Next Lesson

  • Set up all the tools
  • Get comfortable with the notebook, docs, and style of writing Python