How to build an AI app that classifies images of dogs according to their breed?

This is the Capstone Project of the Data Scientist Nanodegree from Udacity

15 min read · Jun 17, 2021

Project Overview

This project is one of the Data Scientist Capstone projects. It uses a convolutional neural network (CNN) to build a pipeline that processes real-world, user-supplied images. Given an image of a dog, it estimates the canine’s breed. Given an image of a human, it estimates the dog breed that the person most resembles. Cool, right? You can also use it as part of a mobile or web app.

Problem Statement

Given an image of a dog or a human, of any size, this app must accurately detect whether the image contains a dog or a human. If the image contains a dog, the app has to classify the dog breed. If it contains a human, the app has to tell you which dog breed that human resembles.

The task of assigning breed to dogs from images is considered exceptionally challenging. To see why, consider that even a human would have trouble distinguishing between a Brittany and a Welsh Springer Spaniel.

It is not difficult to find other dog breed pairs with minimal inter-class variation (for instance, Curly-Coated Retrievers and American Water Spaniels). Likewise, recall that labradors come in yellow, chocolate, and black. Your vision-based algorithm will have to conquer this high intra-class variation to determine how to classify all of these different shades as the same breed.

I also want to mention that random chance presents an exceptionally low bar: setting aside the fact that the classes are slightly imbalanced, a random guess will provide a correct answer roughly 1 in 133 times, which corresponds to an accuracy of less than 1%.

The steps to achieve our dog app are:

Import Datasets
Detect Humans
Detect Dogs
Create a CNN to Classify Dog Breeds (from Scratch)
Create a CNN to Classify Dog Breeds (using Transfer Learning)
Write the Algorithm
Test the Algorithm

For this app, I chose PyTorch because it makes it easy to develop and debug deep learning models built from CNNs and pretrained networks such as VGG16, DenseNet or ResNet, to name a few.

Metrics

To evaluate model performance I used accuracy, which is defined as the number of correct predictions divided by the total number of predictions. Accuracy tells us the fraction of predictions our model got right, which makes it easy to understand, and it is one of the most widely used evaluation metrics for classification models.

Accuracy = number of correct predictions / total number of predictions (definition provided by the ML Crash Course by Google)

I used accuracy to assess the human and dog detectors on the first 100 images from each dataset. I also used accuracy to evaluate the from-scratch CNN model and the one created using transfer learning, counting how many test set images each model got right. There are other metrics we could have looked at, like precision and recall.
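As a minimal sketch of this metric, assuming the predictions and the true labels are available as plain lists of class indices (the function name `accuracy` and the sample values are illustrative, not taken from the project code):

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Illustrative values only: 3 of the 4 predictions match their label
print(accuracy([4, 17, 17, 132], [4, 17, 20, 132]))  # 0.75
```

The same function works for the detectors if a detection is encoded as 1 and a miss as 0.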

As for the training, I used the CrossEntropyLoss loss function provided by PyTorch out of the box. It takes the raw scores (logits) that the model outputs for each class, applies a log-softmax to turn them into log-probabilities, and returns the negative log-probability assigned to the correct class. This loss, also known as log loss, heavily penalizes confident wrong classifications, which makes it a good fit for a classification problem with C classes (here, C = 133).
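To make the loss concrete, here is a small plain-Python sketch of the cross-entropy computation for a single sample. It is meant to illustrate the formula, not to replace PyTorch's implementation; the function name `cross_entropy` and the example scores are assumptions made for this illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy loss for one sample.

    logits: raw, unnormalized scores, one per class
    target: index of the correct class
    """
    # subtract the max score for numerical stability (the result is unchanged)
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # negative log-softmax of the target class
    return log_sum_exp - logits[target]


# A model that scores all 133 breeds equally has a loss of ln(133)
uniform_loss = cross_entropy([0.0] * 133, target=0)
print(round(uniform_loss, 2))  # 4.89

# A confident wrong answer costs far more than a confident right one
print(cross_entropy([5.0, 0.0, 0.0], target=0) < cross_entropy([5.0, 0.0, 0.0], target=1))  # True
```

A loss close to 4.89 therefore means the model is doing no better than guessing uniformly among the 133 breeds.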

To optimize the loss, I first computed the gradient of the loss with respect to the model’s parameters and then used the stochastic gradient descent (SGD) optimizer provided by PyTorch to perform a single optimization step that updates the parameters. This is the typical setup for neural networks: backpropagation computes the gradients, and the optimizer uses them to update the weights. For the validation, I only calculated the cross-entropy loss, without the backward pass and the optimization step, as we only want to evaluate our trained model.

For the model created using transfer learning, I used the Adam optimizer to minimize the loss. Adam differs from classical stochastic gradient descent: plain SGD maintains a single learning rate (termed alpha) for all weight updates, and that learning rate does not change during training. I chose Adam because it computes an individual adaptive learning rate for each parameter, which in practice often helps the model train faster and converge to a good solution.
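The difference between the two update rules can be sketched for a single scalar parameter in plain Python. This is an illustration of the textbook formulas, not the project's training code; the function names are made up for the example, and the default hyperparameters are the usual Adam defaults.

```python
import math


def sgd_step(param, grad, lr=0.01):
    """Plain SGD: the same learning rate scales every gradient."""
    return param - lr * grad


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    state holds the running averages m and v and the step counter t.
    """
    state["t"] += 1
    # exponential moving averages of the gradient and the squared gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # bias correction for the zero initialization of m and v
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Two parameters whose gradients differ by a factor of 1000
for grad in (0.01, 10.0):
    sgd_move = 1.0 - sgd_step(1.0, grad)
    adam_move = 1.0 - adam_step(1.0, grad, {"m": 0.0, "v": 0.0, "t": 0})
    print(f"grad={grad}: SGD moves {sgd_move:.5f}, Adam moves {adam_move:.5f}")
```

With SGD the size of the move is proportional to the gradient, while Adam's first step is close to the learning rate for both parameters, because each gradient is divided by its own running magnitude.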

Data Exploration and visualization

Datasets and Inputs

The datasets were provided by Udacity and contain the following numbers of images:

There are 13233 total human images.
There are 8351 total dog images.

Data visualization

In this part, I’ll build data visualisations to further convey the information associated with our dataset.

I will explore the dogs dataset to check:

how the number of images is distributed among dog breeds;
how many unique dog breeds there are;
whether there is a significant difference between the number of images for each breed.

We can get the dog breed from the paths of the dog images.

We can see that our filename structure contains the dog breed. Therefore, I will split the filename and get the dog breed information from it.

Now we have our dog breed, but it needs another split: it still contains a number followed by a dot.
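The two splits described above can be sketched as follows. The exact directory layout is an assumption made for this illustration: the paths below are hypothetical and assume that each image sits in a folder named with a number, a dot, and the breed name. The helper name `breed_from_path` is also made up for the example.

```python
from collections import Counter
from pathlib import PurePosixPath


def breed_from_path(img_path):
    """Extract the breed name from a path such as
    'dog_images/train/001.Affenpinscher/Affenpinscher_00001.jpg'."""
    # the parent folder is assumed to be named '<number>.<Breed_name>'
    folder = PurePosixPath(img_path).parent.name
    # drop the numeric prefix and the dot, then make the name readable
    _, _, breed = folder.partition(".")
    return breed.replace("_", " ")


# Hypothetical paths, used only to illustrate the assumed layout
paths = [
    "dog_images/train/001.Affenpinscher/Affenpinscher_00001.jpg",
    "dog_images/train/001.Affenpinscher/Affenpinscher_00002.jpg",
    "dog_images/train/005.Alaskan_malamute/Alaskan_malamute_00296.jpg",
]
counts = Counter(breed_from_path(p) for p in paths)
print(counts.most_common())
# [('Affenpinscher', 2), ('Alaskan malamute', 1)]
```

Counting the extracted names with a `Counter` gives both the number of unique breeds (`len(counts)`) and the number of images per breed.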

We have 133 unique dog breeds in our dataset. The number of images per breed ranges from a maximum of 96 for the Alaskan Malamute to a minimum of 33 for the Norwegian Buhund and the Xoloitzcuintli.

These are the top ten dog breeds in our dataset, at around 90 images each.

These are the bottom ten dog breeds in our dataset, at around 30–40 images each.

Methodology

Data Preprocessing

For the human detector:

the images are converted to grayscale.

# convert BGR image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

For the dog detector:

the image is resized so that its shorter side is 255 pixels;
cropped at the center to 224*224 pixels;
converted into a tensor;
normalized with a per-channel mean and standard deviation.

# transform the image
in_transform = transforms.Compose([
                      transforms.Resize(255),
                      transforms.CenterCrop(224),
                      transforms.ToTensor(),
                      transforms.Normalize([0.485, 0.456, 0.406],
                                           [0.229, 0.224, 0.225])])
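To see what these transforms do to an image's geometry and pixel values, here is a plain-Python sketch of the arithmetic. It approximates the behavior of the resize, center crop and normalize steps and may differ from torchvision by a pixel due to rounding; the helper names and the 500x375 example photo are assumptions made for this illustration.

```python
def resize_shorter_side(width, height, size=255):
    """Scale (width, height) so the shorter side equals `size`,
    keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop*crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize(value, mean, std):
    """Normalize one channel value that is already scaled to [0, 1]."""
    return (value - mean) / std


# A hypothetical 500x375 photo
w, h = resize_shorter_side(500, 375)
print((w, h))                 # (340, 255)
print(center_crop_box(w, h))  # (58, 15, 282, 239)

# A red-channel value equal to the channel mean maps to zero
print(normalize(0.485, mean=0.485, std=0.229))  # 0.0
```

Whatever the original size, the model therefore always receives a 224*224 input whose channels are roughly centered around zero.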

For the CNN models

I created three separate data loaders for the training, validation, and test datasets of dog images. You may find this documentation on custom datasets to be a useful resource. If you are interested in augmenting your training and/or validation data, check out the wide variety of transforms!

For the training data, I performed simple data augmentation: a random rotation within a range of +/-30 degrees, a random resized crop to 224*224, and a random horizontal flip. Because the orientation of a dog in a photo doesn’t change its breed, training on randomly rotated images makes the model more robust to rotation, and together these methods help reduce overfitting.
For the validation and testing data, images are only resized and center cropped to a 224*224 size.
In all cases, the images are then converted into tensors so they are a valid input for the model, and the tensors are normalized with a mean and standard deviation.

# Define transforms for the training data and testing data
data_transforms = {'train' : transforms.Compose(
                    [transforms.RandomRotation(30),                                        
                     transforms.RandomResizedCrop(224),
                     transforms.RandomHorizontalFlip(),
                     transforms.ToTensor(),
                     transforms.Normalize([0.485, 0.456, 0.406],
                                          [0.229, 0.224, 0.225])]),
                    'test' : transforms.Compose(
                    [transforms.Resize(255),
                     transforms.CenterCrop(224),
                     transforms.ToTensor(),
                     transforms.Normalize([0.485, 0.456, 0.406],
                                          [0.229, 0.224, 0.225])]),
                    'valid' : transforms.Compose(
                    [transforms.Resize(255),
                     transforms.CenterCrop(224),
                     transforms.ToTensor(),
                     transforms.Normalize([0.485, 0.456, 0.406],
                                          [0.229, 0.224, 0.225])])}

Implementation

How to detect humans in an image?

In this section, I used OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images.

To detect a human face, we create a function, shown in the code block below, that takes a string-valued file path to an image as input and returns True if a face is found.

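import cv2

# Assumption: face_cascade is a cv2.CascadeClassifier that was created earlier
# from one of OpenCV's pre-trained frontal-face Haar cascade XML files.

def face_detector(img_path):
    """
        INPUT
            img_path - a string-valued file path to an image
        OUTPUT
            returns True if a face is detected in the image stored at img_path
    """
    # OpenCV loads images in BGR channel order
    img = cv2.imread(img_path)
    # the cascade classifier works on grayscale images
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # one bounding box per detected face
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0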

Assess the Human Face Detector

To test the performance of our face detector, we calculate the percentage of the first 100 images in human_files and the percentage of the first 100 images in dog_files that have a detected human face. Ideally, we would like 100% of the human images and 0% of the dog images to have a detected face. Our algorithm falls short of this goal, but still gives acceptable performance: in the first 100 images in human_files, 98% have a detected human face, and in the first 100 images in dog_files, 17% have a detected human face.
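The assessment boils down to counting how often a detector fires on a list of images. A minimal sketch, with a stand-in detector so that it runs without any images (the names `detection_rate` and `fake_detector` and the sample paths are hypothetical):

```python
def detection_rate(detector, paths):
    """Percentage of paths for which detector(path) returns True."""
    if not paths:
        raise ValueError("paths must not be empty")
    hits = sum(1 for path in paths if detector(path))
    return 100.0 * hits / len(paths)


# Stand-in detector for illustration: flags paths that contain 'human'.
# In the project, face_detector and the real image lists would be passed in.
def fake_detector(path):
    return "human" in path


sample_paths = ["human_1.jpg", "human_2.jpg", "dog_1.jpg", "dog_2.jpg"]
print(detection_rate(fake_detector, sample_paths))  # 50.0
```

With the real objects, the two numbers above would come from calls such as `detection_rate(face_detector, human_files[:100])` and `detection_rate(face_detector, dog_files[:100])`.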

How to detect dogs in an image?

To accurately detect dogs in an image, I used a VGG-16 model pre-trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks.

This is the code that will return a prediction (derived from the 1000 possible categories in ImageNet) for the object that is contained in the image.

import torch
from PIL import Image
import torchvision.transforms as transforms

# VGG16 (the pre-trained model) and use_cuda are assumed to be defined earlier

def VGG16_predict(img_path):
    '''
    Use pre-trained VGG-16 model to obtain index corresponding to 
    predicted ImageNet class for image at specified path.
    
    INPUT:
        img_path - a string-valued file path to an image
    OUTPUT:
        Index (int) corresponding to VGG-16 model's prediction
    '''
    
    # load the image and force three RGB channels
    # (drops any alpha channel and expands grayscale images)
    image = Image.open(img_path).convert('RGB')
    # transform the image
    in_transform = transforms.Compose([
                        transforms.Resize(255),
                        transforms.CenterCrop(224),
                        transforms.ToTensor(),
                        transforms.Normalize([0.485, 0.456, 0.406],
                                    [0.229, 0.224, 0.225])])

    # apply the transforms and add the batch dimension
    image = in_transform(image).unsqueeze(0)
    
    if use_cuda:
        image = image.cuda()
    
    # evaluation mode disables dropout
    VGG16.eval()
    # no gradients are needed for inference
    with torch.no_grad():
        output = VGG16(image)
    
    # index of the highest-scoring class, as a plain Python int
    return output.argmax(dim=1).item()

Our dog detector does an excellent job at detecting dogs in images. It has 100% accuracy on the first 100 images of dogs and humans. Now, we want to be able to classify the dog breed.
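The dog detector itself only needs to check whether the predicted ImageNet index belongs to a dog category. A minimal sketch, assuming the standard ImageNet class ordering, in which the dog categories occupy indices 151 through 268 inclusive. The helper names and the way the predictor is passed in are illustrative choices, not the project's code:

```python
# Assumption: standard ImageNet-1k class ordering, where the dog categories
# ('Chihuahua' through 'Mexican hairless') are indices 151-268 inclusive.
FIRST_DOG_INDEX = 151
LAST_DOG_INDEX = 268


def is_dog_index(class_index):
    """True if an ImageNet class index falls in the dog range."""
    return FIRST_DOG_INDEX <= class_index <= LAST_DOG_INDEX


def make_dog_detector(predict):
    """Build a detector from any callable that maps an image path to an
    ImageNet class index (for example, VGG16_predict)."""
    def dog_detector(img_path):
        return is_dog_index(predict(img_path))
    return dog_detector


# Stand-in predictor for illustration: pretends every image is class 200
dog_detector = make_dog_detector(lambda img_path: 200)
print(dog_detector("any_image.jpg"))  # True
```

Passing the predictor in as an argument keeps the range check testable without loading a model.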

Try to create my own CNN to classify dog breeds?

Now that we have functions for detecting humans and dogs in images, we need a way to predict dog breed from images. In this step, I will create a CNN that classifies dog breeds, aiming for a test accuracy of at least 10%.

Baseline Model Architecture

This is the code for my model:

import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import ImageFile

# allow PIL to load truncated image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

# number of dog breeds in the dataset
num_classes = 133

class Net(nn.Module):
    """
    A class to define the CNN architecture of our model.
    """
    def __init__(self):
        """
            In the constructor we define three convolutional layers, a max pooling layer, 
            two fully connected layers and a dropout layer with a drop probability of 0.3.
        """
        super(Net, self).__init__()
        ## Define layers of a CNN
        # input 224*224*3 -> conv1 (stride 2): 112*112*32 -> pool: 56*56*32
        self.conv1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        # 56*56*32 -> conv2 (stride 2): 28*28*64 -> pool: 14*14*64
        self.conv2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        # 14*14*64 -> conv3 (stride 1): 14*14*128 -> pool: 7*7*128
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)

        # pool
        self.pool = nn.MaxPool2d(2, 2)
        
        # fully-connected
        self.fc1 = nn.Linear(7*7*128, 500)
        self.fc2 = nn.Linear(500, num_classes) 
        
        # drop-out
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        """
            A function that defines forward behavior. In the forward function we accept a Tensor of input data 
            and we must return a Tensor of output data. We use the modules defined in the constructor.
            INPUT
                x - a Tensor of input data
            OUTPUT
                x - a Tensor of output data (one raw score per class)
        """
        
        # Pass data through conv1
        # Use the rectified-linear activation function over x
        x = F.relu(self.conv1(x))
        # Run max pooling over x
        x = self.pool(x)
        # Pass data through conv2
        # Use the rectified-linear activation function over x
        x = F.relu(self.conv2(x))
        # Run max pooling over x
        x = self.pool(x)
        # Pass data through conv3
        # Use the rectified-linear activation function over x
        x = F.relu(self.conv3(x))
        # Run max pooling over x
        x = self.pool(x)
        
        # flatten the tensor
        x = x.view(-1, 7*7*128)
        
        # Pass data through dropout
        x = self.dropout(x)
        # Pass data through fc1 and apply relu
        x = F.relu(self.fc1(x))
        
        # Pass data through dropout
        x = self.dropout(x)
        # Pass data through fc2 to get the raw class scores (no activation here)
        x = self.fc2(x)
        return x

How did I get to my final CNN architecture?

Net(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6272, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=133, bias=True)
  (dropout): Dropout(p=0.3)
)

First, I tried different CNN model architectures to classify dog breeds. I found that this particular one, which reduces the input’s x-y dimensions by a factor of 2 to the power of 5 (from 224 to 7) and increases its depth from 3 to 128, is suitable for my task of obtaining an accuracy of at least 10%.

The model’s architecture contains 3 convolutional layers which gradually increase the depth of the input from 3 to 32 to 64 and finally to 128. They all have a convolution kernel size of 3*3. These image filters extract features like the edges of objects, which allows us to make better predictions. In order to slide the 3*3 kernel over all 224 pixels along each side of the input, we add a padding of 1, which surrounds the image with a border of zeros (black pixels). The stride of the convolution, or the amount by which the filter slides over the image, is 2 in the first two layers, which makes their output about half the width and height of their input. The third convolutional layer uses a stride of 1 and keeps the width and height unchanged.

A ReLU activation function is applied to the output of each convolutional layer; it sets negative values to zero and introduces the non-linearity the network needs. After each convolutional layer, to reduce the x-y dimensions by a factor of two, I apply a 2*2 max pooling layer. In each 2*2 window it returns the maximum of the pixels contained in the window: it sees 4 pixels and returns one, halving both the width and the height.

As it moves through the model, the input array transforms its dimensions:

From 224*224*3
After first convolutional layer: 112*112*32
After first Max pooling layer: 56*56*32
After second convolutional layer: 28*28*64
After second Max pooling layer: 14*14*64
After third convolutional layer (has a stride of 1): 14*14*128
After third Max pooling layer: 7*7*128
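These dimensions can be checked with the usual output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The following sketch replays the layer settings from the model above in plain Python; the helper name `conv_out` is an assumption made for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
shapes = []
# (kernel, stride, padding, output depth) for conv1, conv2 and conv3
for kernel, stride, padding, depth in [(3, 2, 1, 32), (3, 2, 1, 64), (3, 1, 1, 128)]:
    size = conv_out(size, kernel, stride, padding)  # convolution
    shapes.append((size, size, depth))
    size = conv_out(size, kernel=2, stride=2)       # 2*2 max pooling
    shapes.append((size, size, depth))

print(shapes)
# [(112, 112, 32), (56, 56, 32), (28, 28, 64), (14, 14, 64), (14, 14, 128), (7, 7, 128)]
print(size * size * 128)  # 6272, the in_features of fc1
```

The final product, 7*7*128 = 6272, is why the first fully connected layer is declared with 6272 input features.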

Then, the array is flattened and passed through two fully connected layers, which give us the predictions. Dropout with a probability of 0.3 is applied before each fully connected layer, and a ReLU activation function follows the first one.

How well did my model perform?

After 50 epochs I got a test accuracy of 25%, which is not that great.

Refinement

Let’s use Transfer Learning to create a CNN to classify dog breeds

We saw that training a model from scratch to predict dog breeds is not a very promising strategy. Therefore, we need a way to improve our results, and I will use transfer learning to do so. This is the code that uses a pretrained densenet121 model to classify dog breeds:

import torchvision.models as models
import torch.nn as nn

model_transfer = models.densenet121(pretrained=True)

# freeze parameters so we won't backprop through them
for param in model_transfer.features.parameters():
    param.requires_grad = False
    
## Specify model architecture 

# get the in_features from classifier
n_inputs = model_transfer.classifier.in_features
# replace the classifier with a new layer that has one output per dog breed
model_transfer.classifier = nn.Linear(n_inputs, num_classes)

I’ll outline the steps I took to get to my final CNN architecture and my reasoning at each step.

Trained a convolutional neural network for image classification with transfer learning.
Initialized the pretrained densenet121 model and froze the parameters of its feature extractor.
Replaced the final layer so that it has the same number of outputs as the number of classes in our dataset.
Defined Adam as the optimization algorithm that updates the classifier’s parameters.
Defined the loss function and trained the model for three epochs, which gave an accuracy of 80% on the test set.

Results

Model Evaluation and Validation

These are the results of our training step:

And the test set:

So, as we can see, after just three epochs I got an 80% test accuracy and a 0.67 test loss. Cool, right? Now we can really build our dog app.

Justification

To improve the performance of the DenseNet-121 model, I used the same data augmentation and preprocessing as in the Data Preprocessing section:

For the training data, I performed simple data augmentation: a random rotation within a range of +/-30 degrees, a random resized crop to 224*224, and a random horizontal flip. These methods help reduce overfitting.
For the validation and testing data, images are only resized and center cropped to a 224*224 size.
In all cases, the images are then converted to tensors and normalized with a mean and standard deviation.

The DenseNet-121 model performed well: in only three epochs of training I obtained an accuracy of 80% on the test set. The training loss started at about 3 and decreased to about 1.24. The validation loss went from about 1.26 to 0.63, which is consistent with our test loss of about 0.67.

Write the algorithm

We have all the building blocks to build our app. But first, we need a method that predicts the dog breed with the transfer learning model. So, I implemented a function that takes an image path as input and returns the dog breed (Affenpinscher, Afghan hound, etc.) that is predicted by the model.

Now, we have to write an algorithm that accepts a file path to an image and first determines whether the image contains a human, dog, or neither. Then,

if a dog is detected in the image, return the predicted breed.
if a human is detected in the image, return the resembling dog breed.
if neither is detected in the image, provide output that indicates an error.

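from PIL import Image
import matplotlib.pyplot as plt

def run_app(img_path):
    """
    INPUT
        img_path - a string-valued file path to an image
    OUTPUT
        None; shows the image and prints the predicted or resembling breed
    """
    ## handle cases for a human face, dog, and neither
    # display the image
    image = Image.open(img_path)
    plt.imshow(image)
    plt.show()
    
    # check for a dog first, then for a human face
    if dog_detector(img_path):
        print(predict_breed_transfer(img_path))
    elif face_detector(img_path):
        print(f"You look like a {predict_breed_transfer(img_path)}")
    else:
        print('Error')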
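The order of the checks matters: the face detector also fired on 17% of the dog images, while the dog detector made no mistakes on the first 100 images, so testing for a dog first avoids labelling dogs as humans. Here is a sketch of the same decision logic as a pure function that returns the message, with the detectors passed in as arguments so it can be tried without images or models (the name `describe_image` and the stand-in lambdas are hypothetical):

```python
def describe_image(img_path, is_dog, is_face, predict_breed):
    """Return the app's message for one image.

    is_dog, is_face: callables returning True/False for a path
    predict_breed:   callable returning a breed name for a path
    """
    # check for a dog first: the face detector also fires on some dog images
    if is_dog(img_path):
        return predict_breed(img_path)
    if is_face(img_path):
        return f"You look like a {predict_breed(img_path)}"
    return "Error"


# Stand-ins for the real detectors and model, for illustration only
message = describe_image(
    "selfie.jpg",
    is_dog=lambda path: False,
    is_face=lambda path: True,
    predict_breed=lambda path: "Afghan hound",
)
print(message)  # You look like a Afghan hound
```

Returning the message instead of printing it also makes the function easier to reuse from a web or mobile backend.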

Let’s test it

Conclusions

Creating a CNN to classify dog breeds from scratch doesn’t give great accuracy. After training my from-scratch model for 50 epochs I got an accuracy of roughly 25%, which is not that great, and the model took some time to train.
On the other hand, using transfer learning to create a CNN that can identify dog breeds from images proved to be better suited to our task. I used a pre-trained densenet121 model, froze its parameters and replaced the final layer so that it has the same number of outputs as the number of classes in our dataset. I trained the model for only three epochs and got an accuracy of about 80%. This is a great improvement, which lets us use this model for our app.

Reflection and Improvement

The model does a pretty good job at classifying dog breeds and gives a human a funny dog breed estimator but there are ways to improve our algorithm.

We can finetune the transfer model to obtain higher accuracy on the dog breed classifier.
We can add more real-world images in our dataset for the model to train and improve performance.
We can add more data augmentation transforms so that the model sees variation of the input images.

Also, you can find the whole project on GitHub by following this link.