Image by VislazFotosas via Pixabay

Conquer the basics of CNNs and image classification in mere minutes

This article first appeared in Towards Data Science

Curious about image classification?

“One thing that struck me early is that you don’t put into a photograph what’s going to come out. Or, vice versa, what comes out is not what you put in.”

― Diane Arbus

A notification pops up on your favorite social network that someone posted a picture that might have you in it.

It’s right.

It’s the worst picture of you ever.

How did that happen?

Image classification!

The convolutional neural network (CNN) is a class of deep learning neural networks. CNNs represent a huge breakthrough in image recognition. They’re most commonly used to analyze visual imagery and frequently work behind the scenes in image classification. They can be found at the core of everything from Facebook’s photo tagging to self-driving cars, in fields from healthcare to security.

They’re fast and they’re efficient. But how do they work?

Image classification is the process of taking an input (like a picture) and outputting a class (like “cat”) or a probability that the input is a particular class (“there’s a 90% probability that this input is a cat”). You can look at a picture and know that you’re looking at a terrible shot of your own face, but how can a computer learn to do that?

With a convolutional neural network!

A CNN has

  • Convolutional layers
  • ReLU layers
  • Pooling layers
  • A fully connected layer

A classic CNN architecture would look something like this:

Input -> Convolution -> ReLU -> Convolution -> ReLU -> Pooling ->
ReLU -> Convolution -> ReLU -> Pooling -> Fully Connected

A CNN convolves (not convolutes…) learned features with input data and uses 2D convolutional layers. This means that this type of network is ideal for processing 2D images. Compared to other image classification algorithms, CNNs actually use very little preprocessing. This means that they can learn the filters that have to be hand-made in other algorithms. CNNs can be used in tons of applications from image and video recognition, image classification, and recommender systems to natural language processing and medical image analysis.

CNNs are inspired by biological processes. They’re based on some cool research done by Hubel and Wiesel in the 60s regarding vision in cats and monkeys. The pattern of connectivity in a CNN comes from their research regarding the organization of the visual cortex. In the visual cortex, individual neurons respond to visual stimuli only in a restricted region of the visual field, which is called the receptive field. The receptive fields of different neurons partially overlap so that the entire field of vision is covered. A CNN works in a similar way!

Image by NatWhitePhotography on Pixabay

CNNs have an input layer, an output layer, and hidden layers. The hidden layers usually consist of convolutional layers, ReLU layers, pooling layers, and fully connected layers.

  • Convolutional layers apply a convolution operation to the input. This passes the information on to the next layer.
  • Pooling combines the outputs of clusters of neurons into a single neuron in the next layer.
  • Fully connected layers connect every neuron in one layer to every neuron in the next layer.

In a convolutional layer, neurons only receive input from a subarea of the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer.

A CNN works by extracting features from images. This eliminates the need for manual feature extraction. The features are not designed by hand! They’re learned while the network trains on a set of images. This is a big part of why deep learning models are so accurate for computer vision and image classification tasks. CNNs learn feature detection through tens or hundreds of hidden layers. Each layer increases the complexity of the learned features.


A CNN

  • starts with an input image
  • applies many different filters to it to create feature maps
  • applies a ReLU function to increase non-linearity
  • applies a pooling layer to each feature map
  • flattens the pooled images into one long vector
  • inputs the vector into a fully connected artificial neural network
  • processes the features through the network. The final fully connected layer provides the “voting” of the classes that we’re after.
  • trains through forward propagation and backpropagation for many, many epochs. This repeats until we have a well-defined neural network with trained weights and feature detectors.
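To make the steps above a little more concrete, here is a minimal PyTorch sketch of that pipeline. The class name `TinyCNN`, the 28×28 input size, the four filters, and the two output classes are all assumptions made up for illustration; a real classifier would be deeper and sized for its own data.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Hypothetical example: conv -> ReLU -> pool -> flatten -> fully connected."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3),  # 4 filters produce 4 feature maps (28 -> 26)
            nn.ReLU(),                       # set negative values to zero
            nn.MaxPool2d(kernel_size=2),     # halve height and width (26 -> 13)
        )
        # 4 pooled maps of 13x13 values each, flattened into one vector
        self.classifier = nn.Linear(4 * 13 * 13, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)    # keep the batch dimension
        return self.classifier(x)            # one raw score per class

model = TinyCNN()
batch = torch.rand(8, 1, 28, 28)             # 8 made-up grayscale images
scores = model(batch)
print(scores.shape)
```

The output has one row per image and one column per class. Those raw scores are what gets turned into probabilities later on.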

So what does that mean?

At the very beginning of the image classification process, an input image is broken down into pixels.

For a black and white image, those pixels are interpreted as a 2D array (for example, 2×2 pixels). Every pixel has a value between 0 and 255. (Zero is completely black and 255 is completely white. The greyscale exists between those numbers.) Based on that information, the computer can begin to work on the data.

For a color image, this is a 3D array with a blue layer, a green layer, and a red layer. Each one of those colors has its own value between 0 and 255. The color can be found by combining the values in each of the three layers.
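Here is a small NumPy sketch of what those arrays look like. The pixel values are made up, and the blue, green, red ordering follows the OpenCV convention used in the code later in this article.

```python
import numpy as np

# A made-up 2x2 grayscale image: one value per pixel (0 = black, 255 = white)
gray = np.array([[0, 128],
                 [200, 255]], dtype=np.uint8)

# A made-up 2x2 color image: three values per pixel, in blue, green, red order
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # pure blue
color[1, 1] = [0, 0, 255]   # pure red

print(gray.shape)    # (2, 2)
print(color.shape)   # (2, 2, 3)
```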

What are the basic building blocks of a CNN?


Convolutional layer

The main purpose of the convolution step is to extract features from the input image. The convolutional layer is always the first layer in a CNN.

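You have an input image, a feature detector (also called a filter or kernel), and a feature map. You take the filter and apply it pixel block by pixel block to the input image. At each position, you multiply the filter values by the pixel values underneath them, element by element, and add up the results.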

Let’s say you have a flashlight and a sheet of bubble wrap. Your flashlight shines a 5-bubble x 5-bubble area. To look at the entire sheet, you would slide your flashlight across each 5×5 square until you’d seen all the bubbles.

Photo by stux on Pixabay

The light from the flashlight here is your filter and the region it shines on is the receptive field. Sliding the light across the sheet is your flashlight convolving. Your filter is an array of numbers (also called weights or parameters). The distance the light from your flashlight slides as it travels (are you moving your filter over one row of bubbles at a time? Two?) is called the stride. For example, a stride of one means that you’re moving your filter over one pixel at a time. Strides of one and two are the most common choices.

The depth of the filter has to be the same as the depth of the input, so if we were looking at a color image, the depth would be 3. That makes the dimensions of this filter 5x5x3. In each position, the filter multiplies the values in the filter with the pixel values of the image underneath it. This is element-wise multiplication. The multiplications are summed up, creating a single number. If you started at the top left corner of your bubble wrap, this number is representative of the top left corner. Now you move your filter to the next position and repeat the process all around the bubble wrap. The array you end up with is called a feature map or an activation map! You can use more than one filter, which will do a better job of preserving spatial relationships.
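The sliding, multiplying, and summing can be sketched in a few lines of NumPy. This is a simplified illustration, not how a deep learning library does it internally: the function name `convolve2d` is made up, it handles a single-channel image with no padding, and, like most deep learning libraries, it does not flip the filter.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k_h, c:c + k_w]
            # Element-wise multiplication, then sum to a single number
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A made-up 4x4 image: dark on the left, bright on the right
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A made-up 2x2 filter that responds to dark-to-bright vertical edges
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[0. 2. 0.]
#  [0. 2. 0.]
#  [0. 2. 0.]]
```

The feature map lights up only in the middle column, which is exactly where the edge sits in the input.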

You’ll specify parameters like the number of filters, the filter size, the architecture of the network, and so on. The CNN learns the values of the filters on its own during the training process. You have a lot of options that you can work with to make the best image classifier possible for your task. You can choose to pad the input matrix with zeros (zero padding) to apply the filter to bordering elements of the input image matrix. This also allows you to control the size of the feature maps. Adding zero padding is wide convolution. Not adding zero padding is narrow convolution.
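To see how filter size, padding, and stride control the size of the feature map, you can use the standard output-size formula, (input + 2 × padding − filter) / stride + 1, rounded down. The helper name `conv_output_size` and the 32-pixel input are just assumptions for this sketch.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Width (or height) of the feature map along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# A 32-pixel-wide input and a 5-pixel-wide filter
print(conv_output_size(32, 5))              # 28: no padding shrinks the map
print(conv_output_size(32, 5, padding=2))   # 32: zero padding keeps the size
print(conv_output_size(32, 5, stride=2))    # 14: a larger stride shrinks it faster
```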

This is basically how we detect images! We don’t look at every single pixel of an image. We see features like a hat, a red dress, a tattoo, and so on. There’s so much information going into our eyes at all times that we couldn’t possibly deal with every single pixel of it. We’re allowing our model to do the same thing.

The result of this is the convolved feature map. It’s smaller than the original input image. This makes it easier and faster to deal with. Do we lose information? Some, yes. But at the same time, the purpose of the feature detector is to detect features, which is exactly what this does.

We create many feature maps to get our first convolutional layer. This allows us to identify many different features that the program can use to learn.

Feature detectors can be set up with different values to get different results. For example, one filter can sharpen an image and another can blur it. (A blur filter gives equal importance to all the values it covers.) You can do edge enhancement, edge detection, and more. You would do that by applying different feature detectors to create different feature maps. During training, the network learns which filter values are the most useful for the task.

The primary purpose here is to find features in your image, put them into a feature map, and still preserve the spatial relationship between pixels. That’s important so that the pixels don’t get all jumbled up.

Let’s visualize this stuff!

Say hello to my little friend:

Photo by Kirgiz03 on Pixabay

We’re going to use this guy for our input image.

We’ll make him black and white

import cv2
import matplotlib.pyplot as plt
%matplotlib inline
img_path = 'data/pixabay_Kirgiz03.jpg'
# Load color image 
bgr_img = cv2.imread(img_path)
# Convert to grayscale
gray_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
# Normalize, rescale entries to lie in [0,1]
gray_img = gray_img.astype("float32")/255
# Plot image
plt.imshow(gray_img, cmap='gray')

Let’s define and visualize our filters

import numpy as np
filter_vals = np.array([[-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1]])
print('Filter shape: ', filter_vals.shape)

Filter shape: (4, 4)

# Define four different filters, all of which are linear combinations of the `filter_vals` defined above
filter_1 = filter_vals
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])
# Print out the values of filter 1 as an example
print('Filter 1: \n', filter_1)

and we see:

Filter 1: 
 [[-1 -1  1  1]
 [-1 -1  1  1]
 [-1 -1  1  1]
 [-1 -1  1  1]]

Here’s a visualization of our four filters

Now let’s define a convolutional layer (I’m loving PyTorch right now, so that’s what we’re using here.)

import torch
import torch.nn as nn
import torch.nn.functional as F
# Neural network with one convolutional layer with four filters
class Net(nn.Module):
    def __init__(self, weight):
        super(Net, self).__init__()
        # Initializes the weights of the convolutional layer to be the weights of the 4 defined filters
        k_height, k_width = weight.shape[2:]
        # Assumes there are 4 grayscale filters
        self.conv = nn.Conv2d(1, 4, kernel_size=(k_height, k_width), bias=False)
        self.conv.weight = torch.nn.Parameter(weight)
    def forward(self, x):
        # Calculates the output of a convolutional layer pre- and post-activation
        conv_x = self.conv(x)
        activated_x = F.relu(conv_x)
        # Returns both layers
        return conv_x, activated_x
# Instantiate the model and set the weights
weight = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
model = Net(weight)
# Print out the layer in the network
print(model)

We’ll see

  (conv): Conv2d(1, 4, kernel_size=(4, 4), stride=(1, 1), bias=False)

Add a little more code

def viz_layer(layer, n_filters= 4):
    fig = plt.figure(figsize=(20, 20))
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i+1, xticks=[], yticks=[])
        # Grab layer outputs
        ax.imshow(np.squeeze(layer[0,i].data.numpy()), cmap='gray')
        ax.set_title('Output %s' % str(i+1))

Then a little more

# Plot original image
plt.imshow(gray_img, cmap='gray')
# Visualize all of the filters
fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(left=0, right=1.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
for i in range(4):
    ax = fig.add_subplot(1, 4, i+1, xticks=[], yticks=[])
    ax.imshow(filters[i], cmap='gray')
    ax.set_title('Filter %s' % str(i+1))
# Convert the image into an input tensor
gray_img_tensor = torch.from_numpy(gray_img).unsqueeze(0).unsqueeze(1)
# Get the convolutional layer (pre and post activation)
conv_layer, activated_layer = model(gray_img_tensor)
# Visualize the output of a convolutional layer
viz_layer(conv_layer)

And we can visualize the output of a convolutional layer before a ReLU activation function is applied!

Now let’s create a custom kernel using a Sobel operator as an edge detection filter. The Sobel filter is very commonly used in edge detection. It does a good job of finding patterns in intensity in an image. Applying a Sobel filter to an image is a way of taking an approximation of the derivative of the image separately in the x- or y-direction.

We’ll convert our little dude to grayscale for filtering

# bgr_img was loaded with cv2.imread above, so its channels are in BGR order
gray = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
plt.imshow(gray, cmap='gray')

Here we go!

# 3x3 array for edge detection
sobel_y = np.array([[ -1, -2, -1], 
                   [ 0, 0, 0], 
                   [ 1, 2, 1]])
sobel_x = np.array([[ -1, 0, 1], 
                   [ -2, 0, 2], 
                   [ -1, 0, 1]])
filtered_image = cv2.filter2D(gray, -1, sobel_y)
plt.imshow(filtered_image, cmap='gray')

Want to check out the math? Take a look at Introduction to Convolutional Neural Networks by Jianxin Wu

ReLU layer

The ReLU (rectified linear unit) layer is an additional step on top of our convolution layer. You’re applying an activation function to your feature maps to increase non-linearity in the network. This is because images themselves are highly non-linear! ReLU removes negative values from an activation map by setting them to zero.

Convolution is a linear operation with things like element-wise matrix multiplication and addition. The real-world data we want our CNN to learn will be non-linear. We can account for that with an operation like ReLU. You can use other operations like tanh or sigmoid. ReLU, however, is a popular choice because it can train the network faster without any major penalty to generalization accuracy.
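ReLU itself is tiny. Here is a NumPy sketch on a made-up 2×2 feature map (the values are invented for illustration):

```python
import numpy as np

def relu(x):
    """Replace every negative value with zero and leave the rest unchanged."""
    return np.maximum(x, 0)

feature_map = np.array([[-3.0, 2.0],
                        [0.5, -1.0]])
activated = relu(feature_map)
print(activated)
# [[0.  2. ]
#  [0.5 0. ]]
```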

Check out C.-C. Jay Kuo Understanding Convolutional Neural Networks With a Mathematical Model.

Want to dig deeper? Try Kaiming He, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

If you need a little more info about the absolute basics of activation functions, you can find that here!

Here’s how our little buddy is looking after a ReLU activation function turns all of the negative pixel values black



Pooling

The last thing you want is for your network to look for one specific feature in an exact shade in an exact location. That’s useless for a good CNN! You want images that are flipped, rotated, squashed, and so on. You want lots of pictures of the same thing so that your network can recognize an object (say, a leopard) in all the images. No matter what the size or location. No matter what the lighting or the number of spots, or whether that leopard is fast asleep or crushing prey. You want spatial invariance! You want flexibility. That’s what pooling is all about.

Pooling progressively reduces the size of the input representation. It makes it possible to detect objects in an image no matter where they’re located. Pooling helps to reduce the number of required parameters and the amount of computation required. It also helps control overfitting.

Overfitting can be kind of like when you memorize super specific details before a test without understanding the information. When you memorize details, you can do a great job with your flashcards at home. You’ll fail a real test, though, if you’re presented with new information.

(Another example: if all of the dogs in your training data have spots and dark eyes, your network will believe that for an image to be classified as a dog, it must have spots and dark eyes. If you test your network with that same training data, it will do an amazing job of classifying dogs correctly! But if your outputs are only “dog” and “cat,” and your network is presented with new images containing, say, a Rottweiler and a Husky, it will probably wind up classifying both the Rottweiler and the Husky as cats. You can see the problem!)

Photo by Hybrid on Unsplash

Without variance, your network will be useless with images that don’t exactly match the training data. Always, always, always keep your training and testing data separate! If you test with the data you trained on, your network has the information memorized! It will do a terrible job when it’s introduced to any new data.

Overfitting is not cool.

So for this step, you take the feature map, apply a pooling layer, and the result is the pooled feature map.

The most common example of pooling is max pooling. In max pooling, the input is partitioned into a set of areas that don’t overlap. The output of each area is the maximum value in that area. This makes a smaller output, with fewer parameters needed in the layers that follow.

Max pooling is all about grabbing the maximum value in each area of the feature map. With a 2×2 window and a stride of two, this gets rid of 75% of the values, the ones least likely to represent the feature. By taking the maximum value of the pixels, you’re accounting for distortion. If the feature shifts or rotates a little to the left or right or whatever, the pooled feature will stay roughly the same. You’re reducing the size and parameters. This is great because it means that the model won’t overfit on that information.

You could use average pooling or sum pooling, but they aren’t common choices. Max pooling tends to perform better than both in practice. In max pooling, you’re taking the largest pixel value. In average pooling, you take the average of all the pixel values at that spot in the image. (Actually, there’s a trend now towards using smaller filters or discarding pooling layers entirely. This is in response to an aggressive reduction in representation size.)
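Here is a NumPy sketch of both max and average pooling on a made-up 4×4 feature map. The function name `pool2d` and its `mode` argument are assumptions for this illustration, not a library API.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool windows of a 2D feature map, keeping the max or the mean of each."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            pooled[i, j] = window.max() if mode == "max" else window.mean()
    return pooled

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)
# [[4. 2.]
#  [2. 8.]]
print(avg_pooled)
# [[2.5  1.  ]
#  [1.25 6.5 ]]
```

Sixteen values go in and four come out, which is the 75% reduction you get from a 2×2 window with a stride of two.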

Want to look a little more at why you might want to choose max pooling and why you might prefer a stride of two pixels? Check out Dominik Scherer et al., Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition.

If you go here you can check out a really interesting 2D visualization of a convolutional layer. Draw a number in the box on the left-hand side of the screen and then really go through the output. You can see the convolved and pooled layers as well as the guesses. Try hovering over a single pixel so you can see where the filter was applied.

So now we have an input image, an applied convolutional layer, and an applied pooling layer.

Let’s visualize the output of the pooling layer!

We were here:

The pooling layer takes as input the feature maps pictured above and reduces the dimensionality of those maps. It does this by constructing a new, smaller image of only the maximum (brightest) values in a given kernel area.

See how the image has changed size?

Cool, right?


Flattening

This is a pretty simple step. You flatten the pooled feature maps into a sequential column of numbers (a long vector). This allows that information to become the input layer of an artificial neural network for further processing.
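In NumPy, flattening is a single `reshape`. The two 2×2 pooled maps below are made-up placeholder values:

```python
import numpy as np

# Two made-up pooled feature maps, each 2x2
pooled_maps = np.arange(8).reshape(2, 2, 2)

# Flatten everything into one long vector
vector = pooled_maps.reshape(-1)
print(vector.shape)   # (8,)
print(vector)         # [0 1 2 3 4 5 6 7]
```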

Fully connected layer

At this step, we add an artificial neural network to our convolutional neural network. (Not sure about artificial neural networks? You can learn about them here!)

The main purpose of the artificial neural network is to combine our features into more attributes. These combined attributes predict the classes with greater accuracy than the individual features would.

At this step, the error is calculated and then backpropagated. The weights and feature detectors are adjusted to help optimize the performance of the model. Then the process happens again and again and again. This is how our network trains on the data!

How do the output neurons work when there’s more than one?

First, we have to understand what weights to apply to the synapses that connect to the output. We want to know which of the previous neurons are important for the output.

If, for example, you have two output classes, one for a cat and one for a dog, a cat neuron that reads “0” is absolutely certain that the feature does not belong to a cat. A neuron that reads “1” is absolutely certain that the feature belongs to a cat. In the final fully connected layer, the neurons will read values between 0 and 1. This signifies different levels of certainty. A value of 0.9 would signify a certainty of 90%.

When a cat neuron is certain that one of its features has been identified, it signals that the image is a cat. It says the mathematical equivalent of, “These are my neurons! I should be triggered!” If this happens many times, the network learns that when certain features fire up, the image is a cat.

Photo by Linnea Sandbakk on Unsplash

Through lots of iterations, the cat neuron learns that when certain features fire up, the image is a cat. The dog neuron learns that when certain other features fire up, the image is a dog. The dog neuron learns, for example, that the “big wet nose” neuron and the “floppy ear” neuron contribute with a great deal of certainty to the dog neuron. It gives greater weight to the “big wet nose” neuron and the “floppy ear” neuron. The dog neuron learns to more or less ignore the “whiskers” neuron and the “cat-iris” neuron. The cat neuron learns to give greater weight to neurons like “whiskers” and “cat-iris.”

(Okay, there aren’t actually “big wet nose” or “whiskers” neurons. But the detected features are distinctive of the specific class.)

Once the network has been trained, you can pass in an image and the neural network will be able to determine the image class probability for that image with a great deal of certainty.

The fully connected layer is a traditional Multi-Layer Perceptron. It uses a classifier in the output layer. The classifier is usually a softmax activation function. Fully connected means every neuron in the previous layer connects to every neuron in the next layer. What’s the purpose of this layer? To use the features from the output of the previous layer to classify the input image based on the training data.

Once your network is up and running you can see, for example, that you have a 95% probability that your image is a dog and a 5% probability that your image is a cat.

Photo by Alexas_Fotos on Pixabay

Why do these numbers add up to 1.0? (0.95 + 0.05)

There isn’t anything that says that these two outputs are connected to each other. What is it that makes them relate to each other? Essentially, they wouldn’t, but they do when we introduce the softmax function. This brings the values between 0 and 1 and makes them add up to 1 (100%). (You can read all about this on Wikipedia.) The softmax function takes a vector of scores and squashes it to a vector of values between 0 and 1 that add up to 1.
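Here is a NumPy sketch of softmax. The raw scores for “dog” and “cat” are made up so that the result lands close to the 95% and 5% from the example above.

```python
import numpy as np

def softmax(scores):
    """Squash a vector of raw scores into probabilities that add up to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, -1.0])          # hypothetical raw scores: [dog, cat]
probs = softmax(scores)
print(probs)         # roughly [0.95 0.05]
print(probs.sum())
```

Because every output is divided by the same total, raising one probability automatically lowers the others. That shared denominator is what ties the outputs together.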

After you apply a softmax function, you can apply the loss function. Cross entropy often goes hand in hand with softmax. We want to minimize the loss function so we can maximize the performance of our network.

At the beginning of training, your output values would be tiny. With a loss function such as mean squared error, the gradient would be very low and it would be hard for the neural network to start adjusting in the right direction. That’s why you might choose cross entropy loss. Using cross entropy helps the network assess even a tiny error and get to the optimal state faster.
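For a single image, cross entropy is just the negative log of the probability the network gave to the correct class. A NumPy sketch with made-up probabilities, assuming class 0 is the right answer:

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -np.log(probs[true_class])

# Made-up softmax outputs for three different predictions; class 0 is correct
confident_right = cross_entropy(np.array([0.95, 0.05]), true_class=0)
unsure = cross_entropy(np.array([0.50, 0.50]), true_class=0)
confident_wrong = cross_entropy(np.array([0.05, 0.95]), true_class=0)

print(confident_right, unsure, confident_wrong)
```

The loss is close to zero when the network is confidently right and grows quickly when it is confidently wrong, so even small probabilities on the correct class produce a strong training signal.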

Now what?

At this point, everything is trained through forward- and backward propagation through many, many epochs. We wind up with a very well defined neural network where all the weights and features are trained. Now we have something that can recognize and classify images! (Not sure about forward propagation and backpropagation? Check out the absolute basics here!)
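A training loop in PyTorch might look roughly like the sketch below. Everything here is an assumption for illustration: the tiny network, the random 8×8 “images,” the random labels, the learning rate, and the 20 epochs. With random data the network can’t learn anything meaningful, but the forward pass, loss, backpropagation, and weight update are the same steps a real classifier goes through.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A made-up network: conv -> ReLU -> pool -> flatten -> fully connected
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),   # 8x8 input -> four 6x6 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                  # four 3x3 pooled maps
    nn.Flatten(),
    nn.Linear(4 * 3 * 3, 2),          # two output classes
)

images = torch.rand(16, 1, 8, 8)          # random stand-ins for real images
labels = torch.randint(0, 2, (16,))       # random stand-ins for real labels

criterion = nn.CrossEntropyLoss()         # applies softmax + cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(20):
    optimizer.zero_grad()                 # clear gradients from the last step
    scores = model(images)                # forward propagation
    loss = criterion(scores, labels)      # how wrong were we?
    loss.backward()                       # backpropagation
    optimizer.step()                      # adjust weights and filters
    losses.append(loss.item())

print(losses[0], losses[-1])
```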

So what just happened?

We started with an input image and applied multiple different filters to create feature maps. We applied the ReLU to increase non-linearity and applied a pooling layer to each feature map. (We did that to make sure we have spatial invariance in our images, to reduce the size of the images, and to avoid overfitting of the model to the data while still preserving the features we’re after.) We flattened all the pooled images into one long vector. We input the vector into our fully connected artificial neural network. This is where all the features were processed through the network. This gave us the final fully connected layer which provided the “voting” of the classes that we’re after. All of this was trained through forward propagation and backpropagation until we wound up with a well-defined neural network where the weights and feature detectors were trained.

Now we can recognize and classify images!

Photo by Lucas Sankey on Unsplash

You are well on your way to being able to do this:

(Do you want to do that? Here’s a clear and complete blueprint for creating an incredibly accurate image classifier with PyTorch! You can create an image classifier that can tell you with a huge degree of certainty what species of flower you’re looking at!)


If you do anything awesome with this information, let me know in the responses below or reach out on twitter @annebonnerdata!
