Convolutional Neural Networks in Python


Introduction to Convolutional Neural Networks in Python

Convolutional Neural Networks (CNNs) are not new to deep learning, but they have generated a lot of buzz over the past few years, mostly because of how they have revolutionized the field of computer vision.

Don’t panic. In this article, I will explain the basic background of Convolutional Neural Networks, show how they work, and build one from scratch in Python.

NOTE: We will build it using only NumPy.

I am assuming everyone has a basic knowledge of neural networks.

Ready? Let’s begin.

A classic use of Convolutional Neural Networks is image classification, that is, identifying which category a particular image belongs to. For example, deciding whether a given image shows a cat or a dog. It may seem like a simple task, so why use a CNN rather than an ordinary neural network?

It’s a natural question, and a good one to ask.

Reason 1: Images are Big

Images used for computer vision problems nowadays are often 224 x 224 or larger. Imagine building a neural network to process 224 x 224 color images. Color images have 3 channels (RGB), so that comes out to 224 x 224 x 3 = 150,528 input features. A hidden layer in such a network might have 1024 nodes, so we would have to train 150,528 x 1024 = over 150 million weights for the first layer alone. Our network would be huge and nearly impossible to train.
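The arithmetic above can be checked with a few lines of Python. This is only a sanity check of the numbers quoted in the text; the variable names are chosen for illustration.

```python
# Parameter count for a fully-connected first layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
hidden_nodes = 1024  # hidden layer size used in the text

input_features = height * width * channels
dense_weights = input_features * hidden_nodes

print(input_features)  # 150528
print(dense_weights)   # 154140672
```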

Keep in mind, too, that pixels are most useful in the context of their neighbors.

Objects in images are made up of small, localized features, like the circular iris of an eye or the square corner of a piece of paper. Doesn’t it seem wasteful for every node in the first hidden layer to look at every pixel?

Reason 2: Positions can Change 

If you train a model on dog images, you want it to detect a dog regardless of where it appears in the image. Suppose you train a network that works well on a certain dog image, and then feed it a slightly shifted version of the same image. The dog would not activate the same neurons, so the network would react completely differently.

We’ll soon see how Convolutional Neural Networks can help us with these problems.
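To make the shifting problem concrete, here is a minimal sketch. It assumes a made-up 8x8 image containing a short vertical bar, arbitrary fixed weights standing in for a single dense node, and a hypothetical helper convolve_valid that slides a 3x3 filter over the image (the sliding operation is explained later in this article). None of these names come from the article's own code.

```python
import numpy as np

def convolve_valid(image, kernel):
  # Slide the kernel over every position where it fully fits (no padding).
  kh, kw = kernel.shape
  h, w = image.shape
  out = np.zeros((h - kh + 1, w - kw + 1))
  for i in range(out.shape[0]):
    for j in range(out.shape[1]):
      out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
  return out

# Hypothetical 8x8 image with a short vertical bar, plus a shifted copy.
image = np.zeros((8, 8))
image[2:5, 2] = 1
shifted = np.roll(image, 2, axis=1)  # same bar, two pixels to the right

# A dense node ties one weight to each pixel position (arbitrary weights here).
dense_weights = np.arange(64).reshape(8, 8)
dense_a = np.sum(image * dense_weights)
dense_b = np.sum(shifted * dense_weights)

# A sliding filter sees the same local pattern wherever it occurs.
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
conv_a = convolve_valid(image, kernel).max()
conv_b = convolve_valid(shifted, kernel).max()

print(dense_a, dense_b)  # the two dense activations differ
print(conv_a, conv_b)    # the strongest filter responses match
```

The dense node reacts differently to the two images because its weights are tied to fixed pixel positions, while the strongest response of the sliding filter is the same in both cases.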


In this article, we’ll use the MNIST dataset, which poses a handwritten digit classification problem. It’s a simple task: given an image, classify it as a digit.

Each image in the MNIST dataset is 28×28 and contains a centered, grayscale digit.

Truth be told, a normal neural network would actually work just fine for this problem. You could treat each image as a 28 x 28 = 784-dimensional vector, feed that to a 784-dim input layer, stack a few hidden layers, and finish with an output layer of 10 nodes, 1 for each digit.
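As a rough sketch of that plain network, the shapes work out as follows. The hidden layer size of 64, the ReLU activation, and the random stand-in image are all assumptions made for illustration; the text only fixes the 784 inputs and the 10 outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one MNIST image: 28x28 pixel intensities in [0, 255].
image = rng.integers(0, 256, size=(28, 28))

x = image.flatten() / 255   # 784-dimensional input vector
hidden_size = 64            # assumed; the text does not fix a size

w1 = rng.standard_normal((784, hidden_size)) / 784
w2 = rng.standard_normal((hidden_size, 10)) / hidden_size

hidden = np.maximum(0, x @ w1)  # one hidden layer with ReLU (assumed)
scores = hidden @ w2            # one score per digit, 0 through 9

print(x.shape, scores.shape)    # (784,) (10,)
```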

This would only work because the MNIST dataset contains small images that are centered, so we wouldn’t run into the aforementioned issues of size or shifting. Keep in mind throughout the course of this post, however, that most real-world image classification problems aren’t this easy.


What are Convolutional Neural Networks in Python?

They’re basically just neural networks that use Convolutional layers, a.k.a. Conv layers, which are based on the mathematical operation of convolution. Conv layers consist of a set of filters, which you can think of as just 2d matrices of numbers. Here’s an example 3×3 filter:


We can use an input image and a filter to produce an output image by convolving the filter with the input image. This consists of

  1. Overlaying the filter on top of the image at some location.
  2. Performing element-wise multiplication between the values in the filter and their corresponding values in the image.
  3. Summing up all the element-wise products. This sum is the output value for the destination pixel in the output image.
  4. Repeating for all locations.

Consider this tiny 4×4 grayscale image and this 3×3 filter:

The numbers in the image represent pixel intensities, where 0 is black and 255 is white. We’ll convolve the input image and the filter to produce a 2×2 output image:


To start, let’s overlay our filter in the top left corner of the image:

Next, we perform element-wise multiplication between the overlapping image values and filter values. Here are the results, starting from the top left corner and going right, then down:

Image value    Filter value    Result

Next, we sum up all the results:

62 - 33 = 29

Finally, we place our result in the destination pixel of our output image. Since our filter is overlaid in the top left corner of the input image, our destination pixel is the top-left pixel of the output image:

We do the same thing to generate the rest of the output image:
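The whole procedure can be sketched in a few lines of NumPy. The pixel values below are illustrative assumptions, since the figures are not reproduced here; they were chosen so that the top-left result matches the 62 - 33 = 29 computed above. The filter is assumed to be the 3×3 vertical Sobel filter.

```python
import numpy as np

# Illustrative 4x4 grayscale image (assumed values; the figure is not shown).
image = np.array([
  [ 0, 50,  0, 29],
  [ 0, 80, 31,  2],
  [33, 90,  0, 75],
  [ 0,  9,  0, 95],
])

# Vertical Sobel filter (assumed to be the example filter).
kernel = np.array([
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1],
])

output = np.zeros((2, 2), dtype=int)
for i in range(2):
  for j in range(2):
    region = image[i:i + 3, j:j + 3]        # step 1: overlay the filter
    output[i, j] = np.sum(region * kernel)  # steps 2 and 3: multiply, then sum

print(output.tolist())  # [[29, -192], [-35, -22]]
```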

How is this useful?

Let’s zoom out for a second and see this at a higher level. What does convolving an image with a filter do? We can start by using the example 3×3 filter we’ve been using, which is commonly known as the vertical Sobel filter.


Similarly, there’s also a horizontal Sobel filter:


See what’s happening? Sobel filters are edge-detectors. The vertical Sobel filter detects vertical edges, and the horizontal Sobel filter detects horizontal edges. The output images are now easily interpreted: a bright pixel (one that has a high value) in the output image indicates that there’s a strong edge around there in the original image.
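A small synthetic example shows the same effect numerically. The image below is made up (dark left half, bright right half), convolve_valid is a hypothetical helper written for this sketch, and the sign convention of the horizontal filter is one common choice among several.

```python
import numpy as np

def convolve_valid(image, kernel):
  # Slide the kernel over every position where it fully fits (no padding).
  kh, kw = kernel.shape
  h, w = image.shape
  out = np.zeros((h - kh + 1, w - kw + 1))
  for i in range(out.shape[0]):
    for j in range(out.shape[1]):
      out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
  return out

vertical_sobel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
horizontal_sobel = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])

# Synthetic 6x6 image with a single vertical edge down the middle.
image = np.zeros((6, 6))
image[:, 3:] = 255

vertical_response = convolve_valid(image, vertical_sobel)
horizontal_response = convolve_valid(image, horizontal_sobel)

print(vertical_response[0].tolist())      # [0.0, 1020.0, 1020.0, 0.0]
print(np.abs(horizontal_response).max())  # 0.0
```

The vertical filter responds strongly in the columns that straddle the edge, while the horizontal filter sees nothing, because every row of this image is identical.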

Can you see why an edge-detected image might be more useful than the raw image? Think back to our MNIST handwritten digit classification problem for a second. A CNN trained on MNIST might look for the digit 1, for example, by using an edge-detection filter and checking for two prominent vertical edges near the center of the image. In general, convolution helps us look for specific localized image features (like edges) that we can use later in the network.


Remember convolving a 4×4 input image with a 3×3 filter earlier to produce a 2×2 output image? Oftentimes, we prefer to have the output image be the same size as the input image. To do this, we add zeros around the image so we can overlay the filter in more places. A 3×3 filter requires 1 pixel of padding:

This is called “same” padding, since the input and output have the same dimensions. Not using any padding, which is what we have been doing and will continue to do for this post, is sometimes referred to as “valid” padding.
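The difference between the two padding modes can be sketched with np.pad. The 4×4 input values and the helper convolve_valid are assumptions made for this illustration only.

```python
import numpy as np

def convolve_valid(image, kernel):
  # Slide the kernel over every position where it fully fits (no padding).
  kh, kw = kernel.shape
  h, w = image.shape
  out = np.zeros((h - kh + 1, w - kw + 1))
  for i in range(out.shape[0]):
    for j in range(out.shape[1]):
      out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
  return out

image = np.arange(16).reshape(4, 4)  # arbitrary 4x4 input
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

# "Same" padding for a 3x3 filter: add one ring of zeros around the image.
padded = np.pad(image, 1, mode='constant')

valid_out = convolve_valid(image, kernel)
same_out = convolve_valid(padded, kernel)

print(padded.shape)     # (6, 6)
print(valid_out.shape)  # (2, 2)
print(same_out.shape)   # (4, 4)
```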

Conv Layers

Now that we understand how image convolution works and why it’s useful, let’s see how it’s actually used in CNNs. As mentioned before, CNNs include ‘conv layers’ that use a set of filters to turn input images into output images. A conv layer’s primary parameter is the number of filters it has.

For our MNIST CNN, we’ll use a small conv layer with 8 filters as the initial layer in our network. This means it will turn the 28×28 input image into a 26x26x8 output volume:

Each of the 8 filters in the conv layer produces a 26×26 output, so stacked together they make up a 26x26x8 volume. All of this happens because of 3 × 3 (filter size) × 8 (number of filters) = only 72 weights!
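Both numbers follow from simple formulas, sketched below. The two helper functions are hypothetical names introduced here, and they assume square filters, valid padding, and no bias terms.

```python
def conv_output_shape(height, width, num_filters, filter_size=3):
  # With valid padding, a filter fits in (n - filter_size + 1) positions per axis.
  return (height - filter_size + 1, width - filter_size + 1, num_filters)

def conv_weight_count(num_filters, filter_size=3):
  # Each filter has filter_size x filter_size weights.
  return filter_size * filter_size * num_filters

print(conv_output_shape(28, 28, 8))  # (26, 26, 8)
print(conv_weight_count(8))          # 72
```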

Convolutional Tutorial

Now it’s time to put what we have learned into code! We’ll implement a conv layer’s feedforward portion, which takes care of convolving filters with an input image to produce an output volume.

For simplicity, we’ll assume filters are always 3×3.

Let’s get our hands dirty.

import numpy as np

class Conv3x3:
  # A Convolution layer using 3x3 filters.
  def __init__(self, num_filters):
    self.num_filters = num_filters

    # filters is a 3d array with dimensions (num_filters, 3, 3)
    # We divide by 9 to reduce the variance of our initial values
    self.filters = np.random.randn(num_filters, 3, 3) / 9

The Conv3x3 class takes only one argument: the number of filters. In the constructor, we store the number of filters and initialize a random filters array using NumPy’s randn() method.

Now for the actual convolution:

class Conv3x3:
  # ...

  def iterate_regions(self, image):
    '''
    Generates all possible 3x3 image regions using valid padding.
    - image is a 2d numpy array
    '''
    h, w = image.shape

    for i in range(h - 2):
      for j in range(w - 2):
        im_region = image[i:(i + 3), j:(j + 3)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the conv layer using the given input.
    Returns a 3d numpy array with dimensions (h, w, num_filters).
    - input is a 2d numpy array
    '''
    h, w = input.shape
    output = np.zeros((h - 2, w - 2, self.num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.sum(im_region * self.filters, axis=(1,2))
    return output

iterate_regions() is a helper generator method that yields all valid 3×3 image regions for us. This will be useful for implementing the backwards portion of this class later on.

The line of code that actually performs the convolutions is the assignment to output[i, j] inside forward(). Let’s break it down:

  • We have im_region, a 3×3 array containing the relevant image region.
  • We have self.filters, a 3d array.
  • We do im_region * self.filters, which uses numpy’s broadcasting feature to element-wise multiply the two arrays. The result is a 3d array with the same dimension as self.filters.
  • We np.sum() the result of the previous step using axis=(1, 2), which produces a 1d array of length num_filters where each element contains the convolution result for the corresponding filter.
  • We assign the result to output[i, j], which contains convolution results for pixel (i, j) in the output.
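The broadcasting step can be checked in isolation. The sketch below uses random stand-ins for im_region and the filters, and compares the broadcast version against an explicit loop over filters.

```python
import numpy as np

rng = np.random.default_rng(0)
im_region = rng.standard_normal((3, 3))  # stand-in for one image region
filters = rng.standard_normal((8, 3, 3)) # stand-in for self.filters

products = im_region * filters           # (3, 3) broadcast against (8, 3, 3)
per_filter = np.sum(products, axis=(1, 2))

print(products.shape)    # (8, 3, 3)
print(per_filter.shape)  # (8,)

# Same result computed one filter at a time.
looped = np.array([np.sum(im_region * f) for f in filters])
print(np.allclose(per_filter, looped))  # True
```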

The sequence above is performed for each pixel in the output until we obtain our final output volume! Let’s give our code a test run:

import mnist
from conv import Conv3x3

# The mnist package handles the MNIST dataset for us!
# Learn more at
train_images = mnist.train_images()
train_labels = mnist.train_labels()

conv = Conv3x3(8)
output = conv.forward(train_images[0])
print(output.shape) # (26, 26, 8)


Neighboring pixels in images tend to have similar values, so conv layers will typically also produce similar values for neighboring pixels in outputs. As a result, much of the information contained in a conv layer’s output is redundant. For example, if we use an edge-detecting filter and find a strong edge at a certain location, chances are that we will also find relatively strong edges at locations 1 pixel shifted from the original one. However, these are all the same edge! We’re not finding anything new.

Pooling layers solve this problem. All they do is reduce the size of the input they’re given by (you guessed it) pooling values together in the input. The pooling is usually done by a simple operation like max, min, or average. Here’s an example of a Max Pooling layer with a pooling size of 2:

To perform max pooling, we traverse the input image in 2×2 blocks (because pool size = 2) and put the max value into the output image at the corresponding pixel. That’s it!

Pooling divides the input’s width and height by the pool size. For our MNIST CNN, we’ll place a Max Pooling layer with a pool size of 2 right after our initial conv layer, turning its 26x26x8 output into a 13x13x8 volume.
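Here is a minimal sketch of max pooling with a pool size of 2 on a single-channel image. The 4×4 values are made up for illustration, and the reshape trick is just one compact way to group pixels into non-overlapping 2×2 blocks; it is not the implementation used in this article.

```python
import numpy as np

# Illustrative 4x4 input (assumed values).
image = np.array([
  [ 0, 50,  0, 29],
  [ 0, 80, 31,  2],
  [33, 90,  0, 75],
  [ 0,  9,  0, 95],
])

h, w = image.shape
# blocks[i, a, j, b] == image[2 * i + a, 2 * j + b]
blocks = image.reshape(h // 2, 2, w // 2, 2)
pooled = blocks.max(axis=(1, 3))  # max over each 2x2 block

print(pooled.tolist())  # [[80, 31], [90, 95]]
```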

Implementing Pooling

We’ll implement a MaxPool2 class:

import numpy as np
class MaxPool2:
  # A Max Pooling layer using a pool size of 2.

  def iterate_regions(self, image):
    '''
    Generates non-overlapping 2x2 image regions to pool over.
    - image is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the maxpool layer using the given input.
    Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    - input is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    h, w, num_filters = input.shape
    output = np.zeros((h // 2, w // 2, num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.amax(im_region, axis=(0, 1))

    return output


import mnist
from conv import Conv3x3
from maxpool import MaxPool2

# The mnist package handles the MNIST dataset for us!
# Learn more at
train_images = mnist.train_images()
train_labels = mnist.train_labels()

conv = Conv3x3(8)
pool = MaxPool2()

output = conv.forward(train_images[0])
output = pool.forward(output)
print(output.shape) # (13, 13, 8)

Softmax

To complete our CNN, we need to give it the ability to actually make predictions. We’ll do that by using the standard final layer for a multiclass classification problem: the Softmax layer, a fully-connected (dense) layer that uses the softmax function as its activation.

After the softmax transformation is applied, the digit represented by the node with the highest probability will be the output of the CNN!
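As a quick illustration, the standard softmax function maps each score x_i to exp(x_i) divided by the sum of exp(x_j) over all scores. The sketch below applies it to four made-up scores; subtracting the maximum first is a common numerical-stability trick and is an addition of this sketch, not something the article relies on.

```python
import numpy as np

def softmax(scores):
  # Subtracting the max does not change the result but avoids overflow in exp.
  shifted = scores - np.max(scores)
  exp = np.exp(shifted)
  return exp / np.sum(exp)

scores = np.array([-1.0, 0.0, 3.0, 5.0])  # made-up raw outputs
probs = softmax(scores)

print(probs.round(3))    # approximately 0.002, 0.006, 0.118, 0.874
print(np.argmax(probs))  # 3
```

The largest score still wins, but the outputs are now positive and sum to 1, so they can be read as probabilities.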

Cross Entropy Loss

You might be wondering: why bother transforming the outputs into probabilities? We don’t actually need softmax to predict a digit; we could just pick the digit with the highest output from the network.

What softmax really does is help us quantify how sure we are of our prediction. More specifically, using softmax lets us use cross-entropy loss, which takes into account how sure we are of each prediction. Here’s how we calculate cross-entropy loss:

$L = -\ln(p_c)$

where c is the correct class (in our case, the correct digit), p_c is the predicted probability for class c, and ln is the natural log.
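A short numeric example shows how the loss rewards confident correct predictions. The two probability vectors are made up, and cross_entropy is a hypothetical helper name used only in this sketch.

```python
import numpy as np

def cross_entropy(probs, correct_class):
  # L = -ln(p_c), where p_c is the probability assigned to the correct class.
  return -np.log(probs[correct_class])

confident = np.array([0.05, 0.9, 0.05])  # 90% on the correct class (index 1)
unsure = np.array([0.4, 0.3, 0.3])       # only 30% on the correct class

loss_confident = cross_entropy(confident, 1)
loss_unsure = cross_entropy(unsure, 1)

print(round(loss_confident, 3))  # 0.105
print(round(loss_unsure, 3))     # 1.204
```

The lower the probability assigned to the correct class, the higher the loss; a perfect prediction (p_c = 1) would give a loss of 0.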

Softmax Tutorial

import numpy as np

class Softmax:
  # A standard fully-connected layer with softmax activation.

  def __init__(self, input_len, nodes):
    # We divide by input_len to reduce the variance of our initial values
    self.weights = np.random.randn(input_len, nodes) / input_len
    self.biases = np.zeros(nodes)

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    input = input.flatten()

    input_len, nodes = self.weights.shape

    totals = np.dot(input, self.weights) + self.biases
    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)

Complete CNN

import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax

# We only use the first 1k testing examples (out of 10k total)
# in the interest of time. Feel free to change this if you want.
test_images = mnist.test_images()[:1000]
test_labels = mnist.test_labels()[:1000]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

print('MNIST CNN initialized!')

loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(test_images, test_labels)):
  # Do a forward pass.
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

  # Print stats every 100 steps.
  if i % 100 == 99:
    print(
      '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
      (i + 1, loss / 100, num_correct)
    )
    loss = 0
    num_correct = 0

