A Convolutional Neural Network is a special class of neural networks that are built with the ability to extract unique features from image data. For instance, they are used in face detection and recognition because they can identify complex features in image data.
Like other types of neural networks, CNNs consume numerical data.
Therefore, the images fed to these networks must be converted to a numerical representation. Since images are made up of pixels, they are converted into a numerical form that is passed to the CNN.
However, as we will discuss in the upcoming section, the entire numerical representation is not passed into the network. To understand how this works, let’s look at some of the steps involved in training a CNN.
Reducing the size of the numerical representation sent to the CNN is done via the convolution operation. This process is vital so that only features that are important in classifying an image are sent to the neural network. Apart from improving the accuracy of the network, this also ensures that minimal compute resources are used in training the network.
The result of the convolution operation is referred to as a feature map, convolved feature, or activation map. Applying a feature detector is what leads to a feature map. The feature detector is also known by other names such as kernel or filter.
The kernel is usually a 3 by 3 matrix. Performing an element-wise multiplication of the kernel with the input image and summing the values, outputs the feature map. This is done by sliding the kernel on the input image. The sliding happens in steps known as strides. The strides and the size of the kernel can be set manually when creating the CNN.
For example, given a 5 by 5 input, a kernel of 3 by 3 will output a 3 by 3 output feature map.
In the above operations, we have seen that the size of the feature map reduces as part of applying the convolution operation. What if you want the size of the feature map to be the same size as that of the input image? That is achieved through padding.
Padding involves increasing the size of the input image by “padding” the images with zeros. As a result, applying the filter to the image leads to a feature map of the same size as the input image.
Padding reduces the amount of information lost in the convolution operation. It also ensures that the edges of the images are factored more often in the convolution operation.
When building the CNN, you will have the option to define the type of padding you want or no padding at all. The common options here are valid or same. Valid means no padding will be applied while same means that padding will be applied so that the size of the feature map is the same as the size of the input image.
Here is what the element-wise multiplication of the above feature map and filter would look like.
A Rectified Linear Unit (ReLU) transformation is applied after every convolution operation to ensure non-linearity. ReLU is the most popular activation function but there are other activation functions to choose from.
After the transformation, all values below zero are returned as zero while the other values are returned as they are.
In this operation, the size of the feature map is reduced further. There are various pooling methods.
A common technique is max-pooling. The size of the pooling filter is usually a 2 by 2 matrix. In max-pooling, the 2 by 2 filter slides over the feature map and picks the largest value in a given box. This operation results in a pooled feature map.
Pooling forces the network to identify key features in the image irrespective of their location. The reduced image size also makes training the network faster.
Applying Dropout Regularization is a common practice in CNNs. This involves randomly dropping some nodes in layers so that they are not updated during back-propagation. This prevents overfitting.
Flattening involves transforming the pooled feature map into a single column that is passed to the fully connected layer. This is a common practice during the transition from convolutional layers to fully connected layers.
Fully connected layers
The flattened feature map is then passed to a fully connected layer. There might be several fully connected layers depending on the problem and the network. The last fully connected layer is responsible for outputting the prediction.
An activation function is used in the final layer depending on the type of problem. A sigmoid activation is used for binary classification, while a softmax activation function is used for multi-class image classification.
Having learned about CNNs, you might be wondering why we can’t use normal neural networks for image problems. Normal neural networks can’t extract complex features from images as CNNs can.
The ability of CNNs to extra features from images through the application of filters makes them a better fit for image problems. Also, feeding images directly into the feed-forward neural networks would be computationally expensive.
You can always design your CNNs from scratch. However, you can also take advantage of numerous architectures that have been developed and released publicly. Some of these networks also come with pre-trained models that you can easily adapt for your use case. Some popular architectures include:
You can start using these architectures through Keras applications. For example, here is how to use VGG19:
Let’s now build a food classification CNN using a food dataset. The dataset contains over a hundred thousand images belonging to 101 classes.
Loading the images
The first step is to download and extract the data.
Let’s take a look at one image from the dataset.
Generate a tf.data.Dataset
Next, load the images into a TensorFlow dataset. We’ll use 20% of the data for testing and the rest for training. Therefore, we have to create an ` ImageDataGenerator` for the training and testing set.
The training set generator will also specify a couple of image augmentation techniques, such as zooming and flipping the images. Augmentation prevents overfitting in the network.
With the generators at hand, the next step is to use them to load the food images from the base directory. While loading the images, we specify the target size of the images. All images will be resized to the specified size.
When loading the images we also specify:
- The directory where the images are being loaded from.
- The batch size, in this case, 32, meaning that the images will be loaded in batches of 32.
- The subset; whether it’s training or validation.
- The class mode as categorical since we have multiple classes. In the case of two classes, this would be binary.
The next step is to define the CNN model. The architecture of the network will look like the steps we discussed in the how CNNs work section. We’ll use the Keras Sequential API to define the network. CNNs are defined using the Conv2D layer.
The Conv2D layer expects:
- The number of filters to be applied, in this case, 32.
- The size of the kernel, in this case, 3 by 3.
- The size of the input images. 200 by 200 is the size of the image and 3 indicates that it’s a colored image.
- The activation function; usually ReLu.
In the network, we apply pooling with a filter of 2 by 2 and apply a Dropout layer to prevent overfitting.
The final layer has 101 units because there are 101 food classes. The activation function is softmax because it is a multiclass image classification problem.
Compiling the CNN model
We compile the network using categorical loss and accuracy because it involves multiple classes.
Training the CNN model
Let’s now train the CNN model. We apply the EarlyStopping callback in the training process so that training stops if the model doesn’t improve after a number of iterations. In this case 3 epochs.
The image dataset we are working with here is quite large. We, therefore, need to use GPUs to train this model. Let’s utilize free GPUs offered by Layer to train the model. To do that, we need to bundle all the code we have developed above into a single function. This function should return a model. In this case, a TensorFlow model.
To use GPUs to train the model, just decorate the function with a GPU environment. This is specified using the fabric decorator.
Training the model is done by passing the training function to the `layer.run` function. If you wish to train the model on your local infrastructure, call the `train()` function normally.
With the model ready, we can make predictions on a new image. This can be done in the following steps:
- Fetch the trained model from Layer.
- Loading the image of the same size as the one used in the training images.
- Convert the image into an array.
- Transform the numbers in the array to be between 0 and 1 by dividing by 255. The training images were of the same form.
- Expand the dimensions of the image to add a batch size of 1, since we are making predictions on a single image.
In this article, we have covered CNNs at length. Specifically, we talked about:
- What are CNNs?
- How CNNs work.
- CNN architectures.
- How to build a CNN for an image classification problem.
- How to train the model on remote GPUs using Layer
Originally published at https://www.kdnuggets.com.