Unraveling Convolutions

May 13, 2024

Everyone knows and talks about the curse of dimensionality, but few mention the blessing of dimensionality. It makes sense that increasing the dimension of the data rapidly increases the search space, sometimes making it intractable to find a good solution. The other side of the coin is that high dimensionality can leave a lot of room for simplification of expected geometries and structure. If we consider visual data, which is a very high dimensional input, we can see that it is structured. Pixels that are near each other are correlated. Moreover, there are different spatial and symmetric properties present in images.

We might try to use a simple neural network (multilayer perceptron) to learn from visual data, but the size of the first layer's weight matrix is tied to the input dimension so that the matrix multiplication is well defined, which means the number of weights grows with the number of pixels. Additionally, a simple network does not naturally model the spatial and symmetric properties inherent in visual data. As a result, we have two related issues:

1. We've resigned ourselves to an incredibly large search space due to the high dimension of the input and the weights. Converging on a solution may be too difficult or take too long.
2. The neural network will have to do the painstaking task of learning from many image examples only to find that pixels are correlated with nearby pixels, which we already knew.
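
To put rough numbers on the first issue, here is a back-of-the-envelope sketch. The image size, hidden width, kernel size, and filter count are all assumptions picked for illustration; none of them come from this post.

```python
# Hypothetical sizes chosen only for illustration.
height, width, channels = 224, 224, 3   # assumed RGB image size
hidden_units = 1000                      # assumed width of the first dense layer

# A dense (fully connected) layer needs one weight per (input value, hidden unit)
# pair, plus one bias per hidden unit.
inputs = height * width * channels
dense_params = inputs * hidden_units + hidden_units

# A convolutional layer reuses the same small kernel at every position, so its
# parameter count does not depend on the image height or width.
kernel_size, filters = 3, 64             # assumed 3x3 kernels and 64 filters
conv_params = kernel_size * kernel_size * channels * filters + filters

print(f"dense layer parameters: {dense_params:,}")  # 150,529,000
print(f"conv layer parameters:  {conv_params:,}")   # 1,792
```

Under these assumed sizes, the dense layer has roughly five orders of magnitude more parameters to search over than the convolutional layer.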

This is where convolutional neural networks come in, addressing both of the issues listed above. To get a better understanding, we can consider a simple example. Let's say we have a visual input in the form of a 3x3 matrix:

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} .$

The natural question is: how can we use what we know about visual data to simplify these image inputs? Clearly, we want to filter the pixels somehow, and one obvious way is max or average pooling: we group neighboring pixels into a single value by taking their maximum or average. This downsamples and simplifies our image input, but we may lose too much information with this filtering. We may want to filter in a more intelligent manner, such as emphasizing or de-emphasizing specific features in the image. That requires transforming the image data in some way, which we can do by adding or multiplying by another value. It's not immediately clear how to use addition to emphasize certain features while preserving some of the original features of the image, but multiplication provides us with an avenue.
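
As a rough illustration of pooling on the 3x3 input above, here is a minimal sketch in plain Python. The helper names `pool2d` and `mean` are hypothetical, and the 2x2 window with a stride of 1 is an assumption made so that the tiny input yields more than one output value (pooling layers more commonly use a stride equal to the window size).

```python
def pool2d(image, size, reduce_fn, stride=1):
    """Slide a size x size window over image and reduce each window to one value."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    output = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Collect every pixel covered by the window at position (i, j).
            window = [
                image[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(reduce_fn(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


image = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)
print(max_pooled)  # [[1, 1], [1, 1]]
print(avg_pooled)  # [[0.5, 0.25], [0.25, 0.5]]
```

With these settings, max pooling returns all 1s, so the diagonal pattern is no longer visible in the output, while average pooling keeps a fainter version of it.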

One such way is cross-correlation, where a filter, or kernel, is slid across the image and multiplied element by element with each patch it covers, with the products summed into a single value per position. Cross-correlation can be used to emphasize, detect, or otherwise transform specific features in our image matrix. The nice thing is that convolution is the same operation, except that the kernel is flipped along both axes first.
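
For reference, the two operations can be written side by side. The notation below is an assumption (image $I$, kernel $K$, zero-based indices, $\star$ for cross-correlation and $\ast$ for convolution) rather than notation used elsewhere in this post:

$(I \star K)[i, j] = \sum_{m} \sum_{n} I[i + m, j + n] \, K[m, n] ,$

$(I \ast K)[i, j] = \sum_{m} \sum_{n} I[i - m, j - n] \, K[m, n] .$

Replacing $m$ with $-m$ and $n$ with $-n$ in the second sum turns it into the first with $K[-m, -n]$ in place of $K[m, n]$, which is exactly the kernel flipped along both axes.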

Let's work through a simple problem to see how this works. Going back to the previous example,

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$

let's design a kernel to detect the feature of diagonal 1s. We can do this with the following matrix kernel:

$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} .$

Let's do a sliding dot product between our 3x3 input matrix and 2x2 kernel. Sliding from left to right and then top to bottom, we get

$[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] \cdot [\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] = 1 * 1 + 0 * 0 + 0 * 0 + 1 * 1 = 2,$

$[\begin{matrix} 0 & 0 \\ 1 & 0 \end{matrix}] \cdot [\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] = 0 * 1 + 0 * 0 + 1 * 0 + 0 * 1 = 0,$

$[\begin{matrix} 0 & 1 \\ 0 & 0 \end{matrix}] \cdot [\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] = 0 * 1 + 1 * 0 + 0 * 0 + 0 * 1 = 0,$

$[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] \cdot [\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] = 1 * 1 + 0 * 0 + 0 * 0 + 1 * 1 = 2.$

Collecting all the results from the sliding inner product, we form a new matrix called the feature map:

$\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} .$
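
The worked example can be reproduced with a short sketch in plain Python. The function names `cross_correlate2d`, `flip`, and `convolve2d` are hypothetical helpers written for this post, not a library API, and the sketch assumes no padding and a stride of 1, as in the example.

```python
def cross_correlate2d(image, kernel):
    """Sliding dot product of kernel over image (no padding, stride 1)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def flip(kernel):
    """Reverse the kernel along both axes (a 180-degree rotation)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Convolution expressed as cross-correlation with a flipped kernel."""
    return cross_correlate2d(image, flip(kernel))


image = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = cross_correlate2d(image, kernel)
print(feature_map)                               # [[2, 0], [0, 2]]
print(convolve2d(image, kernel) == feature_map)  # True
```

The two results agree here only because this particular kernel is unchanged by flipping it along both axes; for a kernel without that symmetry, convolution and cross-correlation generally give different feature maps.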

There are two main observations from this outcome. The first is the downsampling of the 3x3 input matrix into a 2x2 feature map, and the second is the emphasis, or increase in magnitude, of the diagonal 1s feature. For the first observation, we get the downsampling effect because we did not pad the border of the input matrix with 0s. In other words, convolution and cross-correlation do not inherently downsample; the output shrinks only when the input is insufficiently padded or the stride is greater than one. The second observation is that we have a new matrix that preserves and heightens the feature of diagonal 1s. This is useful because if we performed further compressions on this feature map, like max or average pooling, this information would have a better chance of being preserved through the operation. The idea is that we will lose information in the process of reducing the dimension of our input data, so we should emphasize the features that we care about to ensure some representation of them survives all the reductions.
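
The effect of padding and stride on the output size can be sketched as follows. The helpers `output_size`, `pad2d`, and `cross_correlate2d` are hypothetical, the kernel is assumed to be square, and the 3x3 diagonal kernel is an extra example that does not appear in the walkthrough above.

```python
def output_size(input_size, kernel_size, padding=0, stride=1):
    """Number of positions the kernel can occupy along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def pad2d(image, padding):
    """Return a copy of image surrounded by `padding` rows and columns of zeros."""
    width = len(image[0]) + 2 * padding
    top = [[0] * width for _ in range(padding)]
    middle = [[0] * padding + list(row) + [0] * padding for row in image]
    bottom = [[0] * width for _ in range(padding)]
    return top + middle + bottom


def cross_correlate2d(image, kernel, padding=0, stride=1):
    """Sliding dot product with zero padding and a stride (square kernel assumed)."""
    padded = pad2d(image, padding)
    k = len(kernel)
    rows = output_size(len(image), k, padding, stride)
    cols = output_size(len(image[0]), k, padding, stride)
    return [
        [
            sum(
                padded[i * stride + m][j * stride + n] * kernel[m][n]
                for m in range(k)
                for n in range(k)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


image = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
kernel_2x2 = [[1, 0], [0, 1]]
kernel_3x3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # hypothetical larger diagonal kernel

no_padding = cross_correlate2d(image, kernel_2x2)                    # 3x3 -> 2x2
same_size = cross_correlate2d(image, kernel_3x3, padding=1)          # 3x3 -> 3x3
strided = cross_correlate2d(image, kernel_2x2, padding=1, stride=2)  # 3x3 -> 2x2

for name, result in [
    ("no padding", no_padding),
    ("3x3 kernel, padding=1", same_size),
    ("padding=1, stride=2", strided),
]:
    print(f"{name}: {len(result)}x{len(result[0])}")
```

In this sketch, one ring of zero padding lets the 3x3 kernel return an output the same size as the input, while a stride of 2 shrinks the output even with padding.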

The other benefit is that contrasting and emphasizing specific features allows a learning algorithm to pick up on the structure of the data more easily. You can imagine that a multilayer perceptron (MLP) fed a feature map whose features are hard to tell apart will have a harder time learning a representation than one fed a feature map whose features are strongly contrasted with each other.

Naturally, the final question is how all this fits into convolutional neural networks and their ability to process visual data. As the example suggests, manually designing a kernel for every feature in an input matrix is tiresome, and for some features a hand-designed kernel may be too difficult to create. Instead, the convolutional layers in a neural network treat the kernel values as learnable parameters, which lets us automate kernel design and feature extraction by training the network on some objective. During training, the kernel parameters are updated until they settle on a design that simplifies high-dimensional data like images into useful intermediate representations, from which the MLP component of the network can learn a task representation that lowers the loss.
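
To make the idea of a learnable kernel concrete, here is a toy sketch that fits a 2x2 kernel by gradient descent so that its output on the 3x3 example matches the feature map from earlier. Everything about the setup is an assumption made for illustration: a single training image, a squared-error objective, a zero initialization, a hand-derived gradient, and a learning rate of 0.05. Real networks train on many images, start from random weights, and rely on a framework's automatic differentiation.

```python
def cross_correlate2d(image, kernel):
    """Sliding dot product of kernel over image (no padding, stride 1)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(k) for n in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]


image = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = [[2, 0], [0, 2]]          # feature map we want the layer to produce

kernel = [[0.0, 0.0], [0.0, 0.0]]  # assumed zero initialization
learning_rate = 0.05               # assumed step size

for step in range(200):
    output = cross_correlate2d(image, kernel)
    error = [[output[i][j] - target[i][j] for j in range(2)] for i in range(2)]
    # Gradient of the summed squared error with respect to each kernel entry:
    # d/dK[m][n] sum_ij (output_ij - target_ij)^2
    #   = sum_ij 2 * error_ij * image[i + m][j + n]
    for m in range(2):
        for n in range(2):
            grad = sum(
                2 * error[i][j] * image[i + m][j + n]
                for i in range(2)
                for j in range(2)
            )
            kernel[m][n] -= learning_rate * grad

print([[round(value, 3) for value in row] for row in kernel])
```

From this starting point the learned values end up close to the hand-designed diagonal kernel. A single image does not pin the kernel down uniquely, though, so a different initialization can settle on a different kernel that produces the same feature map.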

I hope this was useful for understanding how convolutions can be used to process high-dimensional data and allow neural networks to learn from it. My next post will tackle some of the core ideas behind transformers, while my previous post covered a simple example of a multilayer perceptron.