Artificial neural networks are not conscious
Artificial neural networks (ANNs) are advertised as the holy grail of modern AI: the building blocks that are supposedly bringing machines one step closer to achieving consciousness (spoiler alert: they aren't... yet). While they are certainly capable of many great things, there are misconceptions about what they are and the functions they serve. This two-part article will attempt to address these misconceptions and argue that, in essence, some ANN architectures can best be understood as auto-learning compression algorithms.
The upcoming discussion is not meant to undermine the current state of ANNs, or the great feats they are capable of helping us achieve - today, ANNs are undoubtedly a basic building block for future AI. Instead, the discussion is intended to demystify the technology, and draw parallels to help readers make sense of the inner workings of ANNs. This article will provide a brief overview of ANNs, compression algorithms, and how the two are related. Ready? Let’s dive in.
What are ANNs?
Put in simple terms, ANNs are one of the techniques used in machine learning to let computers learn from data. Their functionality is loosely modeled on the inner workings of the human brain. ANNs consist of basic building blocks called artificial neurons (perceptrons) that emulate the neurons of the brain: each fires, or passes a signal on, only when its input is strong enough to cross a threshold. These neurons build connections to one another in order to form more complex patterns, and connections are strengthened when neurons fire together. Make no mistake, though: although these technologies emulate some of the core functions of the human brain, they are not currently any closer to sentience because of it.
Artificial neurons are very simple mathematical functions: according to the Universal Approximation Theorem, any continuous function can be approximated through the combination of many simpler functions in a suitable arrangement. In our case, an ANN can approximate complex functions through its simple artificial neurons. While this is an oversimplification of what's actually going on inside an ANN, it provides a foundational understanding of the technology for the purposes of this article.
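The threshold behavior described above can be sketched in a few lines. This is a hypothetical, minimal perceptron (the weights and threshold are illustrative, not from any trained model):

```python
# Minimal sketch of a single artificial neuron (perceptron):
# it fires (outputs 1) only when the weighted sum of its inputs
# exceeds a threshold, and stays silent (outputs 0) otherwise.
def neuron(inputs, weights, threshold):
    signal = sum(i * w for i, w in zip(inputs, weights))
    return 1 if signal > threshold else 0

# A strong combined signal crosses the threshold and fires.
print(neuron([1.0, 0.5], [0.8, 0.4], threshold=0.9))  # -> 1
# A weak combined signal does not.
print(neuron([0.2, 0.1], [0.8, 0.4], threshold=0.9))  # -> 0
```

Stacking many of these simple units, and learning their weights from data, is what lets an ANN approximate far more complex functions.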
What is compression?
Now that we’ve established a basic understanding of ANNs, let’s do the same for compression. Compression refers to the process of reducing the size of something to be able to store it or transfer it over a network efficiently. Compression algorithms are a set of mathematical equations that are meant to transform any given data into the smallest size possible. This is usually done by retaining only the most important and distinctive features of the data, and using them at a later point to reconstruct the full data accordingly. Compression can be carried out in two primary ways depending on the algorithm used and the data itself. These are:
Lossless compression: like PNG for images, and ZIP for general-purpose compression;
Lossy compression: like JPEG for images, and H264 for videos.
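To make the lossless case concrete, here is a short round-trip using Python's standard-library zlib module (the same DEFLATE algorithm used inside ZIP and PNG); the data here is made up for illustration:

```python
import zlib

# Lossless compression round-trip: the decompressed bytes are
# bit-for-bit identical to the original. Repetitive data like
# this compresses especially well.
original = b"ABABABAB" * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

assert restored == original                   # nothing was lost
print(len(original), "->", len(compressed))   # far fewer bytes
```

A lossy algorithm like JPEG, by contrast, deliberately discards some information, so the round-trip never reproduces the original exactly.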
Getting into the inner workings of each compression algorithm would require a series of articles on its own, so for now let's proceed with a very simple example that’s relevant to machine learning.
Two slightly different fictional datasets are represented below in Figure 1 and Figure 2:
In both cases, we can see that the dataset is described by a quadratic equation (the red curve overlaying the blue points). A data frame containing a million observations of this dataset is 23 MB. The quadratic equation describing this dataset in Python, on the other hand, is 56 bytes - a compression ratio on the order of 400,000 (for the sake of keeping the argument simple, we're ignoring the fact that this needs dependencies to run). In this instance, restoring the original data would require 1) the quadratic function describing the curve, and 2) the start and end values of the interval. While the curve can restore the data points through a set of simple steps, those steps ultimately produce different outcomes for the two datasets: for the dataset on the left, the data points can be restored fully, resulting in lossless compression; for the dataset on the right, some data points will be lost upon reconstruction (the scattered blue points), resulting in lossy compression.
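The lossless case above can be sketched with NumPy. The coefficients and interval below are illustrative stand-ins for the fictional dataset, not the actual figures from the article:

```python
import numpy as np

# Illustrative sketch: a noise-free quadratic dataset can be
# "compressed" down to its three polynomial coefficients plus
# the sampling interval.
x = np.linspace(0, 10, 1_000_000)
y = 2.0 * x**2 - 3.0 * x + 1.0        # the "dataset" (no noise)

coeffs = np.polyfit(x, y, deg=2)      # recovers [2, -3, 1]
y_restored = np.polyval(coeffs, x)    # reconstruct from the curve

print(np.allclose(y, y_restored))     # lossless for noise-free data
print(y.nbytes, "bytes of data vs", coeffs.nbytes, "bytes of coefficients")
```

With noise added to `y`, the fitted curve would no longer pass through every point, and the same procedure would become a lossy compression, exactly as with the second dataset.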
So, what makes a good compression algorithm?
A good compression algorithm can be judged by four aspects:
Speed: how fast it transforms the data into its compressed form;
Compression ratio: how much reduction in size the algorithm can achieve;
Lossiness: how much information is lost in the process;
Objective fit: how well the result serves the objective function.
Let’s explore the last aspect, objective functions, in more detail, specifically with lossy compression. People usually think of compression as something that makes data smaller, but few know what actually happens during the process. A good example is visual compression algorithms that are designed to trick your senses, like JPEG for images or H264 for video. But how exactly do they do that? Our eyes are more sensitive to brightness than to color because they have around 120 million rod cells responsible for detecting brightness, but only about 6 million cone cells for detecting color. This makes tricking our visual senses with respect to color quite easy.
In the image below, squares A and B have the same color, but our brain is so biased towards brightness that it sees them as two different colors; we can only spot the similarity once the squares are connected and the influence of interpolation is no longer a factor. Visual compression algorithms exploit this by sampling color much less densely than brightness. Images are made up of luma (the light component) and chroma (the color components); visual compression algorithms sample chroma less than they sample luma. For example, for every 4 pixels sampled in luma, only 1 pixel is sampled in chroma. This is what’s referred to as chroma subsampling.
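A rough sketch of that 4-to-1 scheme (one chroma sample per 2x2 pixel block, in the spirit of 4:2:0 subsampling) with NumPy; the pixel values are synthetic:

```python
import numpy as np

# Sketch of 2x2-block chroma subsampling: keep luma (brightness)
# at full resolution, but store only one chroma (color) sample
# per 2x2 block of pixels.
h, w = 4, 4
luma = np.arange(h * w, dtype=np.uint8).reshape(h, w)          # full resolution
chroma = np.arange(h * w, dtype=np.uint8).reshape(h, w) * 10   # full resolution

chroma_sub = chroma[::2, ::2]          # keep 1 sample per 2x2 block

# On decode, repeat each stored chroma sample over its 2x2 block.
chroma_up = chroma_sub.repeat(2, axis=0).repeat(2, axis=1)

print(luma.size, chroma_sub.size)      # 16 luma samples, only 4 chroma samples
```

The reconstructed chroma is only an approximation of the original, yet because our eyes are far less sensitive to color than to brightness, the difference usually goes unnoticed.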
So if our end goal is to achieve a specific objective with the data (as is the case with visual compression) and not just keep the entire data itself, lossy compression can do wonders because then our only concern is keeping the minimal amount of features that serve the objective. This objective can be anything from reconstructing the full image, to tricking human senses to finding the smallest set of features that distinctively represent observations from a dataset.
The notion of compressing data from a high-dimensional vector space into a low-dimensional feature vector is exactly what many neural network architectures do. These distinctive feature vectors can then be used for all sorts of applications: reconstructing the full image, as in ordinary image compression; creating a super-resolution version of an image for applications like gaming; computer vision and natural language processing, where the low-dimensional vectors are used directly to compare data similarity; or serving as input to all sorts of classifiers and reinforcement learning algorithms.
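As a stand-in for what such networks learn, here is a minimal sketch that projects data into a low-dimensional feature space and reconstructs it using PCA via the SVD; the data and the choice of 8 dimensions are purely illustrative, and a trained autoencoder would learn a (generally nonlinear) version of the same idea:

```python
import numpy as np

# Sketch: compress 50-dimensional observations into 8-dimensional
# feature vectors via PCA, then reconstruct them (lossily).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))     # 200 observations, 50 features each

mean = data.mean(axis=0)
centered = data - mean
# Principal directions come from the SVD of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

k = 8                                  # keep only 8 feature dimensions
codes = centered @ vt[:k].T            # compressed feature vectors
restored = codes @ vt[:k] + mean       # lossy reconstruction

print(codes.shape)                     # (200, 8): far fewer numbers per row
```

The 8-number codes play the same role as the feature vectors described above: small enough to store and compare cheaply, yet retaining the most distinctive structure of each observation.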
This article acted as an introduction to ANNs and a primer on compression algorithms. Part 2 of this article will explore the relationship between ANNs and compression in more detail, and explain the role that machine learning plays in the process. It will make the argument that - in essence - ANNs are nothing but fancy compression algorithms.