- Mohamed Salem
Is Feature Engineering Dead?
In an earlier post, we discussed the inner workings of AI and its basic building block, the neural network. Neural networks are special: they are what makes deep learning possible. Unlike most other machine learning methods, neural networks are not programmed to abide by a specific sort of relationship, like a line or a curve (what we often call a functional form); rather, they are left to their own devices, to take whatever form matches the information they’re observing. The promise of deep learning is the self-educating machine, capable of learning just as we humans do: through observation. The greater implication is that deep learning models can discover deeper (duh!) relationships that are not immediately obvious, the kind of relationships that usually require an expert to point you towards.
This raises the question: does deep learning spell goodbye for human professional input? Will consultants be relegated to at-home armchairs, or is there still room for human input in the model education process?
Recall that a neural network is what we call a function approximation tool. What this means is that, when fed data, a neural network is able to approximate the mathematical relationships between elements of the dataset. The neural network does this essentially by trying out different combinations of relationships until it finds one that fits. You can think of this as a series of light switches, where you have 3 lightbulbs and 7 light switches, and you’re trying to figure out which combinations of on/off switches lead to which lighting combinations. A neural network tries many combinations until it is able to build an equation that describes the relationship between which light switches are on and which lightbulbs are on.
The defining feature of a neural network is its flexibility, which is reflected in the Universal Approximation Theorem (Hornik, Stinchcombe, and White 1989). The theorem, in short, proves that a sufficiently large neural network is capable of approximating any continuous function on a bounded domain to any desired accuracy, no matter how complex the function. So, for any relationship that can be expressed as such a function, which covers most relationships we care about in practice, the Universal Approximation Theorem leads to the conclusion that some neural network can represent the underlying relationship. Note that the theorem guarantees that such a network exists, not that training will find it from the data.
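For readers who prefer symbols, one common way of stating the theorem for a network with a single hidden layer is sketched below. The notation (f, K, N, the weights and the activation σ) is chosen here for illustration and is not taken from the original post; the precise conditions on σ vary between versions of the result.

```latex
% Universal approximation, single-hidden-layer form (illustrative notation).
% f : continuous target function on a compact set K \subset \mathbb{R}^n
% \sigma : a suitable non-polynomial activation (e.g. a sigmoid)
\[
\forall\, \varepsilon > 0 \;\; \exists\, N,\ \alpha_j, b_j \in \mathbb{R},\ w_j \in \mathbb{R}^n :
\qquad
\sup_{x \in K} \Bigl|\, f(x) - \sum_{j=1}^{N} \alpha_j\, \sigma\!\bigl(w_j^{\top} x + b_j\bigr) \Bigr| < \varepsilon .
\]
```

The statement says nothing about how large N must be or how to find the weights, which is exactly where the cost discussed later in this post comes from.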
Another cornerstone of machine learning and AI is feature engineering. We can loosely describe feature engineering as the art of transforming given features in the dataset into new features that are well adapted to the proposed model’s capabilities. The idea is that some kinds of machine learning models are restricted to identifying relationships of a certain type (e.g. linear relationships), and without feature engineering, these models would be incapable of identifying existing relationships that do not match their capabilities.
But when everything’s said and done, transformations are just functions, and by the aforementioned Universal Approximation Theorem, neural networks can approximate essentially any such function. So, do neural networks completely abolish the need for feature engineering?
To answer that question, we’ll introduce a new term: feature learning. Feature learning refers to a model’s ability to automatically engineer features. Feature learning becomes most important when the data fed into the model is highly variable in nature. This means that the relevant features tend to change from one use to the next (think pictures of cats versus pictures of mountains), so it would be tedious to engineer features for every new task or dataset the model faces: doing so requires both a significant time investment and domain knowledge, and carries an inescapable degree of subjectivity.
When data is fed to a model, depending on the model’s capabilities, the model attempts to define a relationship between the different elements of the dataset. A dataset with a large number of features represents a greater computational challenge for the model, as the model attempts to accommodate a higher dimensional space in its search for the optimum solution. As implied before, some models are also restricted, meaning that they are unable to accommodate interactions between the different features. Linear models, for instance, cannot consider non-linear transformations of the data unless explicitly introduced by feature engineering.
Deep learning methods are, once again, known for their remarkable flexibility, which makes this restriction a non-issue. For example, in classification tasks, neural networks can create decision boundaries that accommodate complex transformations in the data. However, with greater flexibility comes a greater number of choices to make and a greater number of combinations to explore. Given a raw dataset, a flexible neural net is free to explore all possible configurations and transformations in its search for the best function. Some transformations are easily apparent to domain experts (e.g. the famous inverted-U relationship between age and income, Fig. 1) but not to an automated algorithm. One look at an Income vs. Age dataset and an analyst familiar with the subject would recommend adding a squared age feature. The neural network, on the other hand, may have to examine a larger pool of potential configurations before finding the optimum solution.
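To make the analyst's suggestion concrete, here is a minimal sketch of what adding a squared age feature does for a plain linear model. The data are synthetic and noise-free, invented purely for illustration (they are not the data behind Fig. 1), and the helper name fit_r2 is hypothetical.

```python
import numpy as np

# Synthetic inverted-U data (an assumption for illustration only):
# income rises with age, peaks at 50, then declines.
age = np.linspace(20, 80, 61)
income = 60.0 - 0.05 * (age - 50.0) ** 2


def fit_r2(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    return 1.0 - residual.var() / target.var()


r2_linear = fit_r2([age], income)              # raw feature only
r2_squared = fit_r2([age, age ** 2], income)   # raw feature plus engineered age^2

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_squared:.3f}")
```

Because the synthetic ages are symmetric around the peak, a straight line in age explains essentially none of the variation, while the same linear model with the extra age^2 column fits the curve almost exactly. Nothing about the model changed; only the features did.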
Let’s use our previous example of a single feature to illustrate the difficulties a neural net could face without feature engineering. Consider the following function:
Attempting to model this function using a single layer neural network with ReLU activation leads to the following:
Clearly, a single layer neural network is not capable of recovering the true function, since its output is simply a monotonic transformation of x. Yet, adding a single hidden layer with 5 neurons will allow our network to begin capturing the U-shape of the relationship:
Adding three hidden layers with 10 neurons each allows for almost perfect recovery of the relationship:
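The first step of this progression can be checked by hand. The sketch below is separate from the fitted networks described above: it uses hand-picked weights, not trained ones, to show that one ReLU neuron can only move in one direction, while two hidden ReLU neurons combined by a linear output can already fall and then rise.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: max(z, 0), applied elementwise."""
    return np.maximum(z, 0.0)


x = np.linspace(-2.0, 2.0, 401)

# One ReLU neuron, relu(w * x + b). The weight and bias are arbitrary
# illustrative values, not fitted ones.
single = relu(1.5 * x + 0.5)
single_is_monotonic = bool(np.all(np.diff(single) >= 0))

# Two hidden ReLU neurons summed by a linear output layer:
# relu(x) + relu(-x) equals |x|, a crude V-shaped stand-in for a U.
two_units = relu(x) + relu(-x)
steps = np.diff(two_units)
two_units_is_monotonic = bool(np.all(steps >= 0) or np.all(steps <= 0))

print("single neuron monotonic:", single_is_monotonic)
print("two hidden neurons monotonic:", two_units_is_monotonic)
```

With a negative weight the single neuron would be non-increasing instead of non-decreasing, but either way it cannot go down and then up. Smoothing the V into something closer to a U is what the additional neurons and layers are for.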
This illustrates the fact that for a neural network to be flexible enough to capture any relationship, it needs to be deep enough (i.e. have enough hidden layers, with enough neurons). Having more hidden layers comes with the cost of additional computations and additional tuning. If you start from a poor initialization (meaning that your starting guess for the weights is far from the true weights), the neural net may take many iterations to converge. This is further complicated by low iteration limits and poorly tuned hyperparameters, not to mention more complicated relationships. Recall that all the neural net does is try different combinations until it finds one that works. Starting far away from the true solution, with a poorly tuned learning rate (essentially how much the model changes from one iteration to the next) and a cap on the number of iterations, can mean that the network never finds the true solution.
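The interplay between starting point, learning rate and iteration cap can be seen even without a neural network. The sketch below runs plain gradient descent on a toy one-parameter loss, (w - 3)^2, which is an assumption made for illustration and not the model from this post; the function name and the specific settings are likewise hypothetical.

```python
def gradient_descent(start, learning_rate, max_iters=100, tol=1e-6):
    """Minimize the toy loss (w - 3)^2; return (w, steps, converged)."""
    w = start
    for step in range(max_iters):
        grad = 2.0 * (w - 3.0)          # derivative of (w - 3)^2
        if abs(grad) < tol:             # gradient small enough: stop
            return w, step, True
        w -= learning_rate * grad       # standard gradient step
    return w, max_iters, False          # hit the iteration cap


runs = {
    "good start, tuned rate": gradient_descent(start=0.0, learning_rate=0.1),
    "poor start, tiny rate": gradient_descent(start=1000.0, learning_rate=0.001),
    "good start, rate too large": gradient_descent(start=0.0, learning_rate=1.1),
}
for name, (w, steps, converged) in runs.items():
    print(f"{name}: w={w:.4g}, steps={steps}, converged={converged}")
```

Only the first run reaches the minimum within the cap of 100 iterations. The second creeps toward it too slowly from a distant start, and the third overshoots further on every step. A real network has many weights rather than one, which makes each of these failure modes harder to spot and to tune away.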
Contrast this with the case where we feed the model x^4 directly. In that case, a single layer neural network, with a single neuron, is in fact sufficient to capture the true relationship, not to mention much easier and faster to train:
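A rough way to see why the engineered feature makes the problem trivial is sketched below. It assumes the target is exactly y = x^4, since the equation itself is not reproduced in this text, and it solves for the single neuron's weight and bias with ordinary least squares instead of iterative training, which is a shortcut chosen here for brevity.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 201)
y = x ** 4                        # assumed target, see the note above

feature = x ** 4                  # the engineered feature handed to the model
X = np.column_stack([feature, np.ones_like(feature)])

# Least squares stands in for training: solve for one weight and one bias.
(weight, bias), *_ = np.linalg.lstsq(X, y, rcond=None)

# Output of a single ReLU neuron applied to the engineered feature.
prediction = np.maximum(weight * feature + bias, 0.0)
max_error = float(np.max(np.abs(prediction - y)))

print(f"weight={weight:.4f}, bias={bias:.4f}, max error={max_error:.2e}")
```

Once the model is handed x^4, the relationship it has to learn is a straight line through the origin, so a weight close to 1 and a bias close to 0 are all it needs.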
Feature engineering in this case amounts to feeding complex features to the model outright, essentially giving it a head start. While, theoretically, deep neural networks can approximate any transformation of the data, this comes with a significant cost in terms of both computation and time. This cost, in a sense, represents the superiority of the human brain’s architecture, specifically its ability for generalized learning, over our current AI architectures; and while that human edge remains, there will still be room for feature engineering.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2, no. 5 (1989): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.