By Stefano Cariddi
In previous blog posts, we introduced the concepts of Data Science and Machine Learning. The goal of this article is to discuss how to obtain state-of-the-art results in these fields. The short answer to the question is: by ensembles of estimators. An estimator is a mathematical function that learns from data to associate a specific output to a given set of inputs. We will also refer to them as models.
Our statement is supported, for instance, by the following plot:
This plot comes from a tweet by François Chollet, the creator of the Keras library (displayed in the top position), and it shows the main tools used by the winners of 120 data science competitions held by Kaggle, the world’s largest data science community. In orange, we have Deep Learning libraries (i.e. libraries to create Neural Networks, which are ensembles of artificial neurons). In blue, we have traditional (classic) Machine Learning libraries (i.e. libraries dealing with everything that is not Neural Networks), the first two of which are specifically built for creating ensembles of estimators (also called meta-estimators, or meta-models), while the third one supports them alongside simpler models. Therefore, at least 115 out of 120 competitions (95.8%) were won through ensembling.
But what, more specifically, are meta-estimators? How do you create them? And which kinds are out there? This article aims to answer these questions. It is divided into two main sections: the first introduces Deep Learning, whereas the second deals with Ensemble Learning, the field of traditional Machine Learning that was developed to orchestrate multiple estimators.
As we have said, deep learning is built upon neural networks, which are ensembles of artificial neurons. So the first question that arises is: what is an artificial neuron? Let us find out.
In short, an artificial neuron is a mathematical function that receives a series of values as inputs, multiplies them by a set of weights, and passes their weighted sum to another function, known as the activation function, whose output is the output of the neuron itself.
If we drop all the mathematical details of the previous formulation, we can see an artificial neuron as the artificial counterpart of a biological neuron. A biological neuron is a cell made up of the following three parts:
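The definition above can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, and the two activation functions (`step` and `sigmoid`) are arbitrary choices made for this example, not something prescribed by the article.

```python
import math

def neuron(inputs, weights, activation):
    """Return activation(weighted sum of the inputs)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

def step(z):
    # Fires (1.0) only when the summed signal is above zero.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Smooth activation that squashes any signal into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values: the weighted sum is 0.4 - 0.2 + 0.2 = 0.4
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, 0.1]

print(neuron(inputs, weights, step))
print(neuron(inputs, weights, sigmoid))
```

Swapping the activation function changes how the same weighted sum is turned into an output, which is the point developed in the next subsection.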
- dendrites, the filaments that receive electrical signals (i.e. the input);
- soma, the body of the cell that sums up all the positive and negative signals coming from the dendrites (which corresponds to the weighted sum that we mentioned before); and
- axon, the filament which returns the electrical signal if its value is higher than a certain threshold (which is mimicked by the activation function).
A visual representation of the aforementioned analogy can be seen in the following figure from the Wikipedia page of artificial neurons:
An extensive explanation of Deep Learning is beyond the scope of this article. In fact, from this simple idea, a huge number of different kinds of neurons and architectures were born, and Deep Learning has grown to the extent of becoming the largest and, in many cases, dominant branch of Machine Learning. A comprehensive summary of such architectures can be found on Neural Network Zoo. In the following subsections, we aim to answer two questions: where do artificial neural networks draw their predictive power from? And when should they be preferred to other kinds of models?
The activation function
The ability of neural networks to model every kind of information lies in their non-linear activation function. There are many kinds of activation functions, each with its own rationale. To keep this explanation short and easy, we will focus on one of the most common ones, the REctified Linear Unit (RELU) function (in blue in the following image from the dedicated page of the excellent blog Machine Learning Mastery):
As you can see, as long as the incoming signal is negative, the RELU function returns zero; once the signal becomes positive, it returns the signal itself. At first glance, this shape may appear to make little sense, but it is actually common in many real-world applications. For example, imagine that you are searching for a house, and each house that you evaluate is characterized, for simplicity, by the following three features:
- has a roof;
- is in a safe area; and
- is well connected to services.
Clearly, these three features are ranked by importance. In fact, it is important that a house is well connected to services, but it is even more important that it is in a robbery-free area. However, being in a safe area is not as important as having a roof. In fact, if you buy a roofless house, you will probably incur huge expenses as a consequence of this choice, since animals, humidity and bad weather will most likely damage all the interiors and the foundation as well.
Now let us use these features to determine the likability of a house. If a house has no roof, it probably does not matter how safe its area is and how well it is connected to the services: you will not like it. Alternatively, if it has a roof and it is well connected to services, but it is in a dangerous area, you may like it but not enough to buy it. Instead, if it has a roof and it is in a safe but not well-connected area, you could consider it as an option in case you do not find anything better. Finally, if it has a roof, it is in a safe area, and it is well connected to services, you will have found your house.
Congratulations! By thinking with this logic, you have unconsciously applied the RELU function to an everyday-life problem. Therefore, you can consider the RELU function elbow as the level of signal below which the minimum requirements for some condition to be met are not satisfied.
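The house reasoning can be written down as a single RELU neuron. The sketch below is an illustration under explicit assumptions: the weights (3, 2, 1) and the minimum-requirement threshold (3) are hypothetical numbers chosen only so that the ranking described above comes out; they are not values given in the article.

```python
def relu(z):
    # Zero below the elbow, the signal itself above it.
    return max(0.0, z)

# Hypothetical weights encoding the ranking roof > safety > services.
WEIGHTS = {"has_roof": 3.0, "safe_area": 2.0, "well_connected": 1.0}
# Hypothetical minimum requirement: without a roof the signal never exceeds it.
THRESHOLD = 3.0

def likability(house):
    signal = sum(WEIGHTS[name] * float(house[name]) for name in WEIGHTS)
    return relu(signal - THRESHOLD)

houses = {
    "no roof, safe, connected": {"has_roof": False, "safe_area": True, "well_connected": True},
    "roof, unsafe, connected": {"has_roof": True, "safe_area": False, "well_connected": True},
    "roof, safe, isolated": {"has_roof": True, "safe_area": True, "well_connected": False},
    "roof, safe, connected": {"has_roof": True, "safe_area": True, "well_connected": True},
}

for description, house in houses.items():
    print(description, likability(house))
```

With these numbers the roofless house scores zero regardless of its other qualities, and the remaining three houses are ordered exactly as in the text.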
For such an elementary problem, one neuron could be enough to provide reliable results, but if your problem is to build an autopilot system for a self-driving car, then it is absolutely not enough. In this case, it would be necessary to build a network of neurons whose ability to model such incredibly complex and highly non-linear datasets results from the combination of their non-linear activation functions. A very nice example of how a neural network is capable of modeling non-linear datasets is available at TensorFlow Playground, which allows you to customize a simple neural network and train it to model a two-dimensional dataset of your choosing. The characteristics of artificial neural networks will be discussed later in this article in a dedicated sub-section.
Neural Networks and Big Data
Earlier we introduced two questions. Having discussed the source of the neural networks’ capability of modeling complex non-linear phenomena, let us now explain when they should be preferred over traditional machine learning models by using the plot below obtained from this publication (Wireless Networks Design in the Era of Deep Learning: Model-Based, AI-Based, or Both?), which links the performance of an estimator with the size of the dataset used for training it:
It is possible to find many similar plots in the literature highlighting how deep learning models display a shallower learning curve than traditional machine learning estimators when the dataset is small, but substantially outperform them as the dataset size grows by orders of magnitude. Unfortunately, it is not easy to find a version of such a plot providing information about the size above which Deep Learning consistently delivers superior performance. In fact, the position of such a threshold depends on a series of factors, such as:
- the nature of the problem, and the metric used for the performance evaluation;
- the number of features and data points involved;
- the noise that affects the data; and
- both the classical and deep learning models involved in the study.
In our experience, the crossing point lies somewhere between ~1 GB and ~10 GB of data, and the situation displayed on the right edge of the plot is observed for petabyte-scale data. However, the computational power needed for handling such datasets is prohibitive for everyone, aside from a handful of high-tech companies.
An example of such models is GPT-4, developed by OpenAI, which was co-founded by Elon Musk, among others. Before its release, GPT-4 was widely rumored to have 100 trillion connections between neurons, roughly as many as the human brain; OpenAI has not confirmed this figure.
The reason why it is possible to train such models to deliver predictions significantly better than those of traditional machine learning models is that the training of neurons can be parallelized over the cores of NVIDIA Graphics Processing Units (GPUs), thus allowing the simultaneous weight tuning of thousands of neurons. Given the increasing importance of artificial neural networks, Google even developed Tensor Processing Units (TPUs), which are processors specifically designed to train Deep Learning models. For a comprehensive performance and cost comparison between GPUs and TPUs, you can read this Google Cloud Platform blog post: Now you can train TensorFlow machine learning models faster and at lower cost on Cloud TPU Pods. The only thing to add is that the comparison in that post was carried out with version 2 of Google TPUs, whereas version 4 has since been released, which magnifies the performance gain even further.
Artificial Neural Networks
So we have introduced the concept of artificial neurons, and we have explained why they can model complex, non-linear phenomena and when they are the best choice for solving data science tasks. Now we will explain how to assemble such neurons into networks.
In biology, each neuron can theoretically be connected to any other (provided that they are sufficiently close). Instead, in artificial neural networks, in order to reduce the number of connections, neurons are arranged in layers, which are connected to each other. Intra-layer connections are not possible. This means that each neuron within a layer learns a different piece of information from the previous layer, and such pieces can only be combined by neurons of the following layer. This can be seen in the following image, from the Wikipedia page of Artificial Neural Networks:
The strength of a neural network lies in the number of its connections, and in the goodness of the tuning of their weights. No neuron alone could learn the non-linear structures of a dataset, nor could it do its best on petabyte-scale data. Union is strength.
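To make the layered structure concrete, here is a minimal sketch of a two-layer network that computes XOR, a pattern that no single neuron of the kind described earlier can reproduce. The weights are set by hand rather than learned, and the bias terms (a constant added to each weighted sum) are an assumption of this example that the article does not discuss.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def layer(inputs, weight_rows, biases, activation):
    """Every neuron in the layer sees all outputs of the previous layer."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + bias)
        for row, bias in zip(weight_rows, biases)
    ]

# Hand-picked (not trained) parameters, for illustration only.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]

def network(x1, x2):
    hidden = layer([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, network(a, b))
```

The two hidden neurons never talk to each other; their pieces of information are combined only by the output neuron of the following layer.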
Have you ever heard about Random Forests or Gradient Boosting? If so, after reading this section you will have a much clearer idea about them. But first we need to take a step back and go where it all began: with Decision Trees.
The Decision Trees
In the field of traditional Machine Learning, decision trees are the most common building blocks for ensemble models. Decision trees are estimators that perform their classification or regression task by, literally, building a tree of subsequent decisions (for a comprehensive tutorial, please see this video). Let us consider an example from the Wikipedia page of Decision Tree Learning, regarding the survival of the passengers on the Titanic (the Titanic dataset can be found on Kaggle, at this link):
This simple decision tree is built upon three features: gender, age, and SibSp.
Gender and age need no explanation. SibSp corresponds to the number of siblings and/or spouses that the passenger was traveling with.
The previous decision tree could be interpreted as:
- If the passenger gender is female, she survived. Else…
- If the passenger age is < 9.5 years, he died. Else…
- If the passenger number of siblings/spouse is < 3, he survived. Else, he died.
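A decision tree of this size is nothing more than a few nested conditions. The sketch below transcribes the three rules exactly as they are listed above; the function name and argument names are chosen for this example only.

```python
def predict_survival(gender, age, sibsp):
    """Return True if the tree, as read above, predicts survival."""
    if gender == "female":
        return True
    if age < 9.5:
        return False
    # Remaining passengers: decided by the number of siblings/spouses aboard.
    return sibsp < 3

print(predict_survival("female", 30, 0))
print(predict_survival("male", 5, 1))
print(predict_survival("male", 30, 1))
print(predict_survival("male", 30, 4))
```

A learning algorithm differs from this hand-written version only in that it chooses the features, the thresholds, and the order of the questions from the data.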
Of course, this is not a decision tree that works for every passenger of the Titanic. However, having been built starting from mathematical laws, it represents some true pattern visible within the data. So let us try and interpret this decision tree:
- If the passenger gender is female, she survived.
  - This is reasonable. After all, women and children are the first to be evacuated in life-or-death situations.
- If the passenger age is < 9.5 years, he died.
  - Wait, what? Are children expected to die more often than adults? This is strange… and yet, if the decision tree determined it, it means that this is what the data say.
- If the passenger number of siblings/spouse is < 3, he survived. Else, he died.
  - This is interesting too… people with fewer siblings appear to have a higher probability of survival than those with many.
In order to comprehend what is happening behind the scenes, one should take a look at the entire dataset, which contains more features than the three used for building this simple decision tree. What one would realize by looking at the entire dataset is that low-income passengers tended to have the largest families, and that these passengers were mostly accommodated on the lower decks.
When the collision with the iceberg happened, the lower decks were the first to go underwater, and the people who actually managed to escape them were the last ones to reach the lifeboats. Therefore, although the model is actually fitting some real pattern, it could be improved by replacing the age and the number of siblings/spouses with the ticket price, for example, or even better, the deck.
So what is this telling us? It’s telling us that many different decision trees could be built to model the same phenomenon, and although in this case we can grasp the cause-effect relations that could make a model more suitable than others, this is seldom the case in real life (see this article for more details).
Imagine that you had to model some complex phenomenon (such as predicting the sales of a product before it is launched on the market, or whether a human fetus will develop some rare genetic disease) starting from a pool of a thousand weakly correlated or uncorrelated variables, many of which are noisy or redundant with each other. How would you do that? Certainly not by using a single tree. Thankfully for us, we can count on statistics, and in particular, on the Law of Large Numbers.
According to the Law of Large Numbers, the larger the number of trials that you perform, the closer their average output gets to the expected value. This is the idea at the base of the first Ensemble Learning technique that we are going to discuss: Bootstrap Aggregation.
Imagine that you are the sales manager of a company and you need to make a price offer to a customer for developing a solution that they require. The team responsible for the implementation consists of three people:
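The Law of Large Numbers is easy to observe numerically. The sketch below uses a fair six-sided die, an example chosen here for illustration, whose expected value is 3.5; the seed is fixed only to make the run repeatable.

```python
import random

def average_of_rolls(n_rolls, rng):
    """Average outcome of n_rolls throws of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

rng = random.Random(0)  # fixed seed, so the run is repeatable
for n_rolls in (10, 1_000, 100_000):
    print(n_rolls, average_of_rolls(n_rolls, rng))
```

The averages printed for larger numbers of rolls should sit progressively closer to 3.5, although any single run can fluctuate.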
- Adam, junior developer;
- Bob, developer; and
- Christine, senior developer.
You explain the project to the three of them, and you ask each of them to provide you, individually, with an estimate of the number of working days needed to perform the task. The three developers reply like this:
- Adam, 6 days;
- Bob, 8 days; and
- Christine, 10 days.
In order to provide the customer with a single value, you choose to average the three predictions: (6 + 8 + 10) / 3 = 8. Congratulations, you have just applied bootstrap aggregation (in short, bagging) to real life!
Bagging, in fact, consists of a first phase where N trees are built independently, using different data and feature subsets, and a second phase where their predictions are averaged. Since errors tend to cancel each other out, the ensemble prediction set will be more robust, and in most cases, more accurate than the single prediction sets.
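The two phases can be sketched with the simplest possible tree, a one-split regression tree (a "stump"). This is a minimal illustration under stated assumptions: the toy dataset is a noisy step invented for the example, and because it has a single feature, only the data are resampled, not the feature subsets.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs by minimizing squared error."""
    xs = sorted({x for x, _ in points})
    best = None
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the overall mean
        overall = mean(y for _, y in points)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagging(points, n_estimators, rng):
    models = []
    for _ in range(n_estimators):
        # Phase 1: each tree sees its own bootstrap sample (drawn with replacement).
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    # Phase 2: the ensemble prediction is the average of the trees' predictions.
    return lambda x: mean(model(x) for model in models)

rng = random.Random(42)
# Invented toy data: a step from 0 to 10 at x = 5, plus Gaussian noise.
data = [(x, (10.0 if x >= 5 else 0.0) + rng.gauss(0, 1)) for x in range(10)]
ensemble = fit_bagging(data, n_estimators=50, rng=rng)
print(ensemble(2), ensemble(8))
```

Each stump is trained independently of the others, which is why this phase can be run in parallel.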
This is actually what happens in the case of our example; in fact:
- Every developer will be trained with different data because everyone has dealt with different projects in the past.
- According to his/her own experience and perspective, every developer will consider a different set of features for creating his/her own model. For example, Adam, being a junior developer, may be prone to neglecting or underestimating some important tasks because he did not realize that they were needed in the first place, whereas Christine, as a senior developer, may be able to correctly identify all the significant factors and frame them in the big picture.
For regression problems, bagging-based meta-estimators perform the final estimation by computing the mean or median of all the evaluations for each sample. Instead, for classification tasks, they use a majority voting approach.
The two most famous bagging meta-models are Random Forest and Extremely Randomized Trees (Extra Trees). Both of them are built upon decision trees, but they differ in the way the trees are built. Whereas random forests are made of proper decision trees, extra trees are built upon random trees, which correspond to decision trees where the splits are performed randomly instead of according to some Information Theory criterion. Without adding too much detail, extra trees are used to reduce overfitting, which corresponds to fitting noise alongside information.
And now, the fundamental question: when should you use bagging? The power of bagging lies in its ability to reduce the variance, which corresponds to inherent, random fluctuations of the data. For example, random fluctuations are responsible for the different values that you obtain if you measure your body temperature with an infrared thermometer multiple times in a row. In fact, by relying on multiple, independent estimations, bagging is able to reduce the statistical fluctuations and let the underlying signal emerge.
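Both aggregation rules fit in a few lines. In this sketch the function names are chosen for the example, and the class labels in the voting call are invented; the regression call reuses the three developers' estimates from above.

```python
from collections import Counter
from statistics import mean, median

def aggregate_regression(predictions, how="mean"):
    """Combine numeric predictions for one sample."""
    return mean(predictions) if how == "mean" else median(predictions)

def aggregate_classification(votes):
    """Return the class with the most votes (on a tie, the first one seen)."""
    return Counter(votes).most_common(1)[0][0]

# Adam, Bob and Christine's estimates from the example above.
print(aggregate_regression([6, 8, 10]))
print(aggregate_regression([6, 8, 10], how="median"))
# Invented class labels, for illustration only.
print(aggregate_classification(["survived", "died", "survived"]))
```

The median is less sensitive than the mean to a single wildly wrong estimator, which is one reason both options are used in practice.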
So what if you, as a sales manager, realize that by simply averaging the estimations of Adam, Bob, and Christine, you constantly end up underestimating the development time? Well, in this case, your issue is not variance and you should move to the second ensemble learning technique: Boosting.
Whereas bagging corrects variance, boosting has been developed to reduce bias. So what is bias? Simply put, it is the average difference between the predicted and true values. For example, if you predict 8 days, while the final amount is 9, and the next time you predict 15, while the true value is 16, then you have a bias of −1 day. Unlike variance, bias is attributable not to the data, but to the model itself.
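The arithmetic of the example can be checked directly. The sign convention used here, predicted minus true, is the one implied by the −1 result in the text.

```python
from statistics import mean

def bias(predicted, actual):
    """Average of (predicted - true) over all the estimates."""
    return mean(p - a for p, a in zip(predicted, actual))

# The two estimates from the example: 8 vs 9 days, then 15 vs 16 days.
print(bias([8, 15], [9, 16]))
```

A negative value means that the predictions are, on average, too low.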
The careful reader may notice that this definition could appear to leave out the case of biased datasets, so we briefly specify that the sample selection is, in our definition, the zero-th step of the model creation. Therefore, without adding further details, if the samples fed into our model are biased, such bias will reverberate on the predictions as well.
Getting back on track, averaging the developers’ estimations is probably not the most robust approach. In fact, Adam, being a junior developer, will be more prone to making errors than Bob, who will be, in turn, more prone to making errors than Christine. So, in order to perform a better estimation, you could ask Adam to perform his own evaluation and pass it to Bob. Bob will review and correct it according to his own experience before passing it to Christine who will repeat the procedure. Finally, you will collect all the predictions and average them by giving more weight to Christine’s and less to Adam’s. Congratulations, you have now done boosting as well!
As you can see from this example, whereas bagging simply parallelizes the estimators and gathers their predictions, boosting chains them.
The first boosting-based meta-model to be developed was Adaptive Boosting. In its first implementation, the various estimators were instructed to give more weight to misclassified data points in order to correct the residual errors. Finally, all their predictions were averaged with a weight that was proportional to the strength of each learner (namely, its predictive power). Without going into too much detail, Adaptive Boosting was later reinterpreted as minimizing an exponential loss function, which was a more general and useful case. However, in order to generalize it further, any kind of differentiable loss function had to be supported. To achieve this, a new approach was needed, and the estimators were repurposed to approximate the gradient of the loss function, which led to the birth of Gradient Boosting. Gradient Boosting is therefore more flexible than Adaptive Boosting, which is why it led to the very powerful implementations that we saw in the image at the beginning of this article: eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM).
What bagging and boosting have in common is that they are typically built upon decision trees, so what if you wanted to mix different kinds of estimators? The answer to this question dates back to the early ’90s, to a paper called Stacked Generalization (Wolpert 1992), which set the foundation for the most powerful (and yet currently least explored) paradigm of ensemble learning: Stacked Generalization.
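The chaining idea behind Gradient Boosting can be sketched for the simplest case, the squared loss, where the negative gradient is just the residual (true value minus current prediction). This is a bare-bones illustration, not how XGBoost or LightGBM are implemented; the one-split trees, the learning rate, the number of rounds, and the toy data are all assumptions of this example.

```python
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs by minimizing squared error."""
    xs = sorted({x for x, _ in points})
    best = None
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the overall mean
        overall = mean(y for _, y in points)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gradient_boosting(points, n_rounds=20, learning_rate=0.5):
    base = mean(y for _, y in points)  # start from a constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # Squared loss: the negative gradient is the residual left by the chain so far.
        residuals = [(x, y - predict(x)) for x, y in points]
        stumps.append(fit_stump(residuals))
    return predict

def mse(model, points):
    return mean((y - model(x)) ** 2 for x, y in points)

# Invented toy data: y = x^2, which a single split cannot follow.
data = [(x, float(x * x)) for x in range(6)]
single = fit_stump(data)
boosted = fit_gradient_boosting(data)
print(mse(single, data), mse(boosted, data))
```

Each new stump is trained on what its predecessors got wrong, so the estimators must be built one after another rather than in parallel.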
Over the years, some authors referred to it as Stacking or Blending interchangeably, whereas others introduced a difference between these concepts. Since detailing every minutia of this approach is beyond the scope of this article, we will simply stick to the original definition and use it to aggregate all the techniques that were born from it.
In the context of stacked generalization, both bagging and boosting approaches are possible, but with heterogeneous estimators. In fact, in order to create a stacked-ensemble model (i.e., a meta-model built according to the stacked generalization paradigm), the various estimators are first aggregated within layers in a bagging-like fashion, then the layers are chained in the same way as in a boosting-type meta-estimator. Differently from boosting, however, each layer can either work with its predecessor’s outputs or with both its predecessor’s outputs and the original data. In the end, the structure of a stacked-ensemble model closely resembles that of an artificial neural network.
A very important difference among artificial neural networks, bagging/boosting estimators and stacked ensemble models, however, lies in the number of estimators involved. In fact, the stronger the estimators, the fewer of them are required to build a powerful model. An artificial neural network can be made of millions of neurons, each being a very weak learner, while random forests and gradient boosting machines can be limited to a few hundred or a few thousand trees. Stacked ensemble models, instead, can be made of only a handful of estimators/meta-estimators and, in any case, no more than a few dozen. The real complexity of building a stacked ensemble model, in fact, lies elsewhere, and it can be summarized in three points:
- selecting the right blend of estimators for building the layer
- selecting the best hyperparameter combination for each estimator of the layer
- in case of a multi-layer architecture, selecting also which data to feed into each layer
We will now analyze each of these points.
Unlike all the other ensemble learning techniques, there is no standard estimator to fit in the slots of the model that you are going to create, so how do you choose? Should you use a random forest, a gradient boosting machine, or something different? And which hyperparameters should you use to initialize that specific estimator? In fact, a random forest made of a thousand trees with a maximum depth of four levels will have a completely different structure (and predictive power as well) than a random forest with five hundred trees and a maximum depth of eight levels. Unfortunately, there is no universally acknowledged criterion for building up a layer.
A tempting solution could be to run n different models and stack the best ones in the desired layer. Now imagine that the two best models are the following ones:
- random forest with 1000 decision trees and some other hyperparameter values
- random forest with 950 decision trees and all the other hyperparameter values equal to those of the random forest above
It is clear that they are very similar to each other. Consequently, their prediction sets will roughly be the same, and this is a problem. In fact, the key for a successful stacked ensemble model lies in diversity. If the only difference between the prediction sets of the two random forests that we are considering is that the second one makes three more errors than the first one, then adding it to our layer does not add any information at all! Therefore, when choosing which models to fit in the slots of your layer, you should not only look at the information that they share with the target variable, but also at the information that they share with each other. Let us prove the concept with numbers.
We want to stack three binary classification models (i.e., with an output that can only be 0 or 1) in a layer that will perform the final predictions according to a majority voting criterion. Namely, each prediction will be given by the class that receives the most votes. This is an example of poorly chosen estimators (correct predictions are in boldface):
One can see that the second and third estimators do not add any information to that of the first one. Actually, they add noise! In fact, the majority-voting prediction set has an accuracy of 60% in comparison with the 80% of the first model.
On the contrary, this would be a successful combination:
In this second case, although the best estimator gives us, again, an accuracy of 80%, the combination of very different prediction sets leads us as high as 90% for the majority-voting set.
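The effect of diversity on majority voting can be reproduced with a short script. The prediction sets below are hypothetical: they are invented for this sketch so that they yield the accuracies quoted above (80% for the best model, 60% for the poor combination, 90% for the successful one), and they are not the article's original tables.

```python
def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def majority_vote(*prediction_sets):
    """For each sample, return the class (0 or 1) chosen by most estimators."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*prediction_sets)]

# Hypothetical data: ten samples whose true class is always 1.
truth = [1] * 10
best = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]     # 80% accurate

# Poor choice: two weaker models whose errors overlap with each other.
similar = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 60% accurate
poor_ensemble = majority_vote(best, similar, similar)

# Better choice: weaker models whose errors fall on different samples.
second = [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]   # 70% accurate
third = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]    # 60% accurate
good_ensemble = majority_vote(best, second, third)

print(accuracy(poor_ensemble, truth))
print(accuracy(good_ensemble, truth))
```

In the second combination no individual model is better than before; the gain comes entirely from the fact that the models rarely make the same mistake on the same sample.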
These considerations should guide the hyperparameter tuning as well. The hyperparameter values, in fact, should not be chosen in order to maximize the predictive power of the single estimator to which they refer, but in order to maximize its diversity from all the others. This often means giving up some predictive power within each estimator in order to obtain more predictive power for the ensemble. Hence, creating a successful stacked ensemble model is not different from creating a successful human team: rather than seeking the superstar who can do everything on his/her own, you should try to build synergy and harmony in such a way that each team member’s weaknesses are compensated by the ensemble strength. If this is done properly, no superstar will be able to beat your team. If you want data science-related proof, check what the Netflix Prize was and how it was won. It is an enlightening and inspiring story that you can find, among other places, in Gina Keating’s book Netflixed: The Epic Battle for America’s Eyeballs. (Summary for the lazy readers: the two best models, which finished in a tie, were both stacked ensemble models, and they were created thanks to the joint effort of smaller teams that merged in order to match someone’s good ideas with someone else’s better coding skills; essentially, they performed ensembling in real life too).
Finally, when chaining multiple layers, you also need to choose the inputs that each layer will receive. For example, you could enable the second layer to work with only the outputs of the first layer, or you could stack both the outputs of the first layer and the original features. What is the difference? Well, in the first case the second layer will be built and trained much more quickly because it will work with fewer features, whereas in the second case you will probably obtain better results but at the expense of more time and computational resources. However, if the first layer is extremely powerful on its own, the improvement may be so marginal that it would not be worth the effort, and you may end up overfitting your data.
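The two options differ only in how the input table of the second layer is assembled. The sketch below is illustrative: the function name and the numbers are hypothetical, and the flag is called `passthrough` by analogy with the parameter of the same name in scikit-learn's stacking estimators.

```python
def second_layer_inputs(first_layer_outputs, original_features, passthrough):
    """Build one input row per sample for the second layer.

    passthrough=False: only the first layer's predictions are passed on.
    passthrough=True:  predictions and original features are concatenated.
    """
    if not passthrough:
        return [list(outputs) for outputs in first_layer_outputs]
    return [
        list(outputs) + list(features)
        for outputs, features in zip(first_layer_outputs, original_features)
    ]

# Hypothetical numbers: 2 samples, 3 first-layer estimators, 4 original features.
first_layer_outputs = [[0.9, 0.7, 0.8], [0.2, 0.4, 0.1]]
original_features = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]

narrow = second_layer_inputs(first_layer_outputs, original_features, passthrough=False)
wide = second_layer_inputs(first_layer_outputs, original_features, passthrough=True)
print(len(narrow[0]), len(wide[0]))
```

The narrow table has one column per first-layer estimator, while the wide one also carries every original feature, which is where the extra training cost comes from.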
In this article, we have given an overview of the different kinds of meta-estimators and we have detailed why they are so important for achieving top-notch results in data science. Every kind of meta-model is a tool and, just like any other tool, it offers different advantages and disadvantages. Choose wisely when working with your data and do not be afraid of trying new approaches because they are the only way to allow the progress of mankind. Happy ensembling!