Unlocking the Secrets of the Gradient Boosting Classifier.
This is the final project for PACMANN’s online Advanced Machine Learning course. In this article, I will share what I have learned about the Gradient Boosting algorithm, especially for classification problems.
I have provided the Python code for this algorithm from scratch, which is available on my GitHub repository.
The story behind the Gradient Boosting Algorithm.
Before we delve into the details of the gradient boosting algorithm, let's have some fun and play a game first [1]!
Suppose we are given a dataset {(x1, y1), (x2, y2), …, (xn, yn)}, and our task is to fit some model F(X) that can predict y close enough, i.e., minimize the loss. We cannot remove or modify F(X), but we can add another regression tree model, h(X), to create a new F(X), which is now defined as F(X) + h(X).
Assume that our model F(X) is effective but not flawless. There are some mistakes: F(x1) = 1.1 while y1 = 1.0, and F(x2) = 3.4 while y2 = 3.5. So, we need to improve our model by adding another h(X) s.t.:
F(x1) + h(x1) = y1, F(x2) + h(x2) = y2, …, F(xn) + h(xn) = yn
Or, we can say that we want h(X) to satisfy this equation:
h(xi) = yi − F(xi), for i = 1, …, n
But how can we get h(X)? The answer is pretty simple: just fit a regression tree to the dataset {(x1, h(x1)), (x2, h(x2)), …, (xn, h(xn))}, where each target h(xi) = yi − F(xi) is the residual of the current model.
The term h(X) is referred to as a pseudo-residual or pseudo-response [1,2,3,4], whose role is to make up for the limitations of the current model F(X) [1]. If the new model F(X) + h(X) is unsatisfactory, another h(X) can be added [1]. That's why the gradient boosting model is considered an ensemble model: it combines several simple decision trees, called weak prediction models, to provide a prediction model that makes very few assumptions about the data [4].
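To make the story concrete, here is a minimal sketch of a single boosting round. It is my own illustration, using scikit-learn's DecisionTreeRegressor rather than the article's own tree implementation, and the toy data is made up:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = DecisionTreeRegressor(max_depth=1).fit(X, y)           # an imperfect first model F(X)
residuals = y - F.predict(X)                               # what F(X) gets wrong
h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # h(X) fitted to the residuals

new_prediction = F.predict(X) + h.predict(X)               # the improved model F(X) + h(X)
print(np.mean((y - F.predict(X)) ** 2))                    # training error of F(X)
print(np.mean((y - new_prediction) ** 2))                  # smaller error after adding h(X)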
Now, where does the term “gradient” come from?
Recall that gradient descent minimizes a function J(θ) by moving in the opposite direction of its gradient, mathematically represented as [1]:
θi := θi − ρ ∂J/∂θi
Suppose we define our loss function as the squared error:
L(y, F(x)) = (y − F(x))^2 / 2
and we want to minimize J = Σ L(y_i, F(x_i)) w.r.t. F(x_i), treating each prediction F(x_i) as a parameter, s.t.:
∂J/∂F(x_i) = ∂L(y_i, F(x_i))/∂F(x_i) = F(x_i) − y_i
Thus, we can say that:
h(x_i) = y_i − F(x_i) = −∂J/∂F(x_i)
Then, the update F(x_i) := F(x_i) + h(x_i) = F(x_i) − ∂J/∂F(x_i) is exactly a gradient descent step with step size 1; in other words, the residuals that h(X) is fitted to are the negative gradients of the loss.
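As a quick numerical sanity check (my own sketch, reusing the toy numbers from the earlier example), the negative gradient of the squared-error loss at a prediction really is the residual:

import numpy as np

# Squared-error loss L(y, F) = (y - F)**2 / 2, evaluated at y2 = 3.5, F(x2) = 3.4
y, F = 3.5, 3.4
eps = 1e-6
loss = lambda f: (y - f) ** 2 / 2
grad = (loss(F + eps) - loss(F)) / eps    # finite-difference approximation of dL/dF
print(-grad)                              # ~0.1, the negative gradient
print(y - F)                              # ~0.1, the residual h(x2) = y2 - F(x2)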
In summary, the gradient boosting algorithm is built in a stage-wise fashion, like other boosting methods. It is more flexible, however, because it can optimize any differentiable loss function, thereby generalizing those other methods.
The Gradient Boosting Classifier Algorithm.
For simplicity, we will limit our discussion to binary classification, where the outcome is either 0 or 1.
Classification Problem.
A common approach for binary classification is to build some model f(x) that predicts y in the form of logits (log of odds) from any predictor x, and then convert the logits into probabilities p using the logistic function [4,5]:
p = e^y / (1 + e^y) = 1 / (1 + e^(−y))
Because the output of f(x) is y, which can take any real value on (−∞, ∞), there is no hard constraint on the individual components or on their summation. This is why the Gradient Boosting Classifier is built from regression trees whose summed outputs provide the log of odds y.
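In code, the conversion from log of odds to probability is just the sigmoid function. A minimal sketch (my own, not the repository code):

import numpy as np

def log_odds_to_probability(log_odds):
    # Logistic (sigmoid) function: maps a log of odds on (-inf, inf) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-np.asarray(log_odds, dtype=float)))

print(log_odds_to_probability(0.0))    # 0.5
print(log_odds_to_probability(2.0))    # ~0.88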
Now, how do we measure the error of our model‘s predictions? The solution is to calculate the likelihood of getting y from x [5]. The likelihood can be mathematically represented as [4,5]:
likelihood = Π_i P(y_i | x_i, F(x_i)) = Π_i p_i^(y_i) · (1 − p_i)^(1 − y_i), where p_i = P(y_i = 1 | x_i, F(x_i))
Consider this case: we have y_true as our real data and P(y|X) as our predicted values in terms of probabilities. The likelihood is then the product, over all samples, of the predicted probability assigned to the class that actually occurred.
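Here is a hypothetical toy version of that computation (the numbers are my own, not the original example):

import numpy as np

# Hypothetical true labels and predicted probabilities P(y=1|x)
y_true = np.array([1, 1, 0, 1])
p_pred = np.array([0.9, 0.8, 0.3, 0.7])

# Likelihood: take p when y=1 and (1 - p) when y=0, then multiply everything together
likelihood = np.prod(np.where(y_true == 1, p_pred, 1 - p_pred))
print(likelihood)   # 0.9 * 0.8 * 0.7 * 0.7 = 0.3528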
We can simplify the computation of the likelihood by taking its logarithm (log-likelihood), which turns the product into a sum [4,5]:
log-likelihood = Σ_i log P(y_i | x_i, F(x_i))
From the previous example, we know that if y_true = 1, then the probability is P(y_i=1|x_i, F(x_i)). But if y_true = 0, then the probability is P(y_i=0|x_i, F(x_i)), or equivalently 1 − P(y_i=1|x_i, F(x_i)).
To simplify the calculation, we can rewrite our log-likelihood function as [4,5]:
log-likelihood = Σ_i [y_i log(p) + (1 − y_i) log(1 − p)]
where p = P(y_i=1|x_i, F(x_i)). To achieve the optimum F(X), we need to maximize the log-likelihood, that is, minimize its negative. Since our model F(X) predicts y in terms of log odds, the negative log-likelihood can be rewritten as:
negative log-likelihood = Σ_i [log(1 + e^(F(x_i))) − y_i F(x_i)]
In conclusion, to solve the binary classification problem, we aim to minimize the loss function known as negative log-likelihood.
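To make the loss concrete, here is a minimal sketch (my own illustration, not the repository code) that evaluates the negative log-likelihood directly from the predicted log odds:

import numpy as np

def negative_log_likelihood(y_true, log_odds):
    # Sum over samples of log(1 + exp(F(x_i))) - y_i * F(x_i)
    y_true = np.asarray(y_true, dtype=float)
    log_odds = np.asarray(log_odds, dtype=float)
    return np.sum(np.log(1.0 + np.exp(log_odds)) - y_true * log_odds)

# Example: true labels and predicted log odds
print(negative_log_likelihood([1, 0, 1], [2.0, -1.0, 0.5]))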
Gradient Boosting Classifier Pseudocode.
From the previous section, we already know that the main idea of the gradient-boosting algorithm is to fit a sequence of regression trees to the pseudo-residuals, i.e., the negative gradient of the loss function, to get the closest approximation of y = f(x). For binary classification problems, our objective is to minimize the negative log-likelihood.
Fitting.
The following is the pseudocode for fitting the gradient-boosting classifier algorithm using regression trees as the weak learners [2,3,4].
Let's take a closer look at each step of the pseudocode.
Step 1.
We have a dataset D = {(x, y)}, where x is a row of measurements that we will use to predict, and y is the target value (either 0 or 1). Our differentiable loss function is the negative log-likelihood. M represents the number of iterations, i.e., the number of regression trees we want to construct. And lastly, we have the learning rate ν (nu).
Step 2.
Create a decision stump model F_0(x) with a constant value. This means we have to find the optimum γ value (log odds) that minimizes the sum of the loss function, i.e., take the derivative of the loss function w.r.t. γ and set it to 0.
After solving the resulting equation, we obtain the probability shown below:
p = (number of samples with y = 1) / (total number of samples) = Σ y_i / n
Then, the value of F_0(x) is the log odds of that probability (γ):
F_0(x) = γ = log(p / (1 − p))
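A minimal sketch of this initialization step (my own illustration; variable names may differ from the repository code):

import numpy as np

y = np.array([1, 0, 1, 1, 0])       # toy binary targets
p = np.mean(y)                      # proportion of positive samples
F0 = np.log(p / (1 - p))            # initial constant log odds, F_0(x) = gamma
print(p, F0)                        # 0.6 and ~0.405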
Step 3.
For m = 1 ... M, do:
A. Compute the pseudo-residual r_im for each row in the data (i) for the current tree (m). After doing the math, r_im is just y_i − p, where y_i is the actual target and p is the predicted probability given x_i from the previous model F(x) = F_{m-1}(x).
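A minimal sketch of computing the pseudo-residuals (my own illustration; F_prev and the toy numbers are assumptions):

import numpy as np

y = np.array([1, 0, 1, 1, 0])                  # toy binary targets
F_prev = np.full(5, 0.405)                     # current log-odds predictions, e.g. F_0(x)
p_prev = 1.0 / (1.0 + np.exp(-F_prev))         # convert log odds to probabilities
residuals = y - p_prev                         # pseudo-residuals r_im = y_i - p
print(residuals)                               # roughly [0.4, -0.6, 0.4, 0.4, -0.6]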
B. Fit a new regression tree F_m(x) to the pseudo-residuals r_im. Creating terminal regions means collecting all the leaves of the tree and naming them R_jm, one region for each leaf j.
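To illustrate what the terminal regions are, here is a sketch using scikit-learn's DecisionTreeRegressor (an assumption for illustration; the repository implements its own regression tree). The apply() method returns the index of the leaf, i.e., the region R_jm, that each sample falls into:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])      # toy feature values
residuals = np.array([0.4, -0.6, 0.4, 0.4])     # pseudo-residuals from the previous step

tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
leaf_index = tree.apply(X)                      # terminal region R_jm of each sample
print(leaf_index)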
C. Compute γ_jm, which means we have to compute all the leaf values using the given formula. Here is a simplified and easier-to-code formula for the leaf value:
γ_jm = Σ r_im / Σ p_{m-1}(x_i) (1 − p_{m-1}(x_i)), where both sums run over the samples x_i in leaf R_jm
The equation above states that, for every leaf, the leaf value is the sum of all the pseudo-residuals that end up in that leaf divided by the sum of p (1 − p) over the same samples, where p is the previous probability obtained from the previous model F_{m-1}(x).
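A minimal sketch of the leaf-value computation (my own illustration; the helper name is hypothetical):

import numpy as np

def leaf_value(residuals_in_leaf, p_prev_in_leaf):
    # gamma_jm = sum of residuals in the leaf / sum of p * (1 - p) in the leaf
    return np.sum(residuals_in_leaf) / np.sum(p_prev_in_leaf * (1.0 - p_prev_in_leaf))

# Samples that ended up in one leaf: their pseudo-residuals and previous probabilities
print(leaf_value(np.array([0.4, 0.4, 0.4]), np.array([0.6, 0.6, 0.6])))   # ~1.67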
Step 4.
Save all of the fitted regression trees F_m(x) for m = 0 ... M.
Predicting.
Here is the pseudocode for predicting on given data x using the regression trees fitted with the gradient-boosting classifier algorithm.
Let's examine each step of the pseudocode more closely.
Step 1.
We have an input feature x, the fitted regression trees F_m(x) for m = 0 ... M, and the learning rate ν (nu).
Step 2.
Initialize a variable called total_log_odds to sum up the log odds from each regression tree.
Step 3.
For each regression tree F_m(x), get the leaf value for the input data x as the predicted log odds, then multiply it by the learning rate ν (nu), except for the first regression tree F_0(x). After that, add the value to the total_log_odds variable.
Step 4.
Calculate the predicted probability from the total_log_odds using this formula:
p = e^total_log_odds / (1 + e^total_log_odds) = 1 / (1 + e^(−total_log_odds))
Step 5.
Determine the class prediction by comparing the calculated probability for the input data x with 0.5. If the probability is greater than 0.5, the predicted class is 1; otherwise, the predicted class is 0.
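Putting the prediction steps together, here is a minimal sketch (my own illustration; the tree objects and their predict_leaf_value method are hypothetical and will differ from the repository code):

import numpy as np

def predict_single(x, trees, learning_rate):
    # trees[0] is the constant stump F_0(x); the rest were fitted to pseudo-residuals
    total_log_odds = trees[0].predict_leaf_value(x)              # hypothetical leaf lookup
    for tree in trees[1:]:
        total_log_odds += learning_rate * tree.predict_leaf_value(x)
    probability = 1.0 / (1.0 + np.exp(-total_log_odds))          # log odds -> probability
    return 1 if probability > 0.5 else 0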
Implementation.
Now that we understand the details of how the gradient-boosting classifier algorithm works, we will apply it to a simple problem. I have provided the from-scratch Python code here. We will use a dataset builder from the sklearn library. Here is the basic usage of the gradient-boosting classifier.
from ml_from_scratch.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from ml_from_scratch.metrics import accuracy_score
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1, max_depth=4)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, pred)*100}%")
Hyperparameter Experiments.
In this section, we experiment with tuning the Gradient Boosting Classifier’s hyperparameters, n_estimators, learning_rate, and max_depth, using KFold cross-validation to obtain reliable performance estimates. We test several values for each hyperparameter to see how they affect the model’s performance. For n_estimators, we try 10, 50, 100, and 150 to understand the effect of the number of trees. For learning_rate, we explore rates from 0.1 to 0.4 to see how quickly the model learns. And for max_depth, we look at depths from 2 to 5 to study how tree depth influences model complexity. This systematic approach helps us find the combination of these parameters that best improves the model’s accuracy. Here is the code for conducting the experiment.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from ml_from_scratch.ensemble import GradientBoostingClassifier
from ml_from_scratch.model_selection import cross_val_score, KFold
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
params = [
{'n_estimators': 10, 'learning_rate': 0.1, 'max_depth': 4},
{'n_estimators': 50, 'learning_rate': 0.1, 'max_depth': 4},
{'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 4},
{'n_estimators': 150, 'learning_rate': 0.1, 'max_depth': 4},
{'n_estimators': 10, 'learning_rate': 0.2, 'max_depth': 4},
{'n_estimators': 10, 'learning_rate': 0.3, 'max_depth': 4},
{'n_estimators': 10, 'learning_rate': 0.4, 'max_depth': 4},
{'n_estimators': 10, 'learning_rate': 0.1, 'max_depth': 2},
{'n_estimators': 10, 'learning_rate': 0.1, 'max_depth': 3},
{'n_estimators': 10, 'learning_rate': 0.1, 'max_depth': 5},
]
train_acc_list, test_acc_list = [], []
for param in params:
    train_acc, test_acc = cross_val_score(
        estimator=GradientBoostingClassifier(**param),
        X=X,
        y=y,
        cv=cv,
        scoring='accuracy'
    )
    train_acc_list.append(np.mean(train_acc))
    test_acc_list.append(np.mean(test_acc))
summary = pd.DataFrame({
    'n_estimators': [n['n_estimators'] for n in params],
    'learning_rate': [n['learning_rate'] for n in params],
    'max_depth': [n['max_depth'] for n in params],
    'train_acc': train_acc_list,
    'test_acc': test_acc_list,
})
print(summary)
Here are the results of the experiment.
Varying n_estimators:
The accuracy on both the training and test sets generally improves as the number of estimators increases from 10 to 100. Specifically, training accuracy rises from 0.90950 with 10 trees to 0.99750 with 100 trees, and test accuracy improves from 0.877 to 0.900 over the same range. This trend indicates that more trees help the model learn better up to a certain point. However, when the number of trees is increased further to 150, training accuracy reaches perfection (1.00000) while test accuracy slightly decreases to 0.895. This suggests potential overfitting, which diminishes the model's ability to generalize to unseen data.
Varying learning_rate:
As the learning rate increases from 0.1 to 0.4, with n_estimators fixed at 10 and max_depth at 4, there is a noticeable impact on model performance. At a learning rate of 0.1, test accuracy stands at 0.877, which increases slightly to 0.881 at a learning rate of 0.2. The highest test accuracy observed is 0.887 at a learning rate of 0.3, suggesting an optimal learning pace for this model configuration. However, increasing the learning rate further to 0.4 slightly reduces test accuracy to 0.884, indicating that too aggressive a learning pace can skip over the optimal model parameters and hurt performance.
Varying max_depth:
Varying the depth of trees from 2 to 5, with n_estimators fixed at 10 and a learning_rate of 0.1, shows a clear pattern. A depth of 2 yields the lowest test accuracy at 0.851, reflecting potential underfitting due to the model's limited complexity. Increasing the depth to 3 and then to 4 improves test accuracy to 0.869 and 0.877, respectively, indicating that deeper trees enable the model to capture more complex patterns. At a depth of 5, test accuracy reaches 0.881, which does not significantly surpass the accuracy at a depth of 4. This suggests diminishing returns from increasing tree depth beyond this point, with a risk of overcomplicating the model without proportional gains in generalization.
Conclusion.
In conclusion, the Gradient Boosting Classifier has significant potential for complex prediction tasks, providing precise control over model performance through hyperparameter tuning. It is important to note, however, that the Gradient Boosting Classifier can be quite computationally intensive. The iterative nature of boosting, especially as the number of estimators increases, leads to longer training times. For future applications of this classifier, it is recommended to carefully balance predictive performance against training cost, potentially exploring parallel processing techniques or more efficient implementations to deal with time constraints.
References.
[1] Li, C. (n.d.). Gradient Boosting Machines. College of Computer and Information Science, Northeastern University. Retrieved from https://www.chengli.io/tutorials/gradient_boosting.pdf.
[2] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. https://doi.org/10.1016/s0167-9473(01)00065-2.
[3] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232. http://www.jstor.org/stable/2699986.
[4] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
[5] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.