Boosting

Can a set of weak learners create a single strong learner? A weak learner is a classifier that is only slightly correlated with the true classification (it labels examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Boosting is one of the methods that combine weak learners to create a strong learner.

The idea of boosting is to iteratively add weak learners, each one trained to take into account the previous learners' mistakes. In the end, the weak learners are combined to form a single strong learner.

We can think of boosting in terms of residual error: each new model is fit to the residuals of the previous model. Because the residuals are the errors that the previous model could not capture, a new model that focuses on them improves the overall fit.
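A minimal sketch of this residual-fitting loop, assuming small scikit-learn regression trees as the weak learners (the data and hyperparameters here are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds = 50
prediction = np.full_like(y, y.mean())    # F_0: predict the mean
learners = []

for _ in range(n_rounds):
    residual = y - prediction             # what the current ensemble gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                 # fit the next weak learner to the residuals
    learners.append(tree)
    prediction += tree.predict(X)         # add its correction to the ensemble

print("MSE after boosting:", np.mean((y - prediction) ** 2))
```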

Gradient Boosting

Ref Wiki: Read this to get the intuition behind why we train on negative gradients (the negative gradient of the squared-error loss is the residual).
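Concretely, for the squared-error loss the negative gradient with respect to the current prediction is exactly the residual:

\[L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y - F\]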


The method based on residual error is not flexible enough to be applied to a wide range of loss functions, and it is more sensitive to outliers. Gradient boosting performs boosting using gradient descent optimization: it generalizes boosting to arbitrary differentiable loss functions. It is a greedy algorithm and can overfit if run for too many iterations.

We start with a base learner $F_0(x)$ and then fit each new model to the negative gradients of the previous model. At step $m$ we want to reduce the loss $L(y, F_{m-1}(x))$, so we fit a weak learner $h_m(x)$ to the negative gradient of the loss with respect to $F_{m-1}(x)$, which for the squared loss is the residual $y - F_{m-1}(x)$. The model is then updated by adding the new learner $h_m(x)$ to the previous model $F_{m-1}(x)$.

\[F_m(x) = F_{m-1}(x) + \gamma_{m} h_m(x)\]

We also find the optimal step size $\gamma_m$ by minimizing the loss function with respect to $\gamma_m$.

\[\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))\]

We do this for $M$ iterations and then we have the final model $F_M(x)$: the base learner plus a weighted sum of the weak learners.

\[F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)\]
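A sketch of this generic loop, instantiated with the absolute-error loss to show that it is not tied to the squared loss. The helpers `loss` and `negative_gradient` are illustrative, not from a library, and the step size is found with a simple one-dimensional line search:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def loss(y, F):                 # L(y, F): total absolute error
    return np.abs(y - F).sum()

def negative_gradient(y, F):    # -dL/dF for the absolute-error loss
    return np.sign(y - F)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

M = 50
F = np.full_like(y, np.median(y))        # F_0: a constant base learner
models, gammas = [], []

for m in range(M):
    r = negative_gradient(y, F)          # pseudo-residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    h_pred = h.predict(X)
    # line search for the step size gamma_m
    gamma = minimize_scalar(lambda g: loss(y, F + g * h_pred)).x
    F = F + gamma * h_pred               # F_m = F_{m-1} + gamma_m * h_m
    models.append(h)
    gammas.append(gamma)

print("final absolute error:", loss(y, F))
```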

While training the model we can also use a learning rate $\eta$ to control the contribution of each weak learner. The learning rate shrinks each weak learner's contribution: the learner's output is multiplied by $\eta$ before it is added to the model. This helps reduce overfitting (regularization).

\[F_m(x) = F_{m-1}(x) + \eta \gamma_{m} h_m(x)\]
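In practice, libraries expose this shrinkage directly; for example, scikit-learn's `GradientBoostingRegressor` takes a `learning_rate` parameter (the hyperparameter values below are illustrative only):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,    # M: number of boosting iterations
    learning_rate=0.05,  # eta: shrinks each weak learner's contribution
    max_depth=3,         # depth of the weak learners (regression trees)
)
# A smaller learning rate typically needs more estimators but generalizes better.
```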
