Bagging

Given a standard training set \(D\) of size n, bagging generates m new training sets \(D_{i}\), each of size n′, by sampling from \(D\) uniformly and with replacement. By sampling with replacement, some observations may be repeated in each \(D_{i}\).
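For intuition, here is a minimal sketch of a single bootstrap draw, assuming NumPy; the dataset size and seed are arbitrary illustration values. When n′ = n, roughly 63.2% of the original observations appear in each \(D_{i}\) on average (the rest are the out-of-bag observations discussed below).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # size of the original dataset D
D = np.arange(n)              # indices of the observations in D

# One bootstrap sample D_i: n draws from D, uniformly and with replacement
Di = rng.choice(D, size=n, replace=True)

unique = np.unique(Di)
print(f"unique observations in D_i: {len(unique)}")   # ~63.2% of n on average
print(f"duplicate draws: {n - len(unique)}")
```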

Three datasets are involved in this process: the original dataset, the bootstrap samples, and the out-of-bag (OOB) sets.

Start with the original dataset \(D\) of size n. Suppose a bootstrap sample \(D_{i}\) contains \(\bar n\) unique observations from \(D\). The remaining \(n - \bar n\) observations never appear in \(D_{i}\) and form its out-of-bag (OOB) set.

We train a separate model on each \(D_{i}\) and evaluate it on the corresponding OOB set. Repeating this for all bootstrap samples and aggregating the predictions of the individual models gives the final ensemble.
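As a concrete illustration, here is a minimal sketch of that loop, assuming scikit-learn decision trees as the base learners and a toy dataset from make_classification; the number of models m and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n, m = len(X), 25                  # m = number of bootstrap samples / models

models, oob_scores = [], []
for _ in range(m):
    idx = rng.choice(n, size=n, replace=True)        # bootstrap sample D_i
    oob = np.setdiff1d(np.arange(n), idx)            # out-of-bag observations
    tree = DecisionTreeClassifier()                  # high-variance base learner
    tree.fit(X[idx], y[idx])
    oob_scores.append(tree.score(X[oob], y[oob]))    # test on the OOB set
    models.append(tree)

# Aggregate: majority vote over the m trees
votes = np.stack([t.predict(X) for t in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print(f"mean OOB accuracy of individual trees: {np.mean(oob_scores):.3f}")
print(f"training accuracy of the ensemble:     {(ensemble_pred == y).mean():.3f}")
```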

Bagging (short for bootstrap aggregating) tries to reduce the variance of the model by training it on different resampled datasets and aggregating the results.

Each learner is high variance, low bias (because we train each model on a different subset of the data and allow it to overfit to some extent). Aggregating the learners reduces the variance of the ensemble: individual learners have large variance, but averaging (or voting over) their predictions cancels much of it out, as the sketch below illustrates. Since the bias of each learner is low, the bias of the ensemble is also low, though it can be somewhat higher than the bias of the individual learners.
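To see the variance reduction concretely, here is a rough sketch (assuming scikit-learn) comparing how much a single fully grown tree's predictions vary across independent training sets versus a bagged ensemble of 25 such trees; the sine-wave data, repetition counts, and noise level are arbitrary illustration values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_data(n=200):
    # Noisy sine wave: y = sin(x) + noise
    x = rng.uniform(0, 6, n)
    return x.reshape(-1, 1), np.sin(x) + rng.normal(0, 0.3, n)

x_test = np.linspace(0, 6, 50).reshape(-1, 1)
single_preds, bagged_preds = [], []

for _ in range(30):                    # repeat over independent training sets
    X, y = sample_data()
    # Single high-variance learner (fully grown tree)
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test))
    # Bagged ensemble of 25 trees trained on bootstrap samples of (X, y)
    preds = []
    for _ in range(25):
        idx = rng.choice(len(X), size=len(X), replace=True)
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x_test))
    bagged_preds.append(np.mean(preds, axis=0))

# Variance of the predictions across training sets, averaged over test points
print("single tree variance:     ", np.var(single_preds, axis=0).mean())
print("bagged ensemble variance: ", np.var(bagged_preds, axis=0).mean())
```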

Pros:

  1. Bagging acts as a form of regularization, since its objective is the same: reduce the variance of the model and prevent overfitting.

  2. The predictions of the ensemble are more robust (more stable), since they aggregate the results of multiple models.

  3. Bagging can be used for both classification and regression.

Cons:

  1. Bagging takes up a lot of memory and compute, since multiple models must be trained and stored. This can be a problem when working with large datasets.

  2. Loss of interpretability, since the final prediction aggregates the results of multiple models.

  3. For weak learners with high bias, bagging carries that high bias into the aggregated model.

Random Forest

Interview Questions

1. How does bagging handle imbalanced datasets, and what is the impact on the performance of the ensemble?

Bagging in its default setting may not work well on imbalanced datasets. The minority class is underrepresented (and can be entirely absent) in some bootstrap samples, so the base learners do not learn it well. This can be mitigated by oversampling the minority class within each bootstrap sample, which improves the ensemble's performance on the minority class.
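A minimal sketch of that idea, assuming scikit-learn and NumPy: each ordinary bootstrap sample is rebalanced by oversampling its minority class before the base learner is fitted. The class weights, model count, and variable names are arbitrary illustration values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Imbalanced binary problem: roughly 10% minority class (label 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
n, m = len(X), 25

models = []
for _ in range(m):
    idx = rng.choice(n, size=n, replace=True)        # ordinary bootstrap sample
    Xb, yb = X[idx], y[idx]
    # Oversample the minority class within this bootstrap sample
    minority = np.where(yb == 1)[0]
    majority = np.where(yb == 0)[0]
    extra = rng.choice(minority, size=max(len(majority) - len(minority), 0),
                       replace=True)
    keep = np.concatenate([majority, minority, extra])
    models.append(DecisionTreeClassifier().fit(Xb[keep], yb[keep]))

# Majority vote of the ensemble
votes = np.stack([t.predict(X) for t in models])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("recall on the minority class:", (pred[y == 1] == 1).mean())
```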
