In this article we look at bagging, or Bootstrap Aggregating, a powerful and intuitive ensemble technique that reduces variance, enhances robustness, and improves overall predictive accuracy. This technique is particularly effective for high-variance models, such as deep decision trees, where it forms the backbone of popular algorithms like Random Forest.
Instead of relying on a single model, bagging constructs an ensemble of models, each trained on a randomly sampled subset of the training data. The idea is to average the predictions of these models (for regression) or use majority voting (for classification).
Bagging does not reduce bias; it only reduces variance. It is most effective when the base models are high-variance, low-bias estimators (e.g., deep decision trees).
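To make the mechanics concrete, here is a minimal from-scratch sketch of bagging for classification. It is illustrative only: it assumes NumPy arrays, non-negative integer class labels, and any scikit-learn-style estimator with fit/predict; the bagging_predict helper and its parameters are ours, not part of scikit-learn.
import numpy as np
from sklearn.base import clone

def bagging_predict(base_estimator, X_train, y_train, X_test, n_estimators=10, seed=0):
    """Train n_estimators clones of base_estimator on bootstrap samples
    and combine their predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # bootstrap sample: n rows drawn with replacement
        model = clone(base_estimator).fit(X_train[idx], y_train[idx])
        predictions.append(model.predict(X_test))
    votes = np.array(predictions)  # shape: (n_estimators, n_test)
    # Majority vote per test point (for regression, take the mean instead)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), axis=0, arr=votes)
In practice we will not implement this by hand: scikit-learn's BaggingClassifier, used below, does the same job with more options.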
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
We will use the Adult dataset from the UCI Machine Learning Repository, which contains census income data for individuals from many countries. The data is a bit old, since it was collected in 1994, and has 32,561 rows.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income',
]
df = pd.read_csv(url, names=column_names, sep=',\\s*', engine='python', na_values='?')
print(f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
df.sample(5)
The dataset contains 32561 rows and 15 columns.
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26934 | 24 | Local-gov | 452640 | Some-college | 10 | Never-married | Tech-support | Not-in-family | White | Male | 14344 | 0 | 50 | United-States | >50K |
| 15798 | 40 | Private | 209833 | HS-grad | 9 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 29713 | 19 | Local-gov | 243960 | Some-college | 10 | Never-married | Sales | Own-child | White | Female | 0 | 0 | 16 | United-States | <=50K |
| 16011 | 39 | Private | 186420 | HS-grad | 9 | Divorced | Adm-clerical | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 30899 | 30 | Federal-gov | 127610 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Female | 0 | 0 | 35 | United-States | <=50K |
The meaning of the columns is the following:
- age: discrete (from 17 to 90)
- workclass (private, federal-government, etc): nominal (9 categories)
- fnlwgt: the final weight (the number of people the census believes the entry represents): discrete
- education (the highest level of education obtained): ordinal (16 categories)
- education-num (a numeric encoding of the education level): discrete (from 1 to 16)
- marital-status: nominal (7 categories)
- occupation (transport-moving, craft-repair, etc): nominal (15 categories)
- relationship (position within the family: Unmarried, Not-in-family, etc.): nominal (6 categories)
- race: nominal (5 categories)
- sex: nominal (2 categories)
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week (hours worked per week): discrete (from 1 to 99)
- native-country: nominal (42 countries)
- income (whether an individual makes more than $50,000 annually): binary (<=50K, >50K)
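These cardinalities are easy to sanity-check on the raw data; the snippet below counts the distinct values per categorical column (with dropna=False, the missing values read in as NaN from the '?' placeholder count as one extra category).
# Distinct values per categorical column, counting NaN (originally '?') as one category
categorical_columns = df.select_dtypes(include='object').columns
print(df[categorical_columns].nunique(dropna=False))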
df = df.dropna()
print(f"# rows after removing missing values: {len(df)}")
# rows after removing missing values: 30162
features_to_use = [
    'age',
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
]
X = df[features_to_use].copy()
# One-hot encode the categorical columns; drop_first avoids a redundant dummy per column
categorical_cols = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
y = df['income'].copy()
y = (y == '>50K').astype(int)
print(f"# positive entries: {y.mean():.2%}, # negative entries: {(1 - y).mean():.2%}")
print(f"Final number of features (after one-hot encoding): {X.shape[1]}")
# positive entries: 24.89%, # negative entries: 75.11%
Final number of features (after one-hot encoding): 94
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(f"# train samples: {len(X_train):,}, # test samples: {len(X_test):,}")
# train samples: 22,621, # test samples: 7,541
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_single = single_tree.predict(X_test)
accuracy_single = accuracy_score(y_test, y_pred_single)
f1_single = f1_score(y_test, y_pred_single)
print(f"Test Accuracy: {accuracy_single:.4f} ({accuracy_single*100:.2f}%)")
print(f"F1 Score: {f1_single:.4f}")
cv_scores_single = cross_val_score(single_tree, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation Accuracy: {cv_scores_single.mean():.4f} (+/- {cv_scores_single.std():.4f})")
Test Accuracy: 0.8124 (81.24%)
F1 Score: 0.6197
Cross-validation Accuracy: 0.8147 (+/- 0.0042)
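Before moving to bagging, it is worth checking the training accuracy as well: an unpruned tree typically fits the training data almost perfectly, and the gap between train and test accuracy is the symptom of the high variance that bagging is meant to reduce.
# An unpruned tree tends to memorise the training set: a symptom of high variance
train_accuracy_single = accuracy_score(y_train, single_tree.predict(X_train))
print(f"Train Accuracy: {train_accuracy_single:.4f}")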
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model: a fully grown decision tree
    n_estimators=20,                     # number of trees in the ensemble
    max_samples=0.5,                     # each tree is trained on a 50% sample of the rows
    max_features=1.0,                    # each tree draws as many features as there are columns...
    bootstrap=True,                      # rows are sampled with replacement
    bootstrap_features=True,             # ...and features are sampled with replacement as well
    random_state=42,
    n_jobs=-1,                           # train the trees in parallel
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
f1_bagging = f1_score(y_test, y_pred_bagging)
print(f"Test Accuracy: {accuracy_bagging:.4f} ({accuracy_bagging*100:.2f}%)")
print(f"F1 Score: {f1_bagging:.4f}")
cv_scores_bagging = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation Accuracy: {cv_scores_bagging.mean():.4f} (+/- {cv_scores_bagging.std():.4f})")
Test Accuracy: 0.8540 (85.40%)
F1 Score: 0.6730
Cross-validation Accuracy: 0.8574 (+/- 0.0043)
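To put the two models side by side, we can collect the metrics computed above into a small summary table:
# Summary of the single tree vs. the bagging ensemble, using the metrics computed above
comparison = pd.DataFrame({
    'Test Accuracy': [accuracy_single, accuracy_bagging],
    'F1 Score': [f1_single, f1_bagging],
}, index=['Single Decision Tree', 'Bagging (20 trees)'])
print(comparison.round(4))
Bagging improves both accuracy and F1 over the single tree, at the cost of training twenty trees instead of one.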