In this article we look at bagging, or Bootstrap Aggregating, a powerful and intuitive ensemble technique that reduces variance, enhances robustness, and improves overall predictive accuracy. This technique is particularly effective for high-variance models, such as deep decision trees, where it forms the backbone of popular algorithms like Random Forest.

Instead of relying on a single model, bagging trains an ensemble of models, each on a bootstrap sample of the training data, i.e. a random sample drawn with replacement (typically the same size as the original set). The final prediction averages the individual predictions (for regression) or takes a majority vote (for classification).

Bagging does not reduce bias; it only reduces variance: averaging many weakly correlated, overfit models cancels out much of their individual noise while leaving their systematic error unchanged. It is therefore most effective when the base models are high-variance, low-bias estimators (e.g., deep decision trees).
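
To make the mechanism concrete, here is a minimal from-scratch sketch of bagging for binary classification. It is illustrative only (the function name and its arguments are not part of any library) and assumes the features and labels are NumPy arrays and that the base model follows the scikit-learn fit/predict interface:

import numpy as np
from sklearn.base import clone

def bagging_predict(base_model, X, y, X_new, n_estimators=20, seed=0):
    # Train n_estimators clones of base_model, each on a bootstrap sample
    # (rows drawn with replacement), then combine predictions by majority vote.
    rng = np.random.default_rng(seed)
    n = len(X)
    predictions = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)      # bootstrap sample of row indices
        model = clone(base_model).fit(X[idx], y[idx])
        predictions.append(model.predict(X_new))
    votes = np.mean(predictions, axis=0)      # fraction of models voting for class 1
    return (votes >= 0.5).astype(int)         # majority vote (labels assumed to be 0/1)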

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import LabelEncoder
import seaborn as sns

We will use the Adult (Census Income) dataset from the UCI Machine Learning Repository, which records demographic attributes and income for individuals. The data is a bit old, since it was extracted from the 1994 US Census, and contains 32,561 rows.

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income',
]

df = pd.read_csv(url, names=column_names, sep=',\\s*', engine='python', na_values='?')
print(f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
df.sample(5)
The dataset contains 32561 rows and 15 columns.
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
26934 24 Local-gov 452640 Some-college 10 Never-married Tech-support Not-in-family White Male 14344 0 50 United-States >50K
15798 40 Private 209833 HS-grad 9 Never-married Craft-repair Not-in-family White Male 0 0 40 United-States <=50K
29713 19 Local-gov 243960 Some-college 10 Never-married Sales Own-child White Female 0 0 16 United-States <=50K
16011 39 Private 186420 HS-grad 9 Divorced Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K
30899 30 Federal-gov 127610 Bachelors 13 Never-married Adm-clerical Not-in-family White Female 0 0 35 United-States <=50K

The meaning of the columns is the following (a short snippet to verify the category counts appears after the list):

  • age: discrete (from 17 to 90)
  • workclass (private, federal-government, etc): nominal (9 categories)
  • fnlwgt: the final weight (the number of people the census believes the entry represents): discrete
  • education (the highest level of education obtained): ordinal (16 categories)
  • education-num (a numeric, ordinal encoding of the education level): discrete (from 1 to 16)
  • marital-status: nominal (7 categories)
  • occupation (transport-moving, craft-repair, etc): nominal (15 categories)
  • relationship (the person's role in the family: unmarried, not-in-family, etc): nominal (6 categories)
  • race: nominal (5 categories)
  • sex: nominal (2 categories)
  • capital-gain: continuous
  • capital-loss: continuous
  • hours-per-week (hours worked per week): discrete (from 1 to 99)
  • native-country: nominal (42 countries)
  • income (whether or not an individual makes more than $50,000 annually): boolean (≤$50k, >$50k)
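
As a quick sanity check, the category counts listed above can be read directly off the dataframe. The snippet below reuses the df loaded earlier; nunique(dropna=False) counts the missing values, originally encoded as '?', as their own category:

categorical_columns = df.select_dtypes(include='object').columns
# Number of distinct values per categorical column, counting NaN (the former '?') as a category
print(df[categorical_columns].nunique(dropna=False))
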
# Drop the rows with missing values (the '?' entries were converted to NaN when loading)
df = df.dropna()
print(f"# rows after removing missing values: {len(df)}")
# rows after removing missing values: 30162
features_to_use = [
    'age',
    'workclass',
    'education',
    'marital-status', 
    'occupation',
    'relationship',
    'race',
    'sex', 
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
]

X = df[features_to_use].copy()

categorical_cols = X.select_dtypes(include=['object']).columns

# One-hot encode the categorical columns, dropping the first level of each to avoid redundant columns
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

y = df['income'].copy()
y = (y == '>50K').astype(int)
print(f"# positive entries: {y.mean():.2%}, # negative entries: {(1 - y).mean():.2%}")
print(f"Final number of features (after one-hot encoding): {X.shape[1]}")
Positive class (>50K): 24.89%, negative class (<=50K): 75.11%
Final number of features (after one-hot encoding): 94
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(f"# train samples: {len(X_train):,}, # test samples: {len(X_test):,}")
# train samples: 22,621, # test samples: 7,541
# Baseline: a single, fully grown decision tree (a high-variance model)
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)

y_pred_single = single_tree.predict(X_test)
accuracy_single = accuracy_score(y_test, y_pred_single)
f1_single = f1_score(y_test, y_pred_single)

print(f"Test Accuracy: {accuracy_single:.4f} ({accuracy_single*100:.2f}%)")
print(f"F1 Score: {f1_single:.4f}")

cv_scores_single = cross_val_score(single_tree, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation Accuracy: {cv_scores_single.mean():.4f} (+/- {cv_scores_single.std():.4f})")
Test Accuracy: 0.8124 (81.24%)
F1 Score: 0.6197
Cross-validation Accuracy: 0.8147 (+/- 0.0042)
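
Before turning to bagging, we can observe the instability of a single deep tree directly. The sketch below is illustrative (the number of trees and the disagreement measure are arbitrary choices): it fits five unconstrained trees on different bootstrap samples of the training set and measures how often their test-set predictions differ.

rng = np.random.default_rng(0)
bootstrap_preds = []
for _ in range(5):
    # Bootstrap sample: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train.iloc[idx], y_train.iloc[idx])
    bootstrap_preds.append(tree.predict(X_test))

bootstrap_preds = np.asarray(bootstrap_preds)
# Fraction of test samples on which the five trees do not all agree
disagreement = (bootstrap_preds.min(axis=0) != bootstrap_preds.max(axis=0)).mean()
print(f"Test samples where the 5 trees disagree: {disagreement:.1%}")
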
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model: an unconstrained, high-variance decision tree
    n_estimators=20,                     # number of models in the ensemble
    max_samples=0.5,                     # each model draws 50% of the training rows
    max_features=1.0,                    # each model draws as many features as there are columns
    bootstrap=True,                      # rows are sampled with replacement (the bootstrap)
    bootstrap_features=True,             # features are sampled with replacement as well
    random_state=42,
    n_jobs=-1,                           # fit the models in parallel
)

bagging_clf.fit(X_train, y_train)

y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
f1_bagging = f1_score(y_test, y_pred_bagging)

print(f"Test Accuracy: {accuracy_bagging:.4f} ({accuracy_bagging*100:.2f}%)")
print(f"F1 Score: {f1_bagging:.4f}")

cv_scores_bagging = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation Accuracy: {cv_scores_bagging.mean():.4f} (+/- {cv_scores_bagging.std():.4f})")
Test Accuracy: 0.8540 (85.40%)
F1 Score: 0.6730
Cross-validation Accuracy: 0.8574 (+/- 0.0043)
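
As a follow-up, it is instructive to check how the test accuracy evolves with the ensemble size. The sketch below is illustrative (the grid of ensemble sizes is an arbitrary choice); it retrains the bagging classifier with the same settings as above for several values of n_estimators and plots the resulting test accuracy.

ensemble_sizes = [1, 5, 10, 20, 50, 100]
test_accuracies = []
for n in ensemble_sizes:
    clf = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=n,
        max_samples=0.5,
        bootstrap=True,
        bootstrap_features=True,
        random_state=42,
        n_jobs=-1,
    )
    clf.fit(X_train, y_train)
    test_accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(ensemble_sizes, test_accuracies, marker='o')
plt.xlabel('Number of estimators')
plt.ylabel('Test accuracy')
plt.title('Bagging: test accuracy vs. ensemble size')
plt.show()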