📉 AIMS Coursework

Overtraining Analysis

Late 2024 — Decision Stumps, Trees, and Overfitting

Tags: Decision Trees · Random Forest · Gini Index · Overfitting

About

Exploring binary classification with decision stumps and trees, focusing on overtraining tendencies and generalization. Covers Gini index calculation, optimal threshold determination, and a comparison of a Decision Tree against a Random Forest under different hyperparameters.

Methodology

Dataset

Physics data with two features, Δηjj (detajj, the pseudorapidity separation of the two jets) and mjj (massjj, the dijet invariant mass), used for binary classification of signal versus background events.
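As a minimal loading sketch (the file name vbf_data.csv and the column names detajj, massjj, and label are assumptions, not taken from the coursework):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; the real dataset layout may differ.
    df = pd.read_csv("vbf_data.csv")
    X = df[["detajj", "massjj"]]   # Delta-eta_jj and m_jj
    y = df["label"].astype(int)    # 1 = signal, 0 = background

    # Hold out a test sample so train vs. test distributions can be
    # compared in the overtraining check below.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=42
    )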

Decision Stump

  • Gini index calculation (see the sketch after this list)
  • Optimal threshold: 220 GeV (massjj)
  • The chosen cut minimizes the weighted Gini impurity (cost) of the split
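
A minimal sketch of that search, assuming NumPy arrays with integer labels (1 = signal, 0 = background); on the coursework data this kind of scan yields the 220 GeV cut on massjj quoted below:

    import numpy as np

    def gini(labels):
        """Gini impurity of a node: 1 - sum_k p_k^2 over class fractions p_k."""
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    def best_threshold(feature, labels):
        """Scan candidate cuts on one feature and return the threshold that
        minimizes the weighted Gini impurity of the two resulting nodes."""
        best_t, best_cost = None, np.inf
        for t in np.unique(feature):
            left, right = labels[feature <= t], labels[feature > t]
            w_left = len(left) / len(labels)
            cost = w_left * gini(left) + (1.0 - w_left) * gini(right)
            if cost < best_cost:
                best_t, best_cost = t, cost
        return best_t, best_cost

    # Reusing the split from the loading sketch above:
    t_star, cost = best_threshold(X_train["massjj"].to_numpy(),
                                  y_train.to_numpy())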

Overtraining Check

  • Compare train vs. test distributions of the classifier output (sketched below)
  • Decision Tree (max_depth=2)
  • Random Forest (100 estimators)
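
A sketch of the check, reusing the split above (binning choices are illustrative): fit each model, then overlay the predicted signal probability on the train and test samples; distributions that disagree indicate overtraining.

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    models = {
        "Decision Tree (max_depth=2)": DecisionTreeClassifier(max_depth=2),
        "Random Forest (100 estimators)": RandomForestClassifier(n_estimators=100),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        # Predicted signal probability on both samples.
        p_train = model.predict_proba(X_train)[:, 1]
        p_test = model.predict_proba(X_test)[:, 1]
        plt.figure()
        plt.hist(p_train, bins=30, range=(0, 1), density=True,
                 histtype="step", label="train")
        plt.hist(p_test, bins=30, range=(0, 1), density=True,
                 histtype="step", label="test")
        plt.title(name)
        plt.xlabel("predicted signal probability")
        plt.legend()
    plt.show()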

Results

Decision Stump Cut

Optimal threshold at 220 GeV separates signal and background

Decision Tree (max_depth=2) — Good Generalization

Train and test distributions overlap well — minimal overfitting

Random Forest (Default) — Overfits!

Severe overfitting — train/test mismatch

Random Forest (max_leaf_nodes=32) — Fixed!

Constrained complexity reduces overfitting
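
A sketch of the fix under the same assumptions: max_leaf_nodes caps the number of leaves per tree in scikit-learn, so individual trees cannot grow deep enough to memorize the training sample.

    from sklearn.ensemble import RandomForestClassifier

    # Cap each tree at 32 leaves; all other settings stay at their defaults.
    rf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=32,
                                random_state=42)
    rf.fit(X_train, y_train)

    # A small gap between these two scores (together with overlapping
    # train/test probability distributions) indicates the overfitting is gone.
    print("train accuracy:", rf.score(X_train, y_train))
    print("test accuracy: ", rf.score(X_test, y_test))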

Key Takeaways

  • Decision stumps work well for simple feature separation
  • Overtraining check crucial — compare train vs test distributions
  • Random Forest can severely overfit without constraints
  • Hyperparameter tuning (max_leaf_nodes) prevents overfitting