📊 AIMS Coursework

Comparing Uncertainty Methods

Late 2024 — Deep Ensembles vs MC Dropout vs Deterministic

Tags: Deep Ensembles · MC Dropout · OoD Detection · Rejection Curves

About

A comparison of uncertainty estimation methods built on a LeNet-5 network with dropout, trained on Dirty-MNIST and evaluated on MNIST and Fashion-MNIST. Covers classification with rejection, out-of-distribution (OoD) detection, and density-based uncertainty.


Methods Compared

Deep Ensembles

Multiple independently trained models averaged for predictions

MC Dropout

Dropout at inference time for Bayesian approximation

Deterministic

Standard neural network (baseline)
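
In code, the three modes differ only in how forward passes are drawn and averaged. A minimal PyTorch-style sketch (function names and the n_samples parameter are illustrative, not taken from the repository):

    import torch

    def predict_deterministic(model, x):
        # Single forward pass of one trained network (the baseline).
        model.eval()
        with torch.no_grad():
            return torch.softmax(model(x), dim=-1)

    def predict_mc_dropout(model, x, n_samples=20):
        # Keep dropout active at test time and average stochastic forward passes.
        model.train()  # leaves dropout layers on (assumes no batch-norm layers)
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
        return probs.mean(dim=0), probs  # mean prediction + per-sample probabilities

    def predict_ensemble(models, x):
        # Average the predictions of independently trained models.
        with torch.no_grad():
            probs = torch.stack([torch.softmax(m.eval()(x), dim=-1) for m in models])
        return probs.mean(dim=0), probs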

Model Training Results

Models trained on Dirty-MNIST and evaluated on the MNIST test set.

Method            Accuracy    NLL
Deep Ensembles    98.9%       0.0367
MC Dropout        99.0%       0.04
Deterministic     98.8%       0.067
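
Both metrics are computed from the averaged predictive distribution. A small sketch of how they are typically evaluated (the eps guard against log(0) is an assumption, not from the repository):

    import torch
    import torch.nn.functional as F

    def accuracy_and_nll(mean_probs, labels, eps=1e-12):
        # mean_probs: (N, C) averaged predictive probabilities, labels: (N,) targets.
        acc = (mean_probs.argmax(dim=-1) == labels).float().mean().item()
        nll = F.nll_loss(torch.log(mean_probs + eps), labels).item()
        return acc, nll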

Classification with Rejection

Ranking predictions by uncertainty and rejecting the least confident ones. Higher rejection rates → only confident predictions retained → higher accuracy.
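
The curves below can be reproduced with a simple sweep: score each test point by uncertainty, reject the most uncertain fraction, and measure accuracy on what remains. A sketch using predictive entropy as the score (NumPy, illustrative names; other uncertainty scores can be swapped in):

    import numpy as np

    def rejection_curve(probs, labels, rejection_rates=np.linspace(0.0, 0.9, 10)):
        # probs: (N, C) mean predictive probabilities, labels: (N,) ground truth.
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        order = np.argsort(entropy)                        # most confident first
        correct = (probs.argmax(axis=1) == labels)[order]
        accuracies = []
        for r in rejection_rates:
            keep = max(int(round(len(correct) * (1.0 - r))), 1)
            accuracies.append(correct[:keep].mean())       # accuracy on retained points
        return rejection_rates, np.array(accuracies)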

Figure: Accuracy vs rejection rate

Figure: NLL vs rejection rate

Observations:

  • Deep Ensemble achieves highest accuracy due to prediction diversity
  • MC Dropout leverages stochasticity but slightly trails Deep Ensemble
  • At high rejection rates (>70%), all methods converge
  • Deep Ensemble shows best calibration (lowest NLL)

Out-of-Distribution Detection

Detecting whether inputs come from the training distribution (Dirty-MNIST) or a different distribution (Fashion-MNIST).

Deep Ensemble

Figure: Deep Ensemble OoD detection

AUROC: BALD 0.89 | Pred. Entropy 0.97 | Cond. Entropy 0.95

MC Dropout

Figure: MC Dropout OoD detection

AUROC: BALD 0.93 | Pred. Entropy 0.96 | Cond. Entropy 0.95

Key Finding: Predictive entropy achieves the highest AUROC for OoD detection. MC Dropout captures epistemic uncertainty (BALD) more effectively than Deep Ensembles in this setting.
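
The three scores are standard decompositions of the predictive distribution over stochastic passes (MC Dropout samples or ensemble members). A sketch of how they are typically computed, using scikit-learn's roc_auc_score for the AUROC (array shapes and names are assumptions):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def uncertainty_scores(probs, eps=1e-12):
        # probs: (S, N, C) softmax outputs from S stochastic passes / ensemble members.
        mean_probs = probs.mean(axis=0)
        pred_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)      # total uncertainty
        cond_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)   # expected aleatoric
        bald = pred_entropy - cond_entropy                                        # epistemic (mutual information)
        return pred_entropy, cond_entropy, bald

    # AUROC treats OoD (Fashion-MNIST) as the positive class:
    # auroc = roc_auc_score(is_ood, pred_entropy)   # is_ood: 1 for OoD inputs, 0 otherwise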

Key Takeaways

  • Deep Ensembles win for accuracy and calibration in rejection scenarios
  • Predictive Entropy is most reliable for OoD detection (AUROC 0.96-0.97)
  • MC Dropout is a practical alternative when training multiple models is expensive
  • Rejection-based classification is useful for high-stakes applications (e.g., medical or autonomous systems)