🎯 AIMS Coursework

Active Learning Core Concepts

Late 2024 — Query Strategies on Forest Covertype Dataset

Uncertainty Sampling Query-by-Committee Entropy Sampling Margin Sampling

Why Active Learning?

In traditional ML, labeled data is a prerequisite — but obtaining labels is expensive, especially in domains like healthcare or NLP. Active learning strategically selects the most informative data points for labeling, significantly reducing the amount of labeled data needed while maintaining or improving model performance.

View on GitHub →

Dataset: Forest Covertype

The Forest Covertype Dataset contains cartographic features to predict forest cover types. Visualized in 2D using PCA.

Dataset Visualization

11,000

Total samples

10,000

Training

35

Initial labels

9,965

Unlabeled pool

The Performance Gap

Comparing model performance with minimal labels (35 samples) vs. full training data (10,000 samples) highlights the opportunity for active learning.

Performance Gap

Query Strategies Compared

Uncertainty Sampling

  • Least Confident: Lowest max probability
  • Margin: Smallest gap between top-2 classes
  • Entropy: Highest information-theoretic uncertainty

Query-by-Committee (QBC)

Measures disagreement among an ensemble of models trained on bootstrapped subsets.

Query-by-Committee (QBC)

QBC Confidence Interval

Entropy Sampling

Entropy Sampling

Margin Sampling

Margin Sampling

Random Sampling (Baseline)

Random Sampling

All Strategies Combined

All Strategies Combined

Observations:

  • 🥇 QBC & Entropy: Best early-stage performance, minimizing labeling cost
  • 🥈 Margin Sampling: Competitive but slightly less robust
  • 🥉 Random: Baseline, least improvement over time
  • 📊 Variance: QBC shows slightly more variance (committee-based approach)

Key Takeaways

  • Active learning works — reaches near-full-data performance with far fewer labels
  • QBC and Entropy are most effective in early stages
  • Confidence intervals help assess strategy robustness
  • Real-world impact: Reduces labeling costs in expensive domains