Scaling Inference Time Compute for Machine Learning Engineering Agents
June 2025 — AIMS South Africa (AI for Science Program)
Abstract
Recent findings suggest that reasoning strategies applied at inference time can yield greater performance gains for LLMs than simply increasing model size. In this work, we systematically implement and evaluate inference-time scaling (ITS) strategies within an open-source agentic framework tailored to machine learning engineering tasks.
We extend the AIDE agent scaffold with multiple ITS techniques (self-consistency, self-reflection, and modular task decomposition) and apply them to distilled DeepSeek-R1 models (7B, 14B, 32B). Our best agent achieves a 30% medal rate on MLE-Bench, matching OpenAI's o4-mini and surpassing GPT-4-Turbo.
DeepSeek-32B + Decomposed Planner-Coder
Matches o4-mini • Surpasses GPT-4-Turbo
Research Questions
To what extent can ITS strategies improve the performance of distilled DeepSeek models on ML engineering tasks?
What are the performance trade-offs between different ITS strategies (self-consistency, self-reflection, task decomposition)?
Methodology
🤖 Models Tested
- DeepSeek-R1 7B (distilled)
- DeepSeek-R1 14B (distilled)
- DeepSeek-R1 32B (distilled)
- Baselines: GPT-4-Turbo, o4-mini
📊 Benchmark
- MLE-Bench (10 competitions)
- Image/Text Classification
- Tabular, Seq2Seq, Audio tasks
- Pass@6 evaluation protocol (see the sketch below)
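Under Pass@6, each configuration gets six independent runs per competition, and a competition counts toward "Any Medal" if at least one run earns a medal. A minimal sketch of that aggregation, assuming per-run boolean medal outcomes; the data layout and `pass_at_k` name are illustrative, not MLE-Bench's API:

```python
from typing import Dict, List

def pass_at_k(run_outcomes: Dict[str, List[bool]], k: int = 6) -> float:
    """Fraction of competitions where at least one of the first k runs succeeds.

    run_outcomes maps a competition name to per-run booleans
    (e.g. "did this run earn any medal?"). Hypothetical data layout.
    """
    assert all(len(runs) >= k for runs in run_outcomes.values())
    hits = sum(any(runs[:k]) for runs in run_outcomes.values())
    return hits / len(run_outcomes)

# Illustrative numbers only (not the paper's raw data):
outcomes = {
    "comp_tabular": [False, True, False, False, False, False],
    "comp_image":   [False] * 6,
}
print(f"Any Medal (Pass@6): {pass_at_k(outcomes):.0%}")
```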
ITS Strategies Implemented
Figure 3.1: AIDE solution tree - drafting, fixing, and improvement steps
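As Figure 3.1 indicates, AIDE searches over a tree of candidate solution scripts: it drafts fresh solutions, fixes nodes that crash, and improves the best working node, using execution feedback at every step. A minimal sketch of one such search step under those assumptions; `Node`, `llm`, and `run_script` are placeholders, not AIDE's actual interfaces:

```python
from __future__ import annotations
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    code: str
    metric: float | None = None        # validation score; set after execution
    buggy: bool = False
    children: list[Node] = field(default_factory=list)

def step(tree: list[Node], llm, run_script) -> None:
    """One AIDE-style step: draft, fix, or improve (simplified policy)."""
    working = [n for n in tree if n.metric is not None and not n.buggy]
    broken = [n for n in tree if n.buggy]
    if not working and not broken:                  # empty tree: draft from scratch
        child = Node(code=llm("Draft a solution script for the task."))
    elif broken and (not working or random.random() < 0.3):
        parent = random.choice(broken)              # debug a failing node
        child = Node(code=llm(f"Fix this script:\n{parent.code}"))
        parent.children.append(child)
    else:                                           # improve the current best node
        parent = max(working, key=lambda n: n.metric)
        child = Node(code=llm(f"Improve this script:\n{parent.code}"))
        parent.children.append(child)
    child.metric, child.buggy = run_script(child.code)  # execute, record feedback
    tree.append(child)
```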
Self-Reflection
Iterative refinement based on execution feedback. Best for "competent but flawed" models.
Self-Consistency
Generate N candidate scripts, execute and verify each, and select the best. Increases reliability.
Task Decomposition
Separate Planner + Coder agents. Most effective for 32B.
Figure 3.2: Self-Reflection workflow
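A minimal sketch of the self-reflection loop as Figure 3.2 describes it: generate a script, execute it, and feed tracebacks or metrics back into the model for revision. `llm` and `run_script` are placeholder callables, not the scaffold's real API:

```python
def self_reflect(task: str, llm, run_script, max_rounds: int = 5) -> str:
    """Iteratively refine a script using its own execution feedback.

    Assumes run_script returns (ok, feedback), e.g. a traceback or
    a validation metric. Sketch only.
    """
    code = llm(f"Write a complete solution script for:\n{task}")
    for _ in range(max_rounds):
        ok, feedback = run_script(code)
        if ok:
            break
        code = llm(
            f"Task:\n{task}\n\nYour previous script:\n{code}\n\n"
            f"Execution feedback:\n{feedback}\n\n"
            "Reflect on what went wrong and return a corrected script."
        )
    return code
```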
Figure 3.3: Planner-Coder decomposition
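A minimal sketch of the Planner-Coder decomposition in Figure 3.3, assuming two placeholder callables (`planner_llm`, `coder_llm`): the Planner commits to one high-level design, and the Coder implements it in a single pass. Committing to the plan up front is what enforces the architectural coherence noted in the findings:

```python
def plan_then_code(task: str, planner_llm, coder_llm) -> str:
    """Decomposed agent: Planner writes a plan, Coder implements it.

    Both callables are illustrative placeholders.
    """
    plan = planner_llm(
        f"Task:\n{task}\n\nWrite a numbered plan covering data loading, "
        "features, model choice, training, and submission format. No code."
    )
    return coder_llm(
        f"Task:\n{task}\n\nImplement this plan exactly as one Python script, "
        f"keeping the same libraries across all steps:\n{plan}"
    )
```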
Figure 3.4: Self-Consistency - sample N scripts, execute, pick winner
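A minimal sketch of the self-consistency variant in Figure 3.4 (best-of-N with execution-based verification rather than answer voting); `llm` and `run_script` are placeholder callables:

```python
def self_consistency(task: str, llm, run_script, n: int = 5) -> str:
    """Sample N candidate scripts, execute each, keep the best-scoring one.

    Assumes run_script returns (validation_score, is_valid). Sketch only.
    """
    candidates = [llm(f"Write a solution script for:\n{task}") for _ in range(n)]
    scored = []
    for code in candidates:
        score, valid = run_script(code)
        if valid:
            scored.append((score, code))
    if not scored:
        raise RuntimeError("No candidate produced a valid submission.")
    return max(scored, key=lambda pair: pair[0])[1]   # pick the winner
```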
Key Results
Figure 4.1: Comparison of top-performing agents. DS-32B + Decomposed achieves 30% medal rate (Pass@6), matching o4-mini.
Figure 4.2: Any Medal % (Pass@6) across all configurations. Task decomposition provides the biggest gains for the 32B model.
| Model + Strategy | Valid Submission | Above Median | Any Medal |
|---|---|---|---|
| o4-mini (baseline) | 90% | 70% | 30% |
| GPT-4-Turbo (baseline) | 80% | 40% | 20% |
| 🏆 DS-32B + Decomposed | 80% | 60% | 30% |
| DS-32B + Self-Consistency | 80% | 50% | 10% |
| DS-14B + Self-Reflection | 80% | 30% | 20% |
| DS-7B (all strategies) | 50% | 10% | 0% |
Pass@6 results on 10 MLE-Bench competitions
Key Findings
🚫 ITS Cannot Rescue Weak Models
The 7B model's fundamental code generation deficits meant no ITS strategy could produce medal-winning solutions. ITS amplifies capability—it doesn't create it.
🏆 Task Decomposition Wins for Capable Models
Planner-Coder separation enforces architectural coherence, avoiding library confusion. The 32B model with this strategy matches o4-mini.
🎯 Self-Reflection: Sweet Spot at 14B
Most effective for "competent but flawed" models. Doubled the 14B medal rate but caused "overthinking" in the already-capable 32B.
⚖️ No One-Size-Fits-All Strategy
The optimal ITS strategy depends on both model scale and task type. Strategy selection is a critical design decision.
Contributions
Open-Source Agent Scaffold
Accessible, specialized agentic framework optimized for ITS strategies in ML engineering tasks.
Empirical Validation
Comprehensive experiments showing ITS can elevate open-source LLMs to match proprietary SOTA.
Supervision
Supervisors: Arnol Fokam, Arnu Pretorius
Institutions: InstaDeep & Stellenbosch University, South Africa
Program: AI for Science Masters, AIMS South Africa (DeepMind Partnership)