Please note: This master’s thesis presentation will take place online.
Abhranil Chandra, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Sebastian Fischmeister
Recent progress in AI has been driven primarily along two axes: scaling large generative Foundation Models (FMs) trained on static internet-scale data, and developing better sequential decision-making (DM) and reinforcement learning (RL) algorithms that enable experiential learning. Each paradigm on its own has flaws, but together they provide a scalable recipe toward general intelligence.
When AlphaGo played its famous “Move 37”, it reaffirmed Reinforcement Learning’s (RL) efficacy as a paradigm for bootstrapping intelligence from scratch through interactive learning, optimizing for goal-directed behavior, self-improvement, and the emergence of novel superhuman abilities. However, RL is hard to scale beyond narrow tasks or simulated environments, given the scarcity of real-world decision-centric data, the sparsity of feedback, rewards that are hard to design, and the difficulty of scaling to larger models. In contrast, recent generative FMs pretrained on static internet-scale text and image data excel at acquiring broad high-level knowledge but fail to acquire robust internal World Models, which limits their agency, their ability to plan and reason well, and their extrapolation beyond training data.
In this thesis, we focus on combining these complementary paradigms, both building and using FMs for DM and developing DM and RL tools for improving FMs. We treat DM/RL as a paradigm for fine-tuning and optimizing general-purpose pretrained models to elicit better decisions beyond their training data, rather than as a paradigm for bootstrapping intelligence from scratch. Such broadly capable systems can then power agents that perceive, reason, and act robustly, both in physical settings to complete embodied tasks and in virtual settings as autonomous task-completion and knowledge-discovery agents.
First, we introduce \textit{VideoAgent}, a jointly trained goal-conditioned video generation policy and self-improving simulator for embodied planning tasks. VideoAgent learns to refine its own generated plans using a novel self-conditioning consistency objective and feedback from pretrained vision-language models (VLMs), without requiring ground-truth action labels or explicit rewards. The model further leverages search to iteratively improve its video plans with inference-time compute, leading to more grounded and physically plausible plans for robotic manipulation tasks.
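To make the refinement loop concrete, the short Python sketch below shows one way such a generate, critique, refine, and search loop could be organized; the VideoPolicy and VLMCritic interfaces, the selection rule, and the search budget are illustrative placeholders rather than the actual thesis implementation.

import random

class VideoPolicy:
    """Hypothetical goal-conditioned video generation policy (placeholder)."""

    def generate(self, observation, goal):
        # In practice this would sample a video plan (a sequence of frames)
        # conditioned on the current observation and the language goal.
        return {"frames": [observation], "goal": goal}

    def refine(self, plan, feedback):
        # Self-conditioned refinement: re-generate conditioned on the model's
        # own previous plan plus the critic's feedback.
        refined = dict(plan)
        refined["frames"] = plan["frames"] + ["refined_frame"]
        return refined

class VLMCritic:
    """Hypothetical pretrained VLM used as a feedback source (placeholder)."""

    def score(self, plan):
        # In practice the VLM would judge whether the plan is physically
        # plausible and progresses toward the goal; here we fake a scalar.
        return random.random()

def plan_with_search(policy, critic, observation, goal, budget=4):
    """Iteratively refine video plans at inference time, keeping the best one."""
    best_plan = policy.generate(observation, goal)
    best_score = critic.score(best_plan)
    for _ in range(budget):
        candidate = policy.refine(best_plan, feedback=best_score)
        score = critic.score(candidate)
        if score > best_score:
            best_plan, best_score = candidate, score
    return best_plan, best_score

plan, score = plan_with_search(VideoPolicy(), VLMCritic(), "initial_frame", "stack the blocks")
print(f"selected plan with {len(plan['frames'])} frames, critic score {score:.2f}")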
Second, we develop \textit{Reg-ReBoot}, a framework for investigating efficient and scalable methods that turn base, non-reasoning large language models (LLMs) into better reasoners without explicit verified data or rule-based verifiers. We analyze a counterintuitive idea: fine-tuning language models on unverified, and even incorrect, reasoning traces can improve their reasoning. We show that LLMs, owing to their inductive biases, can learn useful reasoning heuristics by averaging over noisy chain-of-thought (CoT) data. Our results on mathematical reasoning benchmarks reveal that noisy synthetic data can be an efficient way to bootstrap performance and reduce reliance on hard-to-curate verified solutions. From these insights, we propose a two-stage mid-training pipeline that lowers the barrier to scalable reasoning improvement.
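As a rough illustration of the idea (not the actual Reg-ReBoot code), the Python sketch below builds a fine-tuning set from unverified chain-of-thought samples and splits it into two mid-training stages; the sampling stub, field names, and the stage split are assumptions made for the sake of the example.

import random

def sample_cot(question, num_samples=4):
    """Hypothetical stand-in for sampling CoT traces from a base LLM (no verifier)."""
    return [f"step-by-step attempt #{i} for: {question}" for i in range(num_samples)]

def build_unverified_sft_set(questions, num_samples=4):
    """Collect noisy, unverified reasoning traces as fine-tuning examples.

    No rule-based verifier or ground-truth check filters the traces; the bet
    is that the model's inductive biases average out the noise during training.
    """
    examples = []
    for q in questions:
        for trace in sample_cot(q, num_samples):
            examples.append({"prompt": q, "completion": trace, "verified": False})
    return examples

def two_stage_midtraining_split(examples, first_stage_fraction=0.5):
    """Illustrative two-stage schedule: broad exposure first, then a narrower slice."""
    random.shuffle(examples)
    cut = int(len(examples) * first_stage_fraction)
    return examples[:cut], examples[cut:]

data = build_unverified_sft_set(["What is 17 * 24?", "Solve x + 3 = 10 for x."])
stage_1, stage_2 = two_stage_midtraining_split(data)
print(f"{len(data)} unverified traces -> stage 1: {len(stage_1)}, stage 2: {len(stage_2)}")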
Finally, we address the evaluation bottleneck in generative modeling by proposing \textit{ReFeR}, a multi-agent, tuning-free evaluation framework. ReFeR uses a hierarchy of pretrained LLMs and VLMs to provide automatic, scalable, and high-quality feedback on both textual and visual generations. The framework not only rivals human-level evaluation accuracy but also produces structured feedback, enabling downstream distillation and fine-tuning, and it proves effective even on complex reasoning tasks.
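A minimal Python sketch of such a two-level hierarchy is shown below; the peer_review and area_chair functions stand in for calls to pretrained LLM/VLM judges, and the averaging rule is only a placeholder for whatever aggregation the higher-level model actually performs in the framework.

import statistics

def peer_review(judge_name, item):
    """Hypothetical lower-level evaluator: one LLM/VLM judge scoring a generation."""
    score = (hash((judge_name, item)) % 5) + 1  # fake 1-5 rating in place of a model call
    return {"judge": judge_name, "score": score,
            "rationale": f"{judge_name}: brief rationale for '{item}'"}

def area_chair(reviews):
    """Hypothetical higher-level model aggregating peer reviews into structured feedback."""
    return {
        "final_score": round(statistics.mean(r["score"] for r in reviews), 2),
        "peer_rationales": [r["rationale"] for r in reviews],
    }

def evaluate(item, judges=("judge_a", "judge_b", "judge_c")):
    """Two-level hierarchy: independent peer evaluations first, then one aggregator."""
    return area_chair([peer_review(name, item) for name in judges])

feedback = evaluate("a model-generated summary of a news article")
print(feedback["final_score"], len(feedback["peer_rationales"]), "peer rationales")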
Together, these works contribute toward the goal of building autonomous, self-improving agents: systems powered by foundation models that leverage test-time compute, generative simulations and world models, and diverse learning signals beyond explicit rewards and human feedback to drive interactive learning and decision making.