UCLA Electronic Theses and Dissertations

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Abstract

Recent large language models (LLMs) have demonstrated strong reasoning capabilities, often enhanced through online reinforcement learning (RL), particularly within the left-to-right autoregressive (AR) generation paradigm. In contrast, diffusion-based LLMs (dLLMs), which generate text in a coarse-to-fine manner, have shown competitive language modeling performance, but their reasoning abilities remain underexplored. To address this gap, we propose d1, a framework for adapting pre-trained masked dLLMs into effective reasoning agents through a combination of supervised finetuning (SFT) and RL. Specifically, we introduce two techniques tailored for reasoning: (a) a masked SFT procedure that distills reasoning patterns and encourages self-improvement from existing datasets, and (b) diffu-GRPO, a novel critic-free, policy-gradient RL algorithm that represents the first integration of policy gradient methods with masked dLLMs. We conduct empirical evaluations across mathematical, planning, and coding benchmarks and find that d1 substantially improves reasoning performance over a strong dLLM baseline. Code is available at https://dllm-reasoning.github.io/.
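
The abstract characterizes diffu-GRPO only as a critic-free, policy-gradient RL algorithm, so the sketch below is a minimal illustration of how a group-relative (GRPO-style) objective of that kind is typically computed, assuming G sampled completions per prompt and scalar task rewards. The function name grpo_style_loss and the way completion log-likelihoods would be estimated under a masked dLLM are assumptions for illustration, not details taken from the dissertation.

```python
# Illustrative sketch of a GRPO-style, critic-free policy-gradient loss.
# Assumption: rewards are normalized within a group of completions sampled
# for the same prompt, which replaces a learned value-function baseline.
import torch

def grpo_style_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Critic-free policy-gradient loss over a group of sampled completions.

    Args:
        log_probs: (G,) summed log-likelihoods of G completions for one prompt
                   under the current policy (hypothetical estimator for a dLLM).
        rewards:   (G,) scalar task rewards, e.g. answer correctness.
    """
    # Group-relative advantage: center and scale rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # REINFORCE-style objective: raise the likelihood of above-average samples.
    return -(advantages.detach() * log_probs).mean()

# Usage with dummy values: 4 completions, binary correctness rewards.
log_probs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = grpo_style_loss(log_probs, rewards)
loss.backward()
```

The group-wise normalization plays the role that a learned critic would otherwise serve as a baseline, which is what makes such an update critic-free; the abstract does not specify how d1 adapts this estimator to masked diffusion decoding.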