In the previous post, we explored the rise of diffusion models in generative AI. The main takeaway was that instead of acting as simple “expected value” predictors, like classic neural networks do, diffusion models can map out complex, arbitrary distributions. This mathematical property is exactly why they are able to generate crisp, detailed images rather than averaging everything into a blurry blob.
The robot learning community is no stranger to the “average the correct answers” kind of behavior either, especially in imitation learning paradigms (our way of teaching robots from expert data). Think of an autonomous car going straight towards a pole: its training dataset has seen expert drivers avoid the pole by going right or by going left. Both methods would be completely valid. A standard neural network, however, doesn’t see two distinct options. It tries to find a mathematical compromise between the two, averages the steering angles, and drives straight into the pole (not a fun ride, if you ask me).
Recently, the robot learning community realized that the core strength of diffusion models could fix this. By letting the policy represent multiple valid possibilities, without averaging them into a disastrous middle ground, a diffusion policy can confidently commit to just one higher-quality path. Going from generating digital art to steering a physical robot might sound like a wild leap, but it actually solves a massive headache in traditional AI control.
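To make the pole example concrete, here is a minimal sketch with made-up numbers (the ±30 degree steering angles and the noise level are assumptions, not measurements). It compares the answer a mean-squared-error regressor converges to with a single sample drawn from the expert distribution, which stands in for what a generative policy would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert data for ONE state (pole dead ahead):
# half the drivers steer left (-30 deg), half steer right (+30 deg).
left = rng.normal(loc=-30.0, scale=2.0, size=500)
right = rng.normal(loc=30.0, scale=2.0, size=500)
expert_angles = np.concatenate([left, right])

# For a single state, the MSE-optimal prediction is the mean of the targets.
mse_prediction = expert_angles.mean()

# A generative policy samples from the distribution instead of averaging it.
# Drawing from the data itself is a stand-in for a trained model here.
sampled_action = rng.choice(expert_angles)

print(f"MSE-optimal steering angle: {mse_prediction:.1f} deg")
print(f"Sampled steering angle:     {sampled_action:.1f} deg")
```

The averaged answer sits near zero degrees, which is exactly the "straight into the pole" behavior, while the sampled answer lands in one of the two valid modes.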

How to leverage Diffusion Models for Robot Learning
Up until now we’ve been talking about the world of pixels and image generation. But robots don’t output pixels, they output physical trajectories. So, how do we take an architecture built for generating DALL-E images and use it to control a robotic arm worth thousands of dollars?
The conceptual leap is actually surprisingly elegant: we simply swap out the “image” for an “action trajectory”. Instead of denoising a 2D grid of RGB colors, the diffusion model denoises a sequence of motor commands over time, like joint angles, velocities, or the XYZ coordinates of a robotic gripper. The model still starts with pure, random noise, but as it iteratively removes that noise, it converges on a smooth, physically viable path for the robot to follow. The mathematical process of “drawing a picture” becomes the exact same process as “planning a complex movement.”
However, getting a robot to successfully execute these generated plans in the real world takes a bit more finesse than just prompting an image generator. The ‘how’ in training diffusion policies is rarely a single training run. It typically uses a two-stage approach: first building an imitation learning prior and then refining it via reinforcement learning.
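Here is a rough sketch of what "denoising an action trajectory" can look like, using the standard DDPM reverse update on an array of shape (timesteps, action dimensions). Everything named here is an assumption for illustration: the noise schedule, the 16-step XY trajectory, and especially `toy_noise_predictor`, which replaces the trained network with the exact answer for a dataset containing a single trajectory. A real policy would also condition on camera images or robot state, which is left out.

```python
import numpy as np

NUM_STEPS = 50
BETAS = np.linspace(1e-4, 0.02, NUM_STEPS)      # assumed linear noise schedule
ALPHAS = 1.0 - BETAS
ALPHA_BARS = np.cumprod(ALPHAS)

def toy_noise_predictor(noisy_actions, k, target):
    """Stand-in for a trained network: exact noise when the data is ONE trajectory."""
    return (noisy_actions - np.sqrt(ALPHA_BARS[k]) * target) / np.sqrt(1.0 - ALPHA_BARS[k])

def sample_action_trajectory(predict_noise, horizon, action_dim, rng):
    """DDPM reverse process: start from pure noise and denoise step by step."""
    x = rng.standard_normal((horizon, action_dim))
    for k in reversed(range(NUM_STEPS)):
        eps_hat = predict_noise(x, k)
        # Mean of the reverse step given the predicted noise.
        mean = (x - BETAS[k] / np.sqrt(1.0 - ALPHA_BARS[k]) * eps_hat) / np.sqrt(ALPHAS[k])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(x.shape) if k > 0 else np.zeros_like(x)
        x = mean + np.sqrt(BETAS[k]) * noise
    return x

# Hypothetical "expert" plan: the gripper moves along a straight line in XY.
target = np.stack([np.linspace(0.0, 1.0, 16), np.linspace(0.0, 0.5, 16)], axis=1)
rng = np.random.default_rng(0)
plan = sample_action_trajectory(lambda x, k: toy_noise_predictor(x, k, target), 16, 2, rng)
print(plan[:3])
```

The loop is the same one an image sampler runs; only the shape of the array changed from (height, width, channels) to (timesteps, action dimensions).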
Imitation Learning Prior
In the robotics community, we often use Imitation Learning (Behavior Cloning) to jumpstart a model. Since behavior cloning is just supervised learning on expert demonstrations, the synergy with diffusion models is straightforward: the same noising and denoising process is used. The only difference is that we generate robot actions instead of images.
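As a minimal sketch of that training objective, assuming the usual noise-prediction loss from DDPM: pick a random diffusion step, corrupt the expert action chunk with noise, and ask the model to guess the noise. The function name, the array shapes, and the `untrained` placeholder model are all illustrative assumptions; conditioning on observations is omitted to keep it short.

```python
import numpy as np

def diffusion_bc_loss(predict_noise, expert_actions, alpha_bars, rng):
    """Noise-prediction loss on a batch of expert action chunks.

    expert_actions has shape (batch, horizon, action_dim).
    """
    batch = expert_actions.shape[0]
    k = rng.integers(0, len(alpha_bars), size=batch)      # random diffusion step per sample
    eps = rng.standard_normal(expert_actions.shape)       # the noise we add
    ab = alpha_bars[k].reshape(batch, 1, 1)
    noisy = np.sqrt(ab) * expert_actions + np.sqrt(1.0 - ab) * eps
    eps_hat = predict_noise(noisy, k)                     # the model's guess
    return np.mean((eps_hat - eps) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
expert_actions = rng.uniform(-1.0, 1.0, size=(64, 16, 2))   # made-up demonstrations

# Placeholder "model" that always predicts zero noise.
untrained = lambda noisy, k: np.zeros_like(noisy)
loss = diffusion_bc_loss(untrained, expert_actions, alpha_bars, rng)
print(f"loss of a model that predicts zero noise: {loss:.3f}")
```

A model that always answers zero scores roughly the variance of the noise itself; training pushes the loss below that baseline.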

Figure: Results from “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, Chi et al.
Reinforcement Learning Finetuning
Imitation is a great start, but it has a few problems. First, the policy will only ever be as good as the data it is given. Second, and counterintuitively, if the data comes from an expert who is never wrong, the policy we are training will never learn how to recover from mistakes (which are all but guaranteed in the world of statistical learners, such as neural networks).
Reinforcement learning overcomes these problems: it doesn’t want to imitate anyone, it wants to maximize its reward. Also, since it is closed-loop, it will face its own errors and learn how to recover from them.
But here is where we hit a wall: diffusion models are inherently built to minimize a reconstruction loss. They are trained via (approximate) maximum likelihood to make their generated outputs look like the training data. Reinforcement Learning, on the other hand, doesn’t care about mimicking data (it doesn’t even have target data): it cares about maximizing a reward, and the optimal behavior to achieve that reward might not even exist in any fixed dataset.
Merging these two worlds is not trivial. To train a diffusion model with RL using standard gradients, you would theoretically need to backpropagate the reward signal through the entire reverse diffusion chain. That means calculating gradients through dozens or hundreds of sequential denoising steps, which is a total nightmare for memory and stability.
The Reinforcement Learning Mental Shift
To grasp this fully, a mental shift helps. We typically view Reinforcement Learning strictly as reward optimization. While true, RL is more fundamentally a mathematical framework that can optimize an objective even when that objective is not differentiable.
Standard deep learning requires smooth, continuous math to calculate gradients and update weights. RL bypasses this constraint. It observes final outcomes and increases the probability of the network outputting actions that led to those results. We usually define these results as task success, but the target can be anything. For example, you could use RL to minimize a robot arm’s total energy consumption or maximize the physical smoothness of a trajectory.
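A tiny sketch of that idea, assuming a one-dimensional Gaussian policy and the classic score-function (REINFORCE) estimator. The reward is a hard step function invented for this example, so ordinary backpropagation through it would give a zero gradient almost everywhere. The update never differentiates the reward; it only raises the probability of actions that scored above the batch average.

```python
import numpy as np

def reward(action):
    # Non-differentiable: 1 inside the target window around 2.0, 0 outside.
    return (np.abs(action - 2.0) < 0.5).astype(float)

rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.5    # policy is N(mu, sigma^2); only mu is learned

for _ in range(300):
    actions = rng.normal(mu, sigma, size=256)
    rewards = reward(actions)
    advantages = rewards - rewards.mean()        # batch-mean baseline to reduce variance
    grad_log_prob = (actions - mu) / sigma**2    # d/dmu of log N(a; mu, sigma^2)
    mu += lr * np.mean(advantages * grad_log_prob)

print(f"learned mean action: {mu:.2f}")
```

Swapping the step reward for an energy or smoothness score would not change a single line of the update, which is the whole point.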
With diffusion models, the optimization changes. We are not just updating the network to output a specific final action. We are optimizing the intermediate denoising steps the network takes to build that action.
How to train Diffusion Policies with RL?
So, how do we combine diffusion’s “imitate the data” with RL’s “maximize the score”? Standard RL expects a direct, one-step link between seeing a state and taking an action, but diffusion takes dozens of intermediate “denoising” steps. Here are three clever ways researchers are bridging this gap:
In traditional RL, the “Actor-Critic” setup is like an improvisational actor trying out moves, and a strict director (the Critic) scoring them. In this version, the diffusion model becomes the Actor.
How it works: We train a “Critic” network to accurately predict the final reward of a robot’s action. During training, the Critic evaluates the batches of trajectories the diffusion model generates. We then use gradient-based updates to tweak the diffusion model, essentially telling it: “Generate more of what the Critic likes.” To keep the robot from glitching out, we add a penalty if its new moves drift too far away from the realistic, human data it originally learned from.
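One common way to write this trade-off down is as a single loss with two terms. The notation is an assumption made for this sketch: ε_θ is the noise-prediction network, a_k is an expert action noised to diffusion step k, a^0 is the final denoised action sampled from the policy, Q_φ is the Critic, and η is a hand-tuned weight. Individual papers differ in the details.

```latex
\mathcal{L}(\theta)
  = \underbrace{\mathbb{E}_{k,\,\epsilon,\,(s,a)\sim\mathcal{D}}
      \left[\lVert \epsilon - \epsilon_\theta(a_k, s, k) \rVert^2\right]}_{\text{stay close to the expert data}}
  \;-\; \eta\,
    \underbrace{\mathbb{E}_{s\sim\mathcal{D},\; a^0\sim\pi_\theta(\cdot\mid s)}
      \left[Q_\phi(s, a^0)\right]}_{\text{generate what the Critic likes}},
\qquad
a_k = \sqrt{\bar{\alpha}_k}\,a + \sqrt{1-\bar{\alpha}_k}\,\epsilon
```

Setting η to zero recovers plain imitation; increasing it lets the Critic pull the policy further away from the demonstrations.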
What if we want to avoid retraining the diffusion model entirely? This approach leaves the original model untouched and instead steers it while it generates an action.
How it works: Alongside the diffusion model, we train a separate “Value” network that can look at a partially noisy action and predict if it’s heading toward a good outcome. As the diffusion model does its standard job of cleaning up the noise step-by-step, the Value network peeks in and calculates tiny mathematical nudges. We inject these nudges directly into each denoising update. It’s like having a backseat driver gently pushing the steering wheel to keep you on the optimal path, without having to teach you how to drive from scratch.
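A sketch of a single guided denoising step, in the spirit of classifier guidance: the frozen diffusion model supplies the mean of the step, and the gradient of the value estimate shifts that mean before noise is added. The quadratic value function, the goal, and the guidance scale are made up for illustration; a real Value network would be learned and would supply its gradient via autodiff.

```python
import numpy as np

def guided_step(mean, sigma, value_grad, guidance_scale, rng):
    """One reverse step whose mean is nudged uphill on the value estimate."""
    nudge = guidance_scale * sigma**2 * value_grad(mean)
    return mean + nudge + sigma * rng.standard_normal(mean.shape)

goal = np.ones((4, 2))   # hypothetical "good" action chunk

def value_grad(actions):
    # Gradient of the toy value V(a) = -||a - goal||^2.
    return -2.0 * (actions - goal)

mean = np.zeros((4, 2))  # what the frozen diffusion model proposes for this step
sigma = 0.1

# Same random seed for both calls, so the only difference is the nudge.
unguided = guided_step(mean, sigma, value_grad, 0.0, np.random.default_rng(0))
guided = guided_step(mean, sigma, value_grad, 5.0, np.random.default_rng(0))

print(np.linalg.norm(unguided - goal), np.linalg.norm(guided - goal))
```

The diffusion model's weights are never touched; only the sampling path changes.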
This approach requires a mental shift: what if the “action” isn’t the physical movement of the robot at all?
How it works: Algorithms like DPPO treat the diffusion process itself as the RL game. The “environment” is the noisy state, and the AI’s “move” is simply deciding how much noise to remove at that specific step. By breaking the process down this way, we can unleash off-the-shelf, highly stable RL algorithms like PPO. The reward only comes at the very end of the denoising process, but given enough data, PPO can trace that success backward to credit the specific noise-removal decisions that created the winning trajectory.
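The reason this works is that each denoising step is a Gaussian draw, so every "move" has a log-probability that PPO can push up or down. Below is a simplified sketch of the two ingredients: the log-probability of one denoising move and PPO's clipped objective. The function names are hypothetical, and how the final reward is turned into a per-step advantage is glossed over here; the actual DPPO implementation may differ.

```python
import numpy as np

def step_log_prob(x_next, mean, sigma):
    """Log-density of one denoising move, x_next ~ N(mean, sigma^2 I)."""
    d = x_next.size
    squared = np.sum(((x_next - mean) / sigma) ** 2)
    return -0.5 * squared - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi)

def ppo_clipped_objective(new_log_probs, old_log_probs, advantage, clip=0.2):
    """Standard PPO surrogate, applied to denoising steps instead of robot actions."""
    ratio = np.exp(new_log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

# One denoising move that landed exactly on the predicted mean.
lp = step_log_prob(np.zeros(2), np.zeros(2), 1.0)

# A move whose probability grew a lot is capped by the clip range.
objective = ppo_clipped_objective(np.array([1.0]), np.array([0.0]), advantage=1.0)
print(lp, objective)
```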
The Secret Sauce: Action Chunking
Earlier, we noted that to make a diffusion model drive a robot, we swap out a static image for an “action trajectory.” The AI denoises a sequence of motor commands over time, rather than just computing its next immediate move. While diffusion provides the mathematical engine, predicting a full sequence is a standalone paradigm shift in AI control. Generating a clear plan is only half the battle; executing it in the physical world requires an architectural choice that pairs perfectly with diffusion policies: Action Chunking.
Thinking in Sequences
When driving a car, you don’t consciously tell your hands to “turn the wheel 0.5 degrees for 10 milliseconds, then re-evaluate.” You plan sequences, like turning into the left lane over three seconds.
Traditional AI control did the opposite. Policies were typically “Markovian”: the AI observes the current state and predicts a single action for that exact millisecond. Action chunking abandons this one-step-at-a-time convention. The neural network predicts a “chunk” of future actions all at once, say, the next 10 to 50 timesteps.
In the real world, however, the robot doesn’t execute the entire sequence blindly. It uses a receding-horizon approach: it starts executing the first few steps, but before finishing, it re-evaluates the environment and predicts a new chunk. This provides the benefit of a longer-term plan while remaining reactive to changes.
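The receding-horizon loop can be sketched in a few lines. The policy and environment below are toy stand-ins (a one-dimensional position that should reach 1.0), and the chunk length of 8 with 4 executed steps is an arbitrary choice for the example.

```python
import numpy as np

def run_receding_horizon(policy, env_step, obs, total_steps, execute_steps):
    """Predict a chunk, execute only its first few actions, then replan."""
    executed = []
    num_replans = 0
    while len(executed) < total_steps:
        chunk = policy(obs)                      # shape (chunk_len, action_dim)
        num_replans += 1
        for action in chunk[:execute_steps]:     # the rest of the chunk is discarded
            obs = env_step(obs, action)
            executed.append(action)
            if len(executed) == total_steps:
                break
    return np.array(executed), obs, num_replans

# Toy policy: plan 8 equal steps that would cover the remaining distance to 1.0.
policy = lambda obs: np.full((8, 1), (1.0 - obs) / 8.0)
# Toy environment: the action is a position increment.
env_step = lambda obs, action: obs + float(action[0])

actions, final_obs, num_replans = run_receding_horizon(
    policy, env_step, obs=0.0, total_steps=20, execute_steps=4
)
print(num_replans, final_obs)
```

Because only half of each chunk is executed, the policy gets a fresh look at the world every four steps while still reasoning about an eight-step plan.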
Why Predict the Future?
Predicting an entire trajectory at once solves several practical deployment issues. First, it buys time. Vision-Language-Action (VLA) models are powerful, but computationally heavy. Forcing a large model to compute commands a hundred times a second leads to lag. Chunking allows the robot to execute the current sequence while the AI computes the next one in the background.
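A back-of-the-envelope version of "buying time", with invented numbers: if the model needs 120 ms to produce a chunk and the controller wants a new command every 20 ms, the robot has to execute at least six actions from the current chunk before the next one is ready.

```python
def min_steps_per_chunk(inference_ms: int, control_period_ms: int) -> int:
    """Smallest number of actions to execute while the next chunk is computed."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-inference_ms // control_period_ms)

# Hypothetical numbers: 120 ms inference latency, 50 Hz control (20 ms period).
steps = min_steps_per_chunk(inference_ms=120, control_period_ms=20)
print(steps)
```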
Second, chunking mitigates covariate shift. When a model predicts actions every millisecond, tiny errors compound. A slight miscalculation puts the robot in an unfamiliar position, causing a larger mistake on the next step. Committing to a sequence and checking the environment less frequently reduces these compounding feedback loops.
Finally, chunking forces the AI to learn meaningful representations. If an AI only predicts one millisecond ahead, it can cheat. Since physical motion is continuous, the AI can often just repeat its last motor command and minimize its loss. Predicting further into the future breaks this shortcut, requiring the model to understand the scene’s geometry and plan a coherent path.
The Catch: Why Action Chunking is Hard
Putting action chunking into practice comes with hurdles. The first is a Goldilocks problem: choosing the right chunk size. If the chunk is too short, you lose the planning benefits. If it is too long, the policy becomes rigid and fails to complete tasks.
Furthermore, chunking struggles in highly dynamic environments. Imagine driving blindfolded and being told to accelerate for two whole seconds: what happens if someone jumps in front of the car? The policy needs to be highly responsive. However, if only part of each chunk is ever executed, the part of the plan that is never executed is never optimized.
The biggest challenge is integrating chunking with Reinforcement Learning (RL). Imitation learning easily adopts chunking, but RL traditionally assumes that optimal policies only need to look one step ahead. Naively plugging action chunking into standard RL algorithms, which typically rely on Gaussian policies, causes performance to collapse.
Predicting the Future in RL: Proposed Methods
Researchers have proposed several ways to make Reinforcement Learning handle action sequences. Here is a quick look at how the community is tackling the problem.
Hierarchical RL (HRL): proposed in papers like OPAL (Ajay et al., 2021) and SUPE-GT (Wilcoxson et al., 2024). You use two policies instead of one. A high-level policy picks a broad goal, and a low-level policy executes the multi-step sequence to get there.
The catch: Training both at the same time is highly unstable.
Factorized Critics: proposed in Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning (Seo & Abbeel, 2024). This method chops the continuous action space into discrete bins and refines the sequence step by step.
The catch: Forcing fluid motion into rigid boxes limits how flexible the robot can be.
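To see what "rigid boxes" costs, here is a generic uniform-binning sketch. It is not the scheme from the paper, just the simplest possible discretization, and the bin count and action range are assumptions. Every continuous action is snapped to the center of its bin, so the policy can never express anything finer than half a bin width.

```python
import numpy as np

def discretize(actions, low, high, num_bins):
    """Map continuous actions to bin indices and back to bin centers."""
    edges = np.linspace(low, high, num_bins + 1)
    # digitize returns 1-based bin positions; clip so `high` falls in the last bin.
    idx = np.clip(np.digitize(actions, edges) - 1, 0, num_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return idx, centers[idx]

actions = np.array([-1.0, 0.1, 1.0])
idx, quantized = discretize(actions, low=-1.0, high=1.0, num_bins=4)
print(idx, quantized, np.abs(actions - quantized).max())
```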
Movement Primitives: explored in TOP-ERL (Li et al., 2025). The AI predicts the parameters for a pre-defined mathematical curve rather than raw motor commands.
The catch: It produces smooth motion but lacks real-time adaptability for messy tasks.
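As a generic illustration of the idea (a plain radial-basis-function curve, not the specific parameterization used in TOP-ERL), the policy below would output just five weights, and a fixed set of basis functions expands them into a 50-step trajectory. The weights and the bandwidth are made up.

```python
import numpy as np

def rbf_basis(num_steps, num_basis, bandwidth=0.02):
    """Normalized radial basis functions over a unit time interval."""
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    centers = np.linspace(0.0, 1.0, num_basis)[None, :]
    phi = np.exp(-((t - centers) ** 2) / (2.0 * bandwidth))
    return phi / phi.sum(axis=1, keepdims=True)   # each row sums to 1

# Hypothetical policy output: 5 curve parameters instead of 50 raw commands.
weights = np.array([0.0, 0.2, 0.8, 1.0, 1.0])
trajectory = rbf_basis(num_steps=50, num_basis=5) @ weights
print(trajectory[:5])
```

Smoothness comes for free from the basis functions, which is also why the curve cannot suddenly change its mind halfway through.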
Latent Planning: seen in Value Prediction Network (Oh et al., 2017) and MuZero (Schrittwieser et al., 2020). The AI builds an internal simulation of the world and tests action sequences in its head before moving.
The catch: These mental simulations are computationally heavy and hard to use with offline data.
Q-Chunking: introduced in Reinforcement Learning with Action Chunking (Li et al., 2026). It evaluates raw action sequences directly and uses expressive flow-matching models to handle complex data distributions.
The catch: It introduces additional computational costs associated with best-of-N sampling during operation.
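Best-of-N sampling itself is simple, and a sketch makes the cost obvious: the generative policy is sampled N times and the critic is queried N times for every decision. The sampler and the Q-function below are toy placeholders, not the flow-matching policy or the learned critic from the paper.

```python
import numpy as np

def best_of_n(sample_chunk, q_value, obs, n, rng):
    """Draw n candidate action chunks and keep the one the critic scores highest."""
    candidates = [sample_chunk(obs, rng) for _ in range(n)]
    scores = np.array([q_value(obs, chunk) for chunk in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

# Toy stand-ins: random 5-step chunks, and a critic that prefers actions near 0.5.
sample_chunk = lambda obs, rng: rng.uniform(-1.0, 1.0, size=(5, 1))
q_value = lambda obs, chunk: -float(np.sum((chunk - 0.5) ** 2))

rng = np.random.default_rng(0)
obs = np.zeros(3)
best_chunk, scores = best_of_n(sample_chunk, q_value, obs, n=8, rng=rng)
print(scores.round(2), q_value(obs, best_chunk))
```

With N set to 8, each control decision costs eight policy samples and eight critic evaluations instead of one.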
The Current Landscape
Every proposed method requires a compromise. You have to sacrifice stability with HRL, flexibility with Factorized Critics and Movement Primitives, efficiency with Latent Planning, or take on added computational sampling costs with Q-Chunking. The robotics community is actively testing all of these approaches to find the best balance for real-world deployments.
The Bottlenecks and the Future
We have established that merging diffusion models, reinforcement learning, and action chunking creates a highly capable robotic brain. But we are not quite ready to put humanoid robots in every home.
The biggest roadblock right now is speed. Diffusion models are inherently slow because turning random noise into a crisp action plan requires dozens of mathematical steps. While action chunking buys some time by planning ahead, running these heavy computations natively on a physical robot is taxing. Furthermore, methods like Q-Chunking solve the RL exploration problem by evaluating multiple possible futures at once, which adds a significant computational burden. Finally, there is the reaction problem. Chunking is fantastic for smooth, predictable tasks like folding clothes, but it is terrible for catching a falling glass. Even though real-world deployment uses a receding horizon and only plays out part of each plan, this setup still doesn’t fit cleanly into standard reinforcement learning formulations.
I can’t predict the future (unlike the dynamic latent models), but my guess is that the community will be tackling these exact problems for the foreseeable future. To solve the speed issue, researchers are pushing towards faster math, exploring techniques like Consistency Models that aim to compress the long denoising process down to a single leap. To fix the reaction problem, a major milestone will be creating models that automatically determine their own chunk boundaries rather than relying on manual guessing and tuning. The AI will learn to predict long sequences when the coast is clear, and instantly switch to short, rapid predictions when the environment gets chaotic.
Conclusion
To wrap things up, we are looking at a fundamental shift in how we build brains for robots. For years, robot learning was stuck trying to squeeze messy, unpredictable real-world data into strict, deterministic, one-step boxes. The result was often an AI that compromised, averaged out the best solutions, and ended up crashing the car. Diffusion policies, paired with action chunking, flip that entire script.
By moving from rigid, single-step control to generative, sequence-based behavior, we are teaching robots to navigate the physical world the same way image generators navigate pixels. They no longer average their options or stumble forward millisecond by millisecond. Instead, they confidently pick one valid path and commit to a cohesive, multi-step plan, smoothly filtering out the noise along the way. We still have to iron out the latency and compute issues, but the resulting leap in physical dexterity is undeniable. The generative AI revolution has finally grown hands, and it is going to be exciting to see what it builds next.