{"id":607,"date":"2026-06-04T13:54:30","date_gmt":"2026-06-04T13:54:30","guid":{"rendered":"https:\/\/florentinudrea.net\/?p=607"},"modified":"2026-06-04T13:54:31","modified_gmt":"2026-06-04T13:54:31","slug":"meet-the-diffusion-policy-generative-ais-leap-into-the-physical-world","status":"publish","type":"post","link":"https:\/\/florentinudrea.net\/index.php\/2026\/06\/04\/meet-the-diffusion-policy-generative-ais-leap-into-the-physical-world\/","title":{"rendered":"Meet the Diffusion Policy: Generative AI\u2019s Leap into the Physical World"},"content":{"rendered":"\n<p>In <a href=\"https:\/\/florentinudrea.net\/index.php\/2026\/04\/17\/diffusion-models-the-secret-sauce-of-generative-ai\/\">the previous post<\/a>, we explored the rise of diffusion models in generative AI. The main takeaway was that instead of acting as simple &#8220;expected value&#8221; predictors, like classic neural networks do, diffusion models can map out complex, arbitrary distributions. This mathematical property is exactly why they are able to generate crisp, detailed images rather than averaging everything into a blurry blob.<\/p>\n\n\n\n<p>The robot learning community is no stranger to the &#8220;average the correct answers&#8221; kind of behavior either, especially in imitation learning paradigms (our way of teaching robots from expert data). Think of an autonomous car going straight towards a pole: <strong>its training dataset has seen expert drivers avoid the pole by going right or by going left<\/strong>. Both methods would be completely valid. A standard neural network, however, doesn&#8217;t see two distinct options. <strong>It tries to find a mathematical compromise between the two, averages the steering angles, and drives straight into the pole<\/strong> (<em>not a fun ride, if you ask me<\/em>).<\/p>\n\n\n\n<p>Recently, the robot learning community realized that the core strength of diffusion models could fix this. 
<strong>By allowing the AI to evaluate multiple valid possibilities, without averaging them into a disastrous middle ground, a diffusion policy can confidently commit to a single, higher-quality path<\/strong>. Going from generating digital art to steering a physical robot might sound like a wild leap, but it actually solves a massive headache in traditional AI control.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"409\" src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image.png\" alt=\"\" class=\"wp-image-608\" srcset=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image.png 600w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-300x205.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><figcaption class=\"wp-element-caption\">A solution to the autonomous car example\u2026 maybe.<\/figcaption><\/figure>\n<\/div>\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"How_to_leverage_Diffusion_Models_for_Robot_Learning\"><\/span><strong>How to leverage Diffusion Models for Robot Learning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Up until now we\u2019ve been talking about the world of pixels and image generation. But robots don&#8217;t output pixels; they execute physical trajectories. So, how do we take an architecture built for generating DALL-E images and use it to control a robotic arm worth thousands of dollars?<\/p>\n\n\n\n<p>The conceptual leap is actually surprisingly elegant: we simply swap out the &#8220;image&#8221; for an &#8220;<strong>action trajectory<\/strong>&#8221;. Instead of denoising a 2D grid of RGB colors, the diffusion model denoises a sequence of motor commands over time, like joint angles, velocities, or the XYZ coordinates of a robotic gripper. 
The model still starts with pure, random noise, but as it iteratively refines that noise, it converges on a smooth, physically viable path for the robot to follow. The mathematical process of &#8220;drawing a picture&#8221; becomes the exact same process as &#8220;planning a complex movement.&#8221;<\/p>\n\n\n\n<p>However, getting a robot to successfully execute these generated plans in the real world takes a bit more finesse than just prompting an image generator. The \u2018how\u2019 in training diffusion policies is rarely a single training run. It usually follows a two-stage approach: first building an <strong>imitation learning prior<\/strong> and then refining it via <strong>reinforcement learning<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Imitation_Learning_Prior\"><\/span>Imitation Learning Prior<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>In the robotics community, we often use <strong>Imitation Learning (Behavior Cloning)<\/strong> to jump-start a model. Since behavior cloning is essentially supervised learning on expert demonstrations, the synergy with diffusion models is straightforward: <strong>the same noising and denoising process is used<\/strong>. 
The only difference is that we generate robot actions instead of images.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"567\" src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-1-1024x567.png\" alt=\"\" class=\"wp-image-615\" style=\"aspect-ratio:1.8060874823624269;width:765px;height:auto\" srcset=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-1-1024x567.png 1024w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-1-300x166.png 300w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-1-768x426.png 768w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/05\/image-1.png 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Diffusion policy showing multimodal behavior, compared with other techniques, when pushing a T block with an end effector (blue dot).<br>Results from <em>&#8220;Diffusion Policy: Visuomotor Policy Learning via Action Diffusion&#8221;<\/em>, Chi et al.<\/figcaption><\/figure>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Reinforcement_Learning_Finetuning\"><\/span>Reinforcement Learning Finetuning<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Imitation is a great start, but it has a few problems. First, the <strong>performance will only be as good as the data it is provided<\/strong>. Second, counterintuitively, if the data is taken from an expert who is never wrong, then <strong>the policy we are training will never learn how to recover from its own mistakes<\/strong> (which are practically guaranteed for statistical learners such as neural networks).<\/p>\n\n\n\n<p><strong>Reinforcement learning overcomes these problems<\/strong>: it doesn&#8217;t try to imitate anyone; it tries to maximize its reward. 
Also, because the policy is trained in closed loop, it experiences the consequences of its own errors and learns how to recover from them.<\/p>\n\n\n\n<p>But here is where we hit a wall: <strong>diffusion models are inherently built to minimize a reconstruction loss<\/strong>. They are trained with a likelihood-based denoising objective so that their generated outputs match the distribution of the training data. <strong>Reinforcement Learning<\/strong>, on the other hand, <strong>doesn&#8217;t care about mimicking data (it doesn&#8217;t even have target data)<\/strong>: it cares about maximizing a reward, and the optimal behavior to achieve that reward might not even exist in any fixed dataset.<\/p>\n\n\n\n<p><strong>Merging these two worlds is not trivial<\/strong>. To train a diffusion model with RL using standard gradients, you would theoretically need to backpropagate the reward signal through the entire reverse diffusion chain. That means calculating gradients through dozens or hundreds of sequential denoising steps, which is a total nightmare for memory and stability.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Reinforcement_Learning_Mental_Shift\"><\/span><strong>The Reinforcement Learning Mental Shift<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>To grasp this fully, a mental shift helps. We typically view Reinforcement Learning strictly as reward optimization. While true, <strong>RL can also be viewed as a general framework for optimizing objectives that are not differentiable<\/strong>.<\/p>\n\n\n\n<p>Standard deep learning requires a smooth, differentiable objective to calculate gradients and update weights. RL bypasses this constraint. It observes final outcomes and increases the probability of the network outputting the actions that led to good ones. We usually define a good outcome as task success, but the target can be anything. 
For example, you could use RL to minimize a robot arm&#8217;s total energy consumption or maximize the physical smoothness of a trajectory.<\/p>\n\n\n\n<p><strong>With diffusion models, the optimization changes. We are not just updating the network to output a specific final action. We are optimizing the intermediate denoising steps the network takes to build that action.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"How_to_train_Diffusion_Policies_with_RL\"><\/span>How to train Diffusion Policies with RL?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>So, how do we combine diffusion&#8217;s &#8220;imitate the data&#8221; with RL&#8217;s &#8220;maximize the score&#8221;? Standard RL expects a direct, one-step link between seeing a state and taking an action, but diffusion takes dozens of intermediate &#8220;denoising&#8221; steps. Here are three clever ways researchers are bridging this gap (<em>click on a title to open its explanation<\/em>):<\/p>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-1&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-1-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-1\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\">Diffusion 
Q-Learning<\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-1\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-1-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>In traditional RL, the &#8220;Actor-Critic&#8221; setup is like an improvisational actor trying out moves, and a strict director (the Critic) scoring them. In this version, the diffusion model becomes the Actor.<\/p>\n\n\n\n<p><strong>How it works:<\/strong> We train a &#8220;Critic&#8221; network to predict the expected long-term reward of a robot&#8217;s action. During training, the Critic evaluates the batches of trajectories the diffusion model generates. We then use ordinary gradient updates to tweak the diffusion model, essentially telling it: &#8220;Generate more of what the Critic likes.&#8221; To keep the robot from glitching out, we add a penalty if its new moves drift too far away from the realistic, human data it originally learned from.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-2&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-2-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-2\" type=\"button\" 
class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\">Reward-guided Diffusion<\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-2\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-2-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>What if we want to avoid retraining the diffusion model entirely? This approach leaves the original model untouched and instead steers it <em>while<\/em> it generates an action.<\/p>\n\n\n\n<p><strong>How it works:<\/strong> Alongside the diffusion model, we train a separate &#8220;Value&#8221; network that can look at a <em>partially noisy<\/em> action and predict if it\u2019s heading toward a good outcome. As the diffusion model does its standard job of cleaning up the noise step-by-step, the Value network peeks in and calculates tiny mathematical nudges. We inject these nudges directly into the math. 
It\u2019s exactly like having a backseat driver gently pushing the steering wheel to keep you on the optimal path, without having to teach you how to drive from scratch.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-3&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-3-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-3\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\">Diffusion-PPO<\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-3\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-3-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>This approach requires a mental shift: what if the &#8220;action&#8221; isn&#8217;t the physical movement of the robot at all?<\/p>\n\n\n\n<p><strong>How it works:<\/strong> Algorithms like DPPO treat <em>the diffusion process itself<\/em> as the RL game. The &#8220;environment&#8221; is the noisy state, and the AI&#8217;s &#8220;move&#8221; is simply deciding how much noise to remove at that specific step. 
By breaking the process down this way, we can apply off-the-shelf, highly stable RL algorithms like PPO. The reward only comes at the very end of the denoising process, but given enough data, PPO can trace that success backward to credit the specific noise-removal decisions that created the winning trajectory.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Secret_Sauce_Action_Chunking\"><\/span><strong>The Secret Sauce: Action Chunking<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Earlier, we noted that to make a diffusion model drive a robot, we swap out a static image for an &#8220;action trajectory.&#8221; The AI denoises a sequence of motor commands over time, rather than just computing its next immediate move. While diffusion provides the mathematical engine, predicting a full sequence is a paradigm shift in AI control in its own right. Generating a clear plan is only half the battle; executing it in the physical world requires an architectural choice that pairs perfectly with diffusion policies: <strong>Action Chunking<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Thinking_in_Sequences\"><\/span>Thinking in Sequences<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>When driving a car, you don&#8217;t consciously tell your hands to &#8220;turn the wheel 0.5 degrees for 10 milliseconds, then re-evaluate.&#8221; You plan sequences, like turning into the left lane over three seconds.<\/p>\n\n\n\n<p>Traditional AI control did the opposite. Policies were typically &#8220;Markovian&#8221;: the AI observes the current state and predicts a single action for that exact millisecond. Action chunking abandons this one-step-at-a-time convention. 
The neural network predicts a &#8220;chunk&#8221; of future actions all at once, say, the next 10 to 50 timesteps.<\/p>\n\n\n\n<p>In the real world, however, the robot doesn&#8217;t execute the entire sequence blindly. It uses a <strong>receding-horizon<\/strong> approach: it starts executing the first few steps, but before finishing, it re-evaluates the environment and predicts a new chunk. This provides the benefit of a longer-term plan while remaining reactive to changes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Why_Predict_the_Future\"><\/span>Why Predict the Future?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Predicting an entire trajectory at once solves several practical deployment issues. <strong>First, it buys time<\/strong>. Vision-Language-Action (VLA) models are powerful, but computationally heavy. Forcing a large model to compute commands a hundred times a second leads to lag. Chunking allows the robot to execute the current sequence while the AI computes the next one in the background.<\/p>\n\n\n\n<p>Second, <strong>chunking mitigates covariate shift<\/strong>. When a model predicts actions every millisecond, tiny errors compound. A slight miscalculation puts the robot in an unfamiliar position, causing a larger mistake on the next step. Committing to a sequence and checking the environment less frequently reduces these compounding feedback loops.<\/p>\n\n\n\n<p>Finally, chunking <strong>forces the AI to learn meaningful representations<\/strong>. If an AI only predicts one millisecond ahead, it can cheat. Since physical motion is continuous, the AI can often just repeat its last motor command and minimize its loss. 
Predicting further into the future breaks this shortcut, requiring the model to understand the scene&#8217;s geometry and plan a coherent path.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Catch_Why_Action_Chunking_is_Hard\"><\/span>The Catch: Why Action Chunking is Hard<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p id=\"p-rc_c5ba560ff95aaf8a-87\">Putting action chunking into practice comes with hurdles. The first is a Goldilocks problem: choosing the right chunk size. If the chunk is too short, you lose the planning benefits. If it is too long, the policy becomes rigid and can fail to complete tasks.<\/p>\n\n\n\n<p id=\"p-rc_c5ba560ff95aaf8a-87\">Furthermore, chunking struggles in highly dynamic environments. Imagine driving blindfolded and being asked to accelerate for two whole seconds: what would happen if someone jumped in front of the car? The policy needs to be highly responsive. However, if we replan often and only the first part of each chunk is ever actuated, the unexecuted remainder of the plan never receives feedback and is never optimized.<\/p>\n\n\n\n<p id=\"p-rc_c5ba560ff95aaf8a-88\">The biggest challenge is integrating chunking with Reinforcement Learning (RL). Imitation learning easily adopts chunking, but RL traditionally assumes that optimal policies only need to look one step ahead. Naively plugging action chunking into standard RL algorithms, which typically rely on Gaussian policies, can cause performance to collapse.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"p-rc_c5ba560ff95aaf8a-89\"><span class=\"ez-toc-section\" id=\"Predicting_the_Future_in_RL_Proposed_Methods\"><\/span><strong>Predicting the Future in RL: Proposed Methods<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p id=\"p-rc_cceda7b4a1889c92-188\">Researchers have proposed several ways to make Reinforcement Learning handle action sequences. 
Here is a quick look at how the community is tackling the problem.<\/p>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-4&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-4-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-4\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\"><strong>Hierarchical Reinforcement Learning (HRL) &amp; Skill-Based Methods:<\/strong> <\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-4\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-4-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>Proposed in papers like <em>OPAL<\/em> (Ajay et al., 2021) and <em>SUPE-GT<\/em> (Wilcoxson et al., 2024). You use two policies instead of one. A high-level policy picks a broad goal, and a low-level policy executes the multi-step sequence to get there. 
<br><em>The catch:<\/em> Training both at the same time is highly unstable.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-5&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-5-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-5\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\"><strong>Factorized Critics<\/strong><\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-5\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-5-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p> Proposed in <em>Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning<\/em> (Seo &amp; Abbeel, 2024). This method chops the continuous action space into discrete bins and refines the sequence step by step. 
<br><em>The catch:<\/em> Forcing fluid motion into rigid boxes limits how flexible the robot can be.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-6&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-6-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-6\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\">Movement Primitives<\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-6\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-6-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>Explored in <em>TOP-ERL<\/em> (Li et al., 2025). The AI predicts the parameters for a pre-defined mathematical curve rather than raw motor commands. 
<br><em>The catch:<\/em> It produces smooth motion but lacks real-time adaptability for messy tasks.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-7&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-7-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-7\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\"><strong>Multi-Step Latent Space Planning<\/strong><\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-7\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-7-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p id=\"block-925e7e3d-f2a9-4980-9817-2c7e74ed0f75\">Seen in <em>Value Prediction Network<\/em> (Oh et al., 2017) and MuZero (Schrittwieser et al., 2020). 
The AI builds an internal simulation of the world and tests action sequences in its head before moving.<br><em>The catch:<\/em> These mental simulations are computationally heavy and hard to use with offline data.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div data-wp-context=\"{ &quot;autoclose&quot;: false, &quot;accordionItems&quot;: [] }\" data-wp-interactive=\"core\/accordion\" role=\"group\" class=\"wp-block-accordion is-layout-flow wp-block-accordion-is-layout-flow\">\n<div data-wp-class--is-open=\"state.isOpen\" data-wp-context=\"{ &quot;id&quot;: &quot;accordion-item-8&quot;, &quot;openByDefault&quot;: false }\" data-wp-init=\"callbacks.initAccordionItems\" data-wp-on-window--hashchange=\"callbacks.hashChange\" class=\"wp-block-accordion-item is-layout-flow wp-block-accordion-item-is-layout-flow\">\n<h4 class=\"wp-block-accordion-heading\"><button aria-expanded=\"false\" aria-controls=\"accordion-item-8-panel\" data-wp-bind--aria-expanded=\"state.isOpen\" data-wp-on--click=\"actions.toggle\" data-wp-on--keydown=\"actions.handleKeyDown\" id=\"accordion-item-8\" type=\"button\" class=\"wp-block-accordion-heading__toggle\"><span class=\"wp-block-accordion-heading__toggle-title\">Q-Chunking<\/span><span class=\"wp-block-accordion-heading__toggle-icon\" aria-hidden=\"true\">+<\/span><\/button><\/h4>\n\n\n\n<div inert aria-labelledby=\"accordion-item-8\" data-wp-bind--inert=\"!state.isOpen\" id=\"accordion-item-8-panel\" role=\"region\" class=\"wp-block-accordion-panel is-layout-flow wp-block-accordion-panel-is-layout-flow\">\n<p>Introduced in <em>Reinforcement Learning with Action Chunking<\/em> (Li et al., 2026). It evaluates raw action sequences directly and uses expressive flow-matching models to handle complex data distributions. 
<br><em>The catch:<\/em> Its best-of-N sampling adds computational cost at inference time.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Current_Landscape\"><\/span>The Current Landscape<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p id=\"p-rc_c5ba560ff95aaf8a-89\">Every proposed method requires a compromise. You have to sacrifice stability with HRL, flexibility with Factorized Critics and Movement Primitives, efficiency with Latent Planning, or take on added sampling costs with Q-Chunking. The robotics community is actively testing all of these approaches to find the best balance for real-world deployments.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_Bottlenecks_and_the_Future\"><\/span>The Bottlenecks and the Future<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>We have seen that merging diffusion models, reinforcement learning, and action chunking creates a highly capable robotic brain. But we are not quite ready to put humanoid robots in every home.<\/p>\n\n\n\n<p id=\"p-rc_9b56ed931f0125f0-210\">The biggest roadblock right now is <strong>speed<\/strong>. Diffusion models are inherently slow because turning random noise into a crisp action plan requires dozens of sequential denoising steps. While action chunking buys some time by planning ahead, running these heavy computations on board a physical robot is taxing. Furthermore, methods like Q-Chunking tackle the RL exploration problem by evaluating multiple possible futures at once, which adds a significant computational burden. Finally, there is the <strong>reaction problem<\/strong>. Chunking is fantastic for smooth, predictable tasks like folding clothes, but it is terrible for catching a falling glass. 
Even though real-world deployment uses a receding horizon and plays out only part of each plan, chunked execution still doesn&#8217;t fit neatly into standard reinforcement learning formulations.<\/p>\n\n\n\n<p id=\"p-rc_9b56ed931f0125f0-211\">While I can&#8217;t predict the future (unlike latent dynamics models), my guess is that the community will be tackling these exact problems for the foreseeable future. To solve the speed issue, researchers are pushing towards faster sampling, exploring techniques like Consistency Models that aim to compress the long denoising process down to one or a few steps. To fix the reaction problem, a major milestone will be creating models that automatically determine their own chunk boundaries rather than relying on manual guessing and tuning. Such a model would predict long sequences when the coast is clear, and instantly switch to short, rapid predictions when the environment gets chaotic.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>To wrap things up, we are looking at a fundamental shift in how we build brains for robots. For years, robot learning was stuck trying to squeeze messy, unpredictable real-world data into strict, deterministic, one-step boxes. The result was often an AI that compromised, averaged out the best solutions, and ended up crashing the car. Diffusion policies, paired with action chunking, flip that entire script.<\/p>\n\n\n\n<p>By moving from rigid, single-step control to generative, sequence-based behavior, we are teaching robots to navigate the physical world the same way image generators navigate pixels. They no longer average their options or stumble forward millisecond by millisecond. Instead, they confidently pick one valid path and commit to a cohesive, multi-step plan, smoothly filtering out the noise along the way. 
We still have to iron out the latency and compute issues, but the resulting leap in physical dexterity is undeniable. The generative AI revolution has finally grown hands, and it is going to be exciting to see what it builds next.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous post, we explored the rise of diffusion models in generative AI. The main takeaway was that instead of acting as simple &#8220;expected value&#8221; predictors, like classic neural networks do, diffusion models can map out complex, arbitrary distributions. This mathematical property is exactly why they are able to generate crisp, detailed images rather [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[13],"tags":[20,22,21],"class_list":["post-607","post","type-post","status-publish","format-standard","hentry","category-blog","tag-diffusion-models","tag-generative-ai","tag-neural-networks"],"_links":{"self":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts\/607","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/comments?post=607"}],"version-history":[{"count":25,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts\/607\/revisions"}],"predecessor-version":[{"id":694,"href":"https:\/\/florentinudrea.net\/index.php\/wp-js
on\/wp\/v2\/posts\/607\/revisions\/694"}],"wp:attachment":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/media?parent=607"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/categories?post=607"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/tags?post=607"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}