{"id":572,"date":"2026-04-17T16:19:49","date_gmt":"2026-04-17T16:19:49","guid":{"rendered":"https:\/\/florentinudrea.net\/?p=572"},"modified":"2026-04-17T16:22:53","modified_gmt":"2026-04-17T16:22:53","slug":"diffusion-models-the-secret-sauce-of-generative-ai","status":"publish","type":"post","link":"https:\/\/florentinudrea.net\/index.php\/2026\/04\/17\/diffusion-models-the-secret-sauce-of-generative-ai\/","title":{"rendered":"Diffusion Models: The Secret Sauce of Generative AI"},"content":{"rendered":"\n<p>You\u2019ve probably seen the absolute explosion of AI-generated art over the last few years. Whether it\u2019s a hyper-realistic image of a dog riding a skateboard on Mars or a stunning digital painting, the internet is flooded with this stuff (especially my Instagram reels). The engine under the hood of all these incredible image generators (like Midjourney, DALL-E, Stable Diffusion and, more recently, NanoBanana) is something called a <strong>diffusion model<\/strong>. They work, quite literally, by taking an image full of noise and slowly &#8220;denoising&#8221; it step-by-step until the image appears.<\/p>\n\n\n\n<p>But why did diffusion completely take over the image generation world? The secret sauce is in how it handles complex distributions. Older image generators were tricky to train and often produced blurry or weirdly distorted results when asked to do something complicated. By design, classic neural networks are <em>expected value approximators<\/em>: they learn the mean of the distribution they are trying to model. 
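<\/p>

<p>You can verify this averaging behavior with a few lines of numpy (a toy example of my own, not any particular model): when one input has two equally valid targets, the constant prediction that minimizes mean squared error is their midpoint, not either valid answer.<\/p>

```python
import numpy as np

# two equally valid 'ground truth' outputs for the same input
y1 = np.array([0.0, 1.0])
y2 = np.array([1.0, 0.0])

# scan constant predictions p and measure the total squared error
candidates = np.linspace(-1.0, 2.0, 301)
losses = [((p - y1) ** 2).sum() + ((p - y2) ** 2).sum() for p in candidates]
best = candidates[int(np.argmin(losses))]
print(best)  # 0.5: the blurry 'average', not either of the valid answers
```

<p>
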
This is why, previously, when a face was AI-generated, it looked very generic, almost like a blend of all the face images the model had seen: <strong>that\u2019s because it was<\/strong>. Outputting the mean of all the faces seen in training is the way to minimize the global loss.<\/p>\n\n\n\n<p>Diffusion models changed the game. They are stable to train, can capture fine detail, and (most importantly) <strong>they handle multiple valid options<\/strong>. If you ask a diffusion model for a picture of a house, it doesn&#8217;t just mathematically average every house it has ever seen into one blurry, generic blob. It commits to <em>one<\/em> distinct, high-quality sample.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"400\" src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-7.png\" alt=\"\" class=\"wp-image-600\" srcset=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-7.png 800w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-7-300x150.png 300w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-7-768x384.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption class=\"wp-element-caption\">Diffusion process. Source: geeksforgeeks.org<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Unpacking the Core Concept: What is a Diffusion Neural Network?<\/h2>\n\n\n\n<p>To understand the broader architecture, we have to look at why <strong>Diffusion Neural Networks<\/strong> are fundamentally different from the &#8220;input-in, output-out&#8221; models we\u2019ve used previously. In a standard deep learning model, the goal is to find a direct mapping function. 
If you want to generate a specific output, the network tries to jump from the input to the final result in a single forward pass. This usually works for one-to-one tasks, like classification (where the same input should lead to the same output: an image of a flower should always be classified as a flower). In one-to-many problems, like <strong>Generative AI<\/strong>, where &#8216;generating a flower&#8217; has multiple correct outputs, the mapping mechanism of standard deep learning models falls apart.<\/p>\n\n\n\n<p><strong>The Score-Based Perspective<\/strong> Diffusion models change the objective from <strong>mapping<\/strong> to <strong>gradient following<\/strong>. Instead of learning what a &#8220;perfect&#8221; data point looks like, the network learns the <strong>score function<\/strong>: the gradient of the log-probability of the data.<\/p>\n\n\n\n<p>Think of the data manifold (the collection of all valid, high-quality outcomes) as a series of mountain peaks. In this metaphor, the higher the elevation, the more &#8220;realistic&#8221; or &#8220;correct&#8221; the data is. The valleys below represent the infinite ways to produce noise or nonsensical results. A <strong>traditional model<\/strong> tries to &#8220;teleport&#8221; from an input directly onto a peak. However, if the training data contains two different peaks (like a car turning left and a car turning right), the model\u2019s math forces it to aim for the &#8220;average&#8221; coordinates between them. It tries to land in the middle of the two mountains, only to realize there is no ground there. A <strong>diffusion model<\/strong> doesn&#8217;t try to leap to the summit in one go. 
Instead, it learns the <strong>score function<\/strong>: a mathematical &#8220;map of the slopes.&#8221; It studies the entire landscape and learns which way is &#8220;up&#8221; from any given point in the fog.<\/p>\n\n\n\n<p><strong>Handling the &#8220;One-to-Many&#8221; Problem<\/strong> The most significant advantage of this architecture is its inherent ability to handle <strong>one-to-many<\/strong> relationships. In standard regression, if the input <math data-latex=\"X\"><semantics><mi>X<\/mi><annotation encoding=\"application\/x-tex\">X<\/annotation><\/semantics><\/math> could result in either <math data-latex=\"Y_1\"><semantics><msub><mi>Y<\/mi><mn>1<\/mn><\/msub><annotation encoding=\"application\/x-tex\">Y_1<\/annotation><\/semantics><\/math> or <math data-latex=\"Y_2\"><semantics><msub><mi>Y<\/mi><mn>2<\/mn><\/msub><annotation encoding=\"application\/x-tex\">Y_2<\/annotation><\/semantics><\/math>, the model&#8217;s loss function (like Mean Squared Error) will force it to predict <math data-latex=\"\\frac{Y_1+Y_2}{2}\"><semantics><mfrac><mrow><msub><mi>Y<\/mi><mn>1<\/mn><\/msub><mo>+<\/mo><msub><mi>Y<\/mi><mn>2<\/mn><\/msub><\/mrow><mn>2<\/mn><\/mfrac><annotation encoding=\"application\/x-tex\">\\frac{Y_1+Y_2}{2}<\/annotation><\/semantics><\/math>. Diffusion networks avoid this because they are <strong>probabilistic samplers<\/strong>. They don&#8217;t predict a value; they model a distribution. By starting from a different noise each time, the same network can walk its way up different mountains (valid regions of the data). 
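<\/p>

<p>To make the &#8220;map of the slopes&#8221; concrete, here\u2019s a self-contained toy (my own example, not from any library): for a 1-D mixture of two Gaussian peaks the score function can be written analytically, and simple Langevin dynamics (noisy gradient ascent on the log-probability) carries samples up to one peak or the other, never to the empty &#8220;average&#8221; at zero.<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # gradient of log p(x) for p = 0.5*N(-4, 1) + 0.5*N(+4, 1),
    # computed stably by factoring out the larger exponent
    la = -0.5 * (x + 4) ** 2
    lb = -0.5 * (x - 4) ** 2
    m = np.maximum(la, lb)
    a, b = np.exp(la - m), np.exp(lb - m)
    return (-(x + 4) * a - (x - 4) * b) / (a + b)

# Langevin dynamics: noisy gradient ascent on log p
x = rng.normal(0.0, 3.0, size=500)  # start scattered 'in the fog'
step = 0.05
for _ in range(2000):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

# the walkers settle near the peaks at -4 and +4, not near the average 0
print(float(np.abs(x).mean()))
```

<p>
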
It treats the complexity of the real world not as noise to be averaged out, but as a landscape of valid peaks to be explored.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/federicosarrocco.com\/images\/flow-matching\/flow_matching.gif\" alt=\"Example of the diffusion process going towards the targets (peaks in our analogy).\" style=\"width:603px;height:auto\"\/><figcaption class=\"wp-element-caption\">Example of the diffusion process going towards the targets (peaks in our analogy).<\/figcaption><\/figure>\n<\/div>\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Speeding up Diffusion Models<\/h2>\n\n\n\n<h4 class=\"wp-block-heading\">Denoising Diffusion Probabilistic Models (DDPMs): The Stochastic Random Walk<\/h4>\n\n\n\n<p>DDPMs are the architecture that started the current generative AI boom. Mathematically, their noising and denoising processes are described by a <strong>Stochastic Differential Equation (SDE)<\/strong>.<\/p>\n\n\n\n<p>In a DDPM, the forward process destroys an image by injecting Gaussian noise step-by-step. The neural network is trained to estimate the &#8220;score&#8221; (the gradient of the log probability) to reverse this process.<\/p>\n\n\n\n<p>Because the noise injection is random (stochastic), <strong>the mathematical path from a clean image to pure noise is<\/strong> essentially Brownian motion: it is <strong>jagged, curved, and highly chaotic<\/strong>. When the neural network tries to reverse this process, it is forced to navigate this curved path.<\/p>\n\n\n\n<p>If the mathematical &#8220;step size&#8221; it takes is too large, it will overshoot the curve and step off the data manifold, resulting in a distorted, unrecognizable image. Therefore, <strong>the model is trapped into taking hundreds or thousands of tiny steps<\/strong> <strong>(neural network evaluations)<\/strong> to safely traverse the curved probability space. 
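<\/p>

<p>The forward (noising) half of this recipe at least has a convenient closed form: you can jump straight to any noise level without applying the tiny steps one by one. A minimal sketch (the linear schedule values below are the commonly cited DDPM defaults, an assumption on my part):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# linear beta schedule with the standard DDPM defaults (assumed here)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    # closed-form forward process: x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(10000)        # a stand-in 'image'
xT = q_sample(x0, T - 1)   # after 1000 steps: indistinguishable from noise
print(float(alpha_bar[-1]))
```

<p>Reversing this corruption is the part that costs hundreds of network evaluations: the closed-form shortcut only works in the forward direction, where we already know the answer.<\/p>

<p>
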
This is what makes traditional diffusion so computationally expensive and slow.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Flow Matching: Deterministic Vector Fields<\/h4>\n\n\n\n<p>Flow Matching abandons the stochastic random walk entirely. Instead, it frames the generation process as an <strong>Ordinary Differential Equation (ODE)<\/strong>. It treats the noise and the target data not as a sequence of corrupted images, but as two distinct distributions of particles.<\/p>\n\n\n\n<p>Imagine a solid object made of sand (the data distribution). Now imagine you put a tiny explosive in the middle of it and detonate it (noising): it will become a cloud of sand grains (the Gaussian noise distribution). If you account for air turbulence, the direction in which each grain of sand was blown away is noisy and unpredictable. If you ignore the turbulence, on the other hand, <strong>every grain of sand has a distinct direction<\/strong> in which it was blown away. That is the main difference between diffusion models and flow matching models: the former model noisy, stochastic paths; the latter model clean, deterministic paths.<\/p>\n\n\n\n<p>Flow Matching trains a neural network to act as a <strong>Velocity Vector Field<\/strong>, denoted as <math data-latex=\"v_t(x)\"><semantics><mrow><msub><mi>v<\/mi><mi>t<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">v_t(x)<\/annotation><\/semantics><\/math>. 
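<\/p>

<p>Training this field boils down to plain regression. A minimal sketch (my own toy illustration, using the common straight-line interpolation between a noise sample and a data sample):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x1):
    # sample one training example for a velocity network v_theta:
    # a point x_t on the straight path from noise x0 to data x1,
    # together with the regression target (the constant velocity)
    x0 = rng.normal(size=x1.shape)   # noise endpoint
    t = rng.uniform()                # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1       # point on the interpolation path
    target_v = x1 - x0               # velocity the network should output
    return xt, t, target_v

xt, t, v = flow_matching_pair(np.array([2.0, -1.0]))
# sanity check: following v for the remaining time lands on the data point
print(xt + (1 - t) * v)
```

<p>In a real trainer, the network\u2019s prediction at (xt, t) would simply be regressed onto target_v with mean squared error.<\/p>

<p>
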
The network does not try to guess &#8220;how much noise to remove.&#8221; Instead, it looks at a particle at position <math data-latex=\"x\"><semantics><mi>x<\/mi><annotation encoding=\"application\/x-tex\">x<\/annotation><\/semantics><\/math> at time <math data-latex=\"t\"><semantics><mi>t<\/mi><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> and outputs a physical velocity vector: <em>&#8220;Move this particle exactly in this direction, at this speed.&#8221;<\/em><\/p>\n\n\n\n<p>Because Flow Matching uses ODEs, the trajectory from noise to data is <strong>deterministic<\/strong>, not random. More importantly, Flow Matching allows researchers to explicitly <em>design<\/em> the probability path. Instead of being forced to follow the chaotic curves of a DDPM, engineers can define simple, direct flows. The neural network simply learns to match this optimal vector field.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"454\" src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-1024x454.png\" alt=\"\" class=\"wp-image-595\" srcset=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-1024x454.png 1024w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-300x133.png 300w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-768x341.png 768w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-1536x681.png 1536w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-6-2048x908.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Flow Matching vs Diffusion.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 570 \/ 360;\" width=\"570\" controls 
src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/flowvsdiff_video.mp4\"><\/video><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/x.com\/alec_helbling\">@alec_helbling<\/a> on X (<a href=\"https:\/\/x.com\/alec_helbling\/status\/1924451851316932758\">https:\/\/x.com\/alec_helbling\/status\/1924451851316932758<\/a>)<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Rectified Flows: From Paths to Highways<\/h4>\n\n\n\n<p>If Flow Matching is the general framework for mapping one distribution to another using vector fields, <strong>Rectified Flows<\/strong> (and a closely related concept called Optimal Transport) are a specific, highly optimized application of that framework.<\/p>\n\n\n\n<p>The core idea behind rectified flows is that <strong>the shortest distance between two points is a straight line<\/strong>. Rectified Flows force the paths between the noise particles and the data particles to be as straight and untangled as mathematically possible. If a vector field path is highly curved, you must take many tiny computational steps to follow it accurately. <strong>If you take a large step on a curved path, you shoot off into empty space<\/strong> (resulting in a distorted image).<\/p>\n\n\n\n<p>Because the paths are enforced to be straight lines, <strong>the AI can take massive computational steps<\/strong> (using simple Euler integration) <strong>without flying off the path<\/strong>. 
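<\/p>

<p>You can see the payoff with a toy integrator (my own illustration, not from any paper): a straight vector field is integrated exactly by a single Euler step, while a curved field with the same endpoints needs many small steps to stay on track.<\/p>

```python
import numpy as np

def euler(v, x, n_steps):
    # integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps Euler steps
    t, h = 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        x = x + h * v(x, t)
        t = t + h
    return x

# straight ('rectified') field: constant velocity, the path is a line
def straight(x, t):
    return np.array([1.0, 1.0])

# curved field: the true path is (t, sin(pi*t)), which ends at (1, 0)
def curved(x, t):
    return np.array([1.0, np.pi * np.cos(np.pi * t)])

print(euler(straight, np.zeros(2), 1))    # one big step is exact: (1, 1)
print(euler(curved, np.zeros(2), 1))      # one big step overshoots badly
print(euler(curved, np.zeros(2), 100))    # many small steps recover (1, ~0)
```

<p>Same compute budget, wildly different accuracy: that is exactly why straightening the paths buys sampling speed.<\/p>

<p>
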
This allows models like Stable Diffusion 3 (which uses Rectified Flows) to generate high-quality images in just 1 to 4 steps instead of 50.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"720\" style=\"aspect-ratio: 1152 \/ 720;\" width=\"1152\" controls src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/Alec_Helbling_-_Flow-based_generative_models_trained_with_flow_matching_tend_to_learn_SVem7U.mp4\"><\/video><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/x.com\/alec_helbling\">@alec_helbling<\/a> on X (https:\/\/x.com\/alec_helbling\/status\/2012526645815476625)<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Consistency Models: Skipping the Hike<\/strong><\/h4>\n\n\n\n<p>Consistency Models take a completely different approach to speeding up the inference. Instead of trying to build a faster path or a straighter path, <strong>they try to build a teleporter<\/strong>. If you are anywhere on a specific path from noise to data, <strong>you shouldn&#8217;t have to follow the rest of the path to know where it ends<\/strong>.<\/p>\n\n\n\n<p>The model is trained to enforce a strict mathematical rule: <math data-latex=\"f(x_t, t) = f(x_{t'}, t')\"><semantics><mrow><mi>f<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>f<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>x<\/mi><msup><mi>t<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><\/msub><mo separator=\"true\">,<\/mo><msup><mi>t<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">f(x_t, t) = f(x_{t&#8217;}, t&#8217;)<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Let <math 
data-latex=\"x\"><semantics><mi>x<\/mi><annotation encoding=\"application\/x-tex\">x<\/annotation><\/semantics><\/math> be an image, and <math data-latex=\"t\"><semantics><mi>t<\/mi><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> be the time-step (or noise level). <\/li>\n\n\n\n<li>If you are at time <math data-latex=\"t\"><semantics><mi>t<\/mi><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> (very noisy) and time <math data-latex=\"t'\"><semantics><msup><mi>t<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><annotation encoding=\"application\/x-tex\">t&#8217;<\/annotation><\/semantics><\/math> (slightly noisy) <em>on the exact same trajectory<\/em>, the network must output the exact same clean image (<math data-latex=\"x_0\"><semantics><msub><mi>x<\/mi><mn>0<\/mn><\/msub><annotation encoding=\"application\/x-tex\">x_0<\/annotation><\/semantics><\/math>).<\/li>\n<\/ul>\n\n\n\n<p>During training, the system looks at two adjacent points on a diffusion trajectory. It penalizes the neural network if it predicts a different final image for point A than it does for point B. The result? Once trained, <strong>you can hand the model pure noise<\/strong> (time <math data-latex=\"T\"><semantics><mi>T<\/mi><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math>), and it will evaluate the consistency function to <strong>instantly spit out the clean image<\/strong> (time <math data-latex=\"0\"><semantics><mn>0<\/mn><annotation encoding=\"application\/x-tex\">0<\/annotation><\/semantics><\/math>) in a <strong>single step<\/strong>. 
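<\/p>

<p>Here\u2019s a toy numerical check of that self-consistency rule (my own illustration on a straight noising trajectory; the &#8220;oracle&#8221; below is handed the trajectory\u2019s noise endpoint, which is precisely the information a trained consistency network has to learn to infer from (x_t, t) alone):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = np.array([1.0, -2.0])   # the clean data point
eps = rng.normal(size=2)     # the pure-noise endpoint of this trajectory

def x_at(t):
    # straight noising trajectory: x0 at t=0, pure noise eps at t=1
    return (1 - t) * x0 + t * eps

def f(x, t):
    # oracle consistency function for this trajectory: it maps any
    # point (x_t, t) with t < 1 back to the same clean endpoint x0
    return (x - t * eps) / (1 - t)

# two very different points on the same trajectory, one single answer
print(f(x_at(0.2), 0.2), f(x_at(0.9), 0.9))
```

<p>
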
However, if you want higher quality, you can still chain a few steps together (e.g., jump from <math data-latex=\"T\"><semantics><mi>T<\/mi><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> to <math data-latex=\"T\/2\"><semantics><mrow><mi>T<\/mi><mi>\/<\/mi><mn>2<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">T\/2<\/annotation><\/semantics><\/math>, inject a tiny bit of noise, then jump to <math data-latex=\"0\"><semantics><mn>0<\/mn><annotation encoding=\"application\/x-tex\">0<\/annotation><\/semantics><\/math>) to refine the details.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Latent Diffusion: Shrinking the Universe<\/strong><\/h4>\n\n\n\n<p>If standard diffusion models are the engine of generative AI, Latent Diffusion Models (LDMs) are what made that engine small enough to actually fit in your garage. <strong>Denoising a massive, high-resolution image pixel-by-pixel is computationally agonizing<\/strong>. LDMs bypass this by doing the heavy lifting in a compressed mathematical pocket dimension. If a neural network has to calculate the noise for every single pixel in a 4K image across 50 to 100 denoising steps, <strong>it requires a server farm of GPUs running at maximum capacity<\/strong>. It is simply too much data to process efficiently.<\/p>\n\n\n\n<p>Instead of working directly in pixel space, the system uses an autoencoder to compress the image into a <strong>tiny, highly dense mathematical representation called the &#8220;latent space&#8221;.<\/strong> All of the noise injection and step-by-step denoising happens entirely within this compressed space.<\/p>\n\n\n\n<p>Only at the very end, once the latent representation is perfectly clean, does a decoder blow it back up into the pixels we actually see on our screens. 
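<\/p>

<p>The savings are easy to quantify with back-of-the-envelope arithmetic (the 8x spatial downsampling and 4 latent channels below are assumptions based on the commonly described Stable Diffusion v1 autoencoder; swap in your own model\u2019s numbers):<\/p>

```python
# pixel space vs. the latent space of an SD-v1-style autoencoder
h, w, c = 512, 512, 3            # pixel-space image
lh, lw, lc = h // 8, w // 8, 4   # corresponding latent tensor

pixels = h * w * c      # values the denoiser would touch per step
latents = lh * lw * lc  # values it actually touches in latent space
print(pixels, latents, pixels // latents)  # 786432 16384 48
```

<p>Roughly 48 times fewer values per denoising step, repeated over every one of the 50+ steps.<\/p>

<p>
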
This clever architectural bypass reduced the compute cost so drastically that it allowed massive models (like Stable Diffusion) to run on standard consumer graphics cards, effectively democratizing generative AI overnight.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"767\" height=\"171\" src=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-1.png\" alt=\"\" class=\"wp-image-581\" srcset=\"https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-1.png 767w, https:\/\/florentinudrea.net\/wp-content\/uploads\/2026\/04\/image-1-300x67.png 300w\" sizes=\"auto, (max-width: 767px) 100vw, 767px\" \/><\/figure>\n<\/div>\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Beyond the Canvas: Diffusion Model Applications<\/h2>\n\n\n\n<p>The true power of the diffusion architecture lies in a simple, profound realization: in data science, &#8220;noise&#8221; is not just a grid of static pixels; noise can be anything. <strong>If you can represent a complex real-world concept as numbers, you can corrupt it with noise, and more importantly, you can train a diffusion model to clean it up<\/strong>. This makes diffusion a universal engine for mapping chaos back to structure, building directly on the foundational mathematics of nonequilibrium thermodynamics introduced in early denoising diffusion probabilistic models.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Escaping the 2D Canvas (Video and 3D)<\/strong><\/h4>\n\n\n\n<p>The immediate next step was adding the dimensions of time and depth. When generating video, models like OpenAI&#8217;s Sora do not just generate a sequence of independent images, as that would result in a flickering, morphing mess. Instead, they treat the video as a spatiotemporal volume of noise encompassing width, height, and time. 
The model denoises the entire block simultaneously, ensuring that a character&#8217;s jacket stays the exact same shape and color from the first frame to the very last (<a href=\"https:\/\/openai.com\/index\/video-generation-models-as-world-simulators\/\">https:\/\/openai.com\/index\/video-generation-models-as-world-simulators\/<\/a>).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Painting with Soundwaves<\/strong><\/h4>\n\n\n\n<p>The exact same math used to draw a picture can be used to compose a symphony. Instead of denoising a photograph, audio diffusion frameworks like AudioLDM denoise a spectrogram, which is a visual representation of sound frequencies over time. By starting with visual static and slowly pulling out the distinct frequency bands, the model can generate high-fidelity sound effects, hyper-realistic text-to-speech, or fully arranged musical tracks (<a href=\"https:\/\/arxiv.org\/abs\/2301.12503\">https:\/\/arxiv.org\/abs\/2301.12503<\/a>).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>The Scientific Frontier: Health and Material Science<\/strong><\/h4>\n\n\n\n<p>This is where the technology leaves the media realm entirely and becomes a tool for scientific discovery. In biology, researchers are using models like RFdiffusion to invent entirely novel proteins that do not exist in nature. Instead of looking at pixels, the model looks at the 3D spatial coordinates of amino acids. It starts with a chaotic cloud of atomic noise and slowly denoises it into a stable, highly complex protein structure specifically designed to bind to a target, like a cancer cell or a virus (<a href=\"https:\/\/www.biorxiv.org\/content\/10.1101\/2022.12.09.519842v1\">https:\/\/www.biorxiv.org\/content\/10.1101\/2022.12.09.519842v1<\/a>). 
We are effectively using diffusion to print custom medicine.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Discrete Text<\/strong><\/h4>\n\n\n\n<p>Right now, autoregressive Large Language Models like ChatGPT rule the text generation world. They work by predicting the next word sequentially from left to right. But diffusion is coming for the throne. Researchers have been developing discrete diffusion models for text, such as Diffusion-LM. Instead of predicting one word at a time, these models start with an entire paragraph of masked, gibberish tokens and denoise the entire text at the exact same time (<a href=\"https:\/\/arxiv.org\/abs\/2205.14217\">https:\/\/arxiv.org\/abs\/2205.14217<\/a>). This allows the AI to perform complex, out-of-order reasoning and bidirectional editing, which is proving to be a massive leap for tasks like coding and logic.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>The Physical World<\/strong><\/h4>\n\n\n\n<p>Finally, if you can denoise an image, a soundwave, or a protein molecule, you can denoise physical motion. Robots do not output pixels; they output physical trajectories: a sequence of motor commands, joint angles, and velocities over time. The conceptual leap here is incredibly elegant: we simply swap out the image for an action chunk. By starting with random, jagged motor noise, a diffusion policy can iteratively smooth it out until it becomes a highly precise, physically viable path for a robotic arm to follow (<a href=\"https:\/\/arxiv.org\/abs\/2303.04137\">https:\/\/arxiv.org\/abs\/2303.04137<\/a>).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Challenges and Limitations<\/h2>\n\n\n\n<p>Of course, no architecture is perfect, and diffusion models come with their own set of headaches. The most glaring issue is <strong>efficiency<\/strong>. 
Even with the &#8220;highways&#8221; of Rectified Flows, these models are computationally &#8220;heavy&#8221; compared to their autoregressive cousins (like ChatGPT). Because a diffusion model often has to process the entire output space at once rather than one token or pixel at a time, it struggles with <strong>long-context tasks<\/strong>.<\/p>\n\n\n\n<p>In the world of text, for example, diffusion models still grapple with a &#8220;shallow dependency&#8221; problem: they are great at generating a coherent paragraph but can lose the plot over a long essay because they lack the deep, sequential reasoning of standard Transformers.<\/p>\n\n\n\n<p>There\u2019s also the <strong>controllability gap<\/strong>: while a diffusion model can give you a stunning forest, getting it to put a specific bird on a specific branch with 100% consistency every time is still a work in progress. For professionals in film or design who need &#8220;pixel-perfect&#8221; control, the &#8220;probabilistic&#8221; nature of diffusion can sometimes feel a bit too much like rolling dice.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Diffusion models have evolved from a clever mathematical trick for cleaning up grainy photos into a <strong>universal architect for complex data<\/strong>. By trading the &#8220;one-shot&#8221; guesswork of older models for a patient, step-by-step climb up the probability mountain, they\u2019ve unlocked a level of creativity and precision that was previously the stuff of sci-fi.<\/p>\n\n\n\n<p>Whether it\u2019s shrinking the compute universe with <strong>Latent Diffusion<\/strong>, streamlining the path with <strong>Rectified Flows<\/strong>, or &#8220;teleporting&#8221; to the finish line with <strong>Consistency Models<\/strong>, the technology is moving at incredible speed. 
We are no longer just teaching machines to recognize our world; we are giving them a map of the &#8220;slopes&#8221; so they can help us build entirely new ones, from the pixels on your feed to the proteins in your medicine and the motions of the robots in our future.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You\u2019ve probably seen the absolute explosion of AI-generated art over the last few years. Whether it\u2019s a hyper-realistic image of a dog riding a skateboard on Mars or a stunning digital painting: the internet is flooded with this stuff (especially my Instagram reels). The engine under the hood of all these incredible image generators (like [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[13],"tags":[20,22,21],"class_list":["post-572","post","type-post","status-publish","format-standard","hentry","category-blog","tag-diffusion-models","tag-generative-ai","tag-neural-networks"],"_links":{"self":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts\/572","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/comments?post=572"}],"version-history":[{"count":16,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/posts\/572\/revisions"}],"predecessor-version":[{"id":604,"href":"https:\/\/florentinudrea.net\/index.php\/wp-
json\/wp\/v2\/posts\/572\/revisions\/604"}],"wp:attachment":[{"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/media?parent=572"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/categories?post=572"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/florentinudrea.net\/index.php\/wp-json\/wp\/v2\/tags?post=572"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}