I have been spending a lot of time on one-step generative models lately, MeanFlow in particular and the family it belongs to. This post is how I now think about them: how they work, where the math comes from, and how each one answers a limitation of the one before it.
Diffusion and flow models produce images, audio, and video that are nearly indistinguishable from real data, but a single sample takes hundreds of sequential network evaluations. A 1024×1024 image from SDXL at 50 steps takes several seconds on an A100, while a one-step model produces the same image in tens of milliseconds. That is fine for offline synthesis but unusable for anything interactive, and most of that gap has closed in the last two years.
Working backwards from the goal: what does a network need to learn to generate in one step?
The post leans on a small set of symbols, collected here for reference. Each is also defined where it first appears, so you can skip this and refer back.
| Symbol | Meaning |
|---|---|
| $x_0,\ x_1$ | a clean data sample and a Gaussian noise sample, the two ends of the trajectory |
| $t$ | how noisy the sample is, from $t=0$ (clean data) to $t=1$ (pure noise) |
| $s,\ r$ | other time values, the destination of a jump (a cleaner level than $t$) |
| $z_t$ | the partly-noised sample at time $t$, somewhere between $x_0$ and $x_1$ (likewise $z_s$, $z_r$ at other times) |
| $v(z_t, t)$ | velocity: the direction and speed to move at $z_t$ to follow the trajectory |
| $\bar{u}(z_t, r, t)$ | average velocity from $t$ to $r$: the displacement divided by the elapsed time |
| $\theta$ | the trained network's weights; $f_\theta$ is a network with weights $\theta$, and $\theta^-$ is a slow-moving copy of them |
Generation as transport
Generating a sample (say, an image of a dog) means moving probability mass from a noise distribution to the data distribution. This transport can be stochastic (a forward and reverse diffusion SDE) or deterministic (an ODE); every method in this post lives in the deterministic regime, on a path called the probability flow ODE (PF-ODE) [11].1 Parameterise it by time $t$ running from $t=1$ (pure Gaussian noise) to $t=0$ (clean data). Write $z_t$ for the point on the trajectory at time $t$, and $x_0$ for the clean sample at $t=0$. Because it is deterministic, the path can be run forwards or backwards exactly.
Think of a soap bubble. Press it slowly and the surface deforms; every point on the film follows a smooth path. Release the pressure and it snaps back along the exact same path. The PF-ODE is that elastic surface in probability space, and $z_t$ is one spot on the surface at time $t$.
Standard diffusion and flow models learn to estimate the PF-ODE velocity locally, then numerically solve the learned ODE step by step from noise to data. At each $z_t$ the network gives the speed and direction to move to keep heading toward the dog (or whatever target the trajectory points to). Apply that, take a step, repeat.
The core problem with step-by-step integration is that the local velocity at $z_t$ says nothing about where the trajectory ends up globally. The flow has to be followed closely, one small step at a time, or it drifts off course and ends up somewhere wrong. Following it that closely is what makes it expensive.
Flow matching
Flow matching [1] is the foundation every one-step method builds on or borrows from. It learns a continuous-time flow that moves probability mass from one distribution to another. The source and destination can be any two distributions; unlike diffusion, which fixes the noisy end to Gaussian noise by construction, flow matching has no such constraint. The simplest useful case uses Gaussian noise as the source, which gives straight-line paths between noise and data:2
The velocity along this path is constant: $v = \frac{dz_t}{dt} = x_1 - x_0$. It can be read directly off a training pair $(x_0, x_1)$, with no integration or simulation. A neural network $v_\theta(z_t, t)$ is then trained to predict it at every $(z_t, t)$ by minimising the regression loss $\lVert v_\theta(z_t, t) - (x_1 - x_0) \rVert^2$, which is ordinary supervised learning against a known target.
Inference is still slow, though. Each individual path $x_0 \leftrightarrow x_1$ is a straight line, but the marginal velocity field is not. At any given noisy image $z_t$, many different clean images $x_0$ are plausible, and each candidate has its own straight-line velocity pointing in a slightly different direction. The network has to output the probability-weighted average of all those directions, which traces a curved path through image space. Following a curved path with only local velocity information takes many small steps.
Scrub the demo below to watch this happen: at $t=1$ the field points toward the centroid (no cluster has been chosen), and as $t$ decreases the weights concentrate and particles fan out toward different clusters. Try the candidates buttons (1, 3, 5) to see how a single cluster gives a uniform field while multiple clusters force curvature.
1 destination: field is uniform everywhere. A single step from any noise point lands exactly at x₀.
So flow matching has a simple training objective and slow inference, because the marginal field it learns is curved and has to be followed in small steps. The rest of this post is about recovering fast inference without giving up that training simplicity. There are two approaches:
- Jump to the endpoint directly. Learn a function that maps any trajectory point straight to $x_0$.
- Jump to any point, not just the endpoint. Learn a two-time function that goes from any $t$ to any $s < t$ in one step. This is the flow map idea, and the more general goal most of this post is about.
Consistency models
Consistency models [2] were the first method to generate competitive samples in a single step. Instead of learning a velocity and integrating it, they learn the endpoint map directly, a function $f_\theta$ that sends any point on a trajectory to that trajectory’s clean endpoint $x_0$:
Applied once from pure noise, this lands directly on a clean image. Drawn along one trajectory:
same trajectory: noise side clean side z_1 ─────── z_t ─────── z_s ─────── x_0 f_θ(z_1, 1) → x_0 f_θ(z_t, t) → x_0 f_θ(z_s, s) → x_0 f_θ(x_0, 0) → x_0
For this to work, the function needs two properties. The boundary condition: at $t = 0$ (or, more precisely, a small cutoff $\varepsilon$ near zero) the function must be the identity, $f_\theta(x_0, \varepsilon) = x_0$. A clean image maps to itself. Without this anchor the network can satisfy the rest of the loss by collapsing to a constant.
The consistency condition: any two points on the same PF-ODE trajectory must map to the same $x_0$. Two different noisy versions of one clean image, fed through the network, should produce identical outputs. Without it the function is only pinned down locally and never becomes globally coherent. The figure below shows this: every point along one trajectory maps to the same destination.
How is $f_\theta$ built so the boundary identity holds? The naive way is piecewise: set $f_\theta(x, t) = x$ at $t = \varepsilon$ and $f_\theta(x, t) = F_\theta(x, t)$ otherwise, with $F_\theta$ a free neural network. That works for the discrete-time loss but breaks under continuous-time training, since the function is no longer differentiable at $\varepsilon$.
The solution, which has stuck since, is to wire the boundary identity into the architecture with a skip / output split:
At $t = \varepsilon$ the formula collapses to $f_\theta(x, \varepsilon) = 1 \cdot x + 0 \cdot F_\theta = x$, so the boundary condition holds by construction and the loss never has to enforce it. And since $c_\text{skip}$, $c_\text{out}$, and $F_\theta$ are all differentiable in $t$, so is $f_\theta$, which is exactly what continuous-time training needs.
The specific functional forms for $c_\text{skip}$ and $c_\text{out}$ are inherited directly from the EDM (elucidating diffusion models) preconditioning [10].3
The two schedules cross as $t \to \varepsilon$: the skip weight $c_{\mathrm{skip}}(t)$ rises toward one while the trunk weight $c_{\mathrm{out}}(t)$ falls to zero, leaving $f_\theta \approx z_t$ at the clean end.
Enforcing consistency via self-distillation
Which trajectory a given $z_t$ belongs to is never directly observed, so the $(z_t, z_s)$ pairs that should agree cannot be enumerated. The workaround is to take two adjacent points on the same trajectory, $z_t$ and $z_{t-\Delta}$ a small step apart, and require their predictions to match:
The EMA copy $\theta^-$ updates slowly, typically $m \approx 0.99$, so the target moves at about 1% of the main network’s speed. If both sides of the loss moved together, they could chase each other into the constant solution from before; the stop-gradient breaks that symmetry, so gradients flow only through the left side and $\theta$ chases a target that drifts slowly underneath it. Same target network trick that stabilised DQN: when a network regresses toward a target derived from itself, the target has to stay still enough to aim at.
Getting the adjacent point $z_{t-\Delta}$ requires one step of the PF-ODE, and this is where the teacher/student framing becomes explicit. In consistency distillation (CD), a pretrained diffusion model is the teacher: it provides reliable one-step ODE moves onto the true trajectory, and the student learns to jump straight to $x_0$ from any point on them. In consistency training (CT) there is no teacher; the network produces its own one-step moves, which adds noise and makes training harder to stabilise. CD converges faster and produces better results; CT drops the teacher dependency but needs more careful engineering.
A note on language, since three different things in this post get called “teacher.” Data supervision is clean targets read straight from training pairs, like flow matching’s $v = x_1 - x_0$. A pretrained teacher is a separately trained external network, the kind used in CD. The EMA copy $\theta^-$ is the network’s own slow-moving lag of itself. That is exactly how CT and CD differ: CT has only the EMA copy, CD has both. From here on I’ll say “EMA copy” and “pretrained teacher” to keep them apart, since later methods lean on one or the other.
The discretisation curriculum
The choice of $\Delta$ hides a tradeoff. Divide the time axis into $N$ steps, with adjacent training pairs one step apart, so $\Delta = 1/N$.
Small $N$ means a large gap. The two points are far apart on the trajectory, so the signal is strong, but a large ODE step carries large discretisation error and $z_{t-\Delta}$ only approximately lands on the true trajectory. The network is then trained to match a target that is itself a bit wrong.
Large $N$ means a small gap. A tiny ODE step is nearly exact, so the targets are accurate, but the two points are so close that their predictions already agree and the gradient is too small to learn from.
Neither extreme works. The remedy is a curriculum: start small (strong signal, rough targets), then grow $N$ (weak signal, accurate targets), so the network learns a coarse consistency function and refines it. Slide $N$ in the demo below: at $N=1$ the Euler step from $z_t$ falls far off the true curve (large red gap, strong gradient); at $N=8$ it tracks the curve but barely moves the network.
N=1: one step covers the whole trajectory. The tangent step at z_t lands far from the true curve. Large training signal, noisy target.
Discrete-time vs continuous-time
What I described above is the discrete-time formulation: pick a grid of $N$ noise levels, define adjacent pairs on that grid, and run the curriculum on $N$. The grid exists only because the consistency condition cannot be enforced over a continuum, only at sampled pairs of points, and the curriculum on $N$ is there to manage the bias-variance tradeoff the grid introduces.
The continuous-time formulation removes the grid. Differentiating the consistency condition $f(z_t, t) = f(z_{t-\Delta}, t-\Delta)$ as $\Delta \to 0$ gives a PDE-style identity, $\partial_t f + v(z_t, t) \cdot \partial_z f = 0$ along the PF-ODE,4 which the loss enforces at sampled $(z_t, t)$ pairs with no adjacent point and no grid to schedule. sCT and sCD, the simplified continuous-time variants of CT and CD [9], use this and produce sharper results, since the finite-$\Delta$ bias is gone. The cost is a Jacobian-vector product (JVP) to compute $\partial_z f \cdot v$: forward-mode autodiff gives it in one modified forward pass without ever forming the full Jacobian.5 MeanFlow reuses the same machinery later.
Both formulations are hard to train in their basic form. iCT (improved consistency training) [3] closed much of the gap to diffusion with a set of stability fixes: pseudo-Huber losses, a lognormal noise schedule, and progressive discretisation step doubling.6 All of that tuning is needed because of how the training target is defined.
The training target here is always behavioural: it constrains what the network outputs at adjacent pairs of points, not what the underlying field should be. There is no ground truth for $f(z_t, t)$ independent of the network. The optimal function is defined only implicitly, through the consistency and boundary conditions, and the only way to learn it is to have the network agree with itself across adjacent pairs. That is inherently noisy and sensitive to hyperparameters.
A deeper limitation: consistency models are stuck on a single destination, the endpoint $x_0$. The signature is $f(z_t, t) = x_0$: given a point and its time, the network predicts the endpoint and nothing else. There is no way to ask it to jump to an intermediate point, so multi-step generation means running the network repeatedly and renoising between evaluations, a workaround that does not actually use the trajectory structure. But what if the jump function could land anywhere, not just $x_0$?
Consistency trajectory models: any-to-any jumps
CTM [4] generalises consistency models. Where a consistency function always jumps to the end of the PF-ODE trajectory, CTM lets it jump to any point along the way. This is the two-time function:
With this object a sampler can take large or small steps, land at any intermediate point on the trajectory, and compose multiple jumps to refine a generation. For this to be well-posed, the function has to satisfy the semigroup property. A jump from $t$ to some intermediate $u$ followed by a jump from $u$ to $s$ should give the same result as a direct jump from $t$ to $s$:
Drag the split point $u$ in the figure below to see this in action: the direct jump (top arc) and the two-leg composition (bottom arcs) always end at the same destination, wherever the split falls.
Drag the split point. Both routes (direct and two-step) always land at the same destination.
Training enforces this by sampling triples $(r, s, t)$ with $r < s < t$ and comparing the direct jump $G_\theta(z_t, t, r)$ against the composed two-step jump:
The flexible jump function makes multi-step generation more natural than in consistency models: chain calls with progressively smaller target times, no renoising needed. At the time of publication, CTM held the best single-step FID numbers (see the results table below).
Like consistency models, CTM still trains against a self-referential target. $G_{\theta^-}$ is the network evaluated at a slightly lagged copy of itself; there is no ground-truth two-time map independent of the network, and the only supervision comes from the model agreeing with itself across different decompositions. The richer semigroup structure makes this more stable than the consistency-model case, but the self-reference remains. CTM is also complicated in practice: random triples $(r, s, t)$, two separate network evaluations on the target side, and several components to coordinate.
Shortcut models
Shortcut models [5] trade CTM’s continuous triples for a discrete set of jump sizes and get a much simpler procedure in exchange. They condition the network on both the current noise level $t$ and the desired step size $d$, so it predicts where a jump of size $d$ from $z_t$ lands.
A plain flow matching network only knows “I am at noise level $t$.” A shortcut model also knows “and I want to travel a distance $d$ in one step.” That extra input lets it calibrate the prediction to the jump size, so the same network can take a big step or a small one at different points during generation.
The bootstrapping training procedure
How is this trained? The ground-truth $z_{t-d}$ cannot be computed directly without running the full ODE, so the target is built from smaller steps the network can already make: two half-steps from the (stop-gradient) EMA copy are composed into a single full-step target the student matches. The picture first, then the equations.
In equations, this is the semigroup property enforced discretely:
Training bootstraps upward from the smallest steps, where the half-step approximation is most accurate: one-step jumps first, then two-step, then four-step, each size built from the one below it. At inference any step count works, one for speed or many for quality.
Frans et al. draw the same construction with the actual loss notation overlaid:
Shortcut models enforce composition by directly comparing network outputs: no differentiation through the network, no JVP, just a fast per-step update. The tradeoff is that the discrete half-step approximation introduces small errors that compound across many composed steps, and like CTM and consistency models before it, the training target is still self-referential. The next two methods both reduce that self-reference, in opposite ways. Align Your Flow brings in external ground truth from a pretrained teacher; MeanFlow finds internal ground truth, an exact identity that lets the network supervise itself against quantities computable directly from data.
Align Your Flow: distilling the jump function
CTM and Shortcut both train the flow map $G_\theta(z_t, t, s) = z_s$ from scratch, which is what forces the curriculum schedules and EMA copies on the target side. Align Your Flow [8] avoids that by distilling the map from a pretrained diffusion teacher, whose ODE trajectories give the ground truth the from-scratch methods never had.
This fixes a blind spot in CTM. CTM trains the jump function so composed short jumps reproduce long ones, but it only ever checks the jumps against each other, never against a correct answer. The teacher supplies that answer: the student’s jumps are trained to trace real ODE trajectories, so internal consistency now rests on an external anchor instead of standing alone.
The paper also proves consistency models eventually get worse with more steps. Theorem 3.1: for an isotropic Gaussian, there exist consistency models arbitrarily close to optimal in $L_2$ for which adding sampling steps beyond some $N$ increases the Wasserstein-2 distance to the true distribution.7 Fig. 5 of the paper shows it empirically: multi-step CM sampling peaks around 2 steps, then degrades. The mechanism is renoising. CMs jump to clean and reinject Gaussian noise to get back onto the trajectory, and over many steps that noise drifts off the teacher’s PF-ODE, so errors compound.
Flow maps avoid this failure by construction, because they step directly between noise levels and never renoise. The paper does not prove they improve monotonically, but empirically they keep getting better with more steps, exactly where CMs fall apart.
How does the student actually enforce consistency against the teacher? AYF gives two losses, borrowing the fluid-dynamics distinction between Eulerian (fixed observer, watch the field go by) and Lagrangian (move along with the particle) frames. They differ in which time variable they perturb.
| Objective | What's varied | Why it works | Empirical role |
|---|---|---|---|
| EMD (Eulerian Map Distillation) | Endpoint $s$ fixed, perturb starting time $t$; check that $G_\theta(z_t, t, s)$ is invariant as $t$ moves along the teacher trajectory. | Generalises the continuous-time consistency and flow matching losses as the two limits of $s$; the structurally natural object to optimise. | Primary loss in all main results. |
| LMD (Lagrangian Map Distillation) | Starting point $t$ fixed, perturb endpoint $s$; check that $G_\theta(z_t, t, s)$ moves correctly as $s$ slides along the trajectory it predicts. | Uses the teacher's instantaneous velocity at the predicted point, so it stays faithful to the flow geometry the teacher defines. | Used as a stabiliser; on its own produces over-smoothed samples on real images, per the paper's ablations. |
To replace classifier-free guidance during distillation, AYF uses autoguidance: the teacher is mixed with a weaker checkpoint of itself, $v_\phi^{\text{guided}} = \lambda v_\phi + (1 - \lambda) v_\phi^{\text{weak}}$ with $\lambda$ sampled uniformly from $[1, 3]$. This steers samples away from low-quality regions without the overshooting failure mode CFG can have.
Empirically, a small AYF student beats much larger distillation baselines at fewer network function evaluations (NFEs). The efficiency comes from the teacher anchor: unlike CTM, the student does not spend capacity reconciling self-generated targets at high noise levels, where those targets are least reliable. Full numbers are in the table at the end.
MeanFlow: ground truth for the jump function
MeanFlow [6] finds a quantity the network can predict whose true value is computable directly from data, with no teacher at all: the average velocity.
Average velocity: a ground-truth two-time quantity
“Average velocity” here means what it meant in physics class: total displacement divided by elapsed time. Go from $z_t$ to $z_r$ over an interval of length $t - r$, and the average velocity is the displacement divided by that interval:
This looks unhelpful at first, since $z_r$ is exactly the thing we cannot compute without integrating the ODE. But the linear interpolation paths make the numerator easy. Because $z_t = (1-t)x_0 + tx_1$, the difference $z_t - z_r$ simplifies. Subtract the two interpolations:
Divide by $(t-r)$ and the time variables cancel completely:
It is a fixed quantity for each training pair, read straight off the data with no network evaluation or approximation. This is the same quantity a shortcut model approximated with two half-steps, except MeanFlow has it in closed form, with no composition error. And it makes one-step generation exact in principle: start at pure noise, subtract the average velocity over the full interval, and the result is $x_0$ in a single move, with no integration error to accumulate.
There is a catch. This identity is only directly usable on the full interval $[0, 1]$, where ground-truth $(x_0, x_1)$ pairs are available. For an arbitrary intermediate triple $(z_t, r, t)$ during training, $z_r$ is unknown, and computing it means running the ODE, the slow integration we set out to avoid.
The MeanFlow identity
MeanFlow’s way around this is to rewrite the average velocity so it never mentions $z_r$. The end result, below, expresses $\bar u$ using only the instantaneous velocity at $z_t$ and a derivative of the network’s own output, both available without integrating anything. Start from the definition, written as an integral:
Now differentiate both sides with respect to $t$. The right side uses the fundamental theorem of calculus; the left side uses the product rule:
This gives a target for $\bar{u}$ with no integrals and no ODE simulation. The first term, $v(z_t, t)$, is the same data-supervised velocity flow matching uses. The second, $d\bar{u}/dt$, is the network differentiating its own output, so the target is part data supervision and part self-reference. The identity is exact, but that self-referential half is what has to be stabilised, and it is where the training difficulty comes from.
Computing $d\bar{u}/dt$: the Jacobian-vector product
The term $d\bar{u}/dt$ is a total derivative: how the network output changes as $t$ increases, through both the explicit dependence on $t$ and the implicit dependence through $z_t$, which itself moves along the flow. Expanding it by the chain rule gives a velocity-weighted spatial derivative plus an explicit time derivative,8 and the first part is a Jacobian-vector product (JVP): the Jacobian of the network output with respect to $z$, dotted with the velocity $v(z_t, t)$. Forward-mode autodiff computes the whole thing in a single modified forward pass (torch.func.jvp in PyTorch, jax.jvp in JAX), so the overhead is a fraction of one forward pass rather than the full second pass a teacher network would cost.
The full training loss applies stop-gradient to the entire target to avoid second-order gradients:
Why this loss is hard to train
α-Flow [7] takes apart the MeanFlow loss and shows that an exact identity does not guarantee stable optimisation: the loss splits into two components, TFM and TC, that fight each other early in training.
When $r = t$, the interval collapses to a single point. The average velocity over a zero-length interval is just the instantaneous velocity, so the MeanFlow identity reduces to $\bar{u} = v$ and the loss becomes ordinary flow matching. MeanFlow spends a large fraction of training in exactly this degenerate case, around three-quarters of samples in the paper’s main configuration. The reason shows up once the loss is split in two:
TFM is the flow matching component. It pushes the network to predict the instantaneous velocity field, supervised directly by data and stable to optimise. The large-fraction $r=t$ sampling ensures TFM dominates early training.
TC is the consistency component. It uses the JVP to keep predictions composing correctly across intervals, which is what gives MeanFlow its structure beyond plain flow matching. The JVP is also the source of the instability.
The TC gradient runs the Jacobian of the network output with respect to $z_t$ against the velocity vector. At high $t$ (near pure noise) the input carries almost no semantic signal, and early in training the weights are unstructured, so that Jacobian is essentially random: large in magnitude, arbitrary in direction. Multiplying a random matrix by the velocity vector gives a JVP that points nowhere useful.
So at high $t$ the TC gradients are large, random vectors that actively conflict with the TFM gradients. α-Flow documents this conflict and uses it to motivate a curriculum: a parameter $\lambda \in [0,1]$ interpolates between pure TFM ($\lambda=0$, just flow matching, completely stable) and full MeanFlow ($\lambda=1$). Training starts at $\lambda=0$ and increases as the velocity field converges, by which time the Jacobian at high $t$ begins to encode which direction the trajectory is heading, and the JVP carries real signal.
Same coarse-to-fine principle as the discretisation curriculum in consistency models, applied to a continuous parameter: stabilise the data-supervised component first; turn on the self-referential one only after the velocity field has converged enough for the JVP to mean something.
Where this leaves us
Here are all of them in one table.
| Method | Composition | Training target | Inference | Per-step cost |
|---|---|---|---|---|
| Flow matching | none | data-supervised | multi-step | regression |
| Consistency models | jump to $x_0$ | self-referential (EMA) | one-step | regression |
| CTM / Shortcut | any-pair / discrete sizes | self-referential (EMA) | one- or few-step | regression |
| Align Your Flow | any-pair via distillation | teacher-anchored | one- or few-step | regression + teacher |
| MeanFlow | continuous (semigroup identity) | data-supervised + self-referential mix | one-step | regression + JVP |
MeanFlow is the only row with a ground-truth target and one-step inference; it pays with the JVP and the training conflict from the previous section.
A local velocity is cheap to measure; the whole jump is not, and nothing measures it directly. So the network learns against a lagged copy of itself, which only works once the data-supervised part has converged, hence the curricula. The exception is having a good pretrained model to distil from, which replaces the lagged copy with a real target.
The benchmark numbers have largely caught up to multi-step diffusion. CIFAR-10 at 1 NFE is effectively saturated, and on ImageNet the teacher-anchored models pull ahead: AYF at 2 NFE reaches the best FID listed, and a 280M-parameter AYF-S beats the 1.5B-parameter sCD-XXL on IN-512.
| Method | NFE | Benchmark | FID ↓ | Training |
|---|---|---|---|---|
| Consistency Models [2] | 1 | CIFAR-10 | 3.55 | distillation / from scratch |
| iCT [3] | 1 | CIFAR-10 / IN-64 | 2.51 / 3.25 | from scratch |
| CTM [4] | 1 | CIFAR-10 / IN-64 | 1.73 / 1.92 | distillation + adversarial |
| MeanFlow [6] | 1 | IN-256 | 3.43 | from scratch |
| α-Flow [7] | 1 / 2 | IN-256 | 2.58 / 2.15 | from scratch (DiT) |
| Align Your Flow [8] | 1 | IN-64 | 2.98 | distillation (EMD) + optional adversarial |
| 2 | IN-64 | 1.25 | ||
| 4 | IN-512 (280M) | 1.70 0.24s |
IN = ImageNet. NFE = network function evaluations. AYF-S at 4 NFE (FID 1.70, 0.24s) outperforms sCD-XXL at 2 NFE (FID 1.88, 0.50s) using 5× fewer parameters.
This is one branch of the one-step literature. Parallel lines of work — flow rectification, distribution matching distillation, adversarial distillation — currently dominate at SDXL-scale text-to-image, and I don’t cover them here.
A few things feel unresolved to me. Guidance is the obvious one. CFG is what makes large-scale conditional diffusion deployable, and none of the one-step methods have a clean equivalent. AYF’s autoguidance is the best answer so far, but it needs a second trained model and only works in distillation. The architectures are also borrowed: every model here is a diffusion U-Net or DiT being repurposed, with the skip/output split from EDM and the two-time conditioning bolted on as an extra input embedding. I haven’t seen anyone ask what a network designed for the one-step objective from scratch would look like. MeanFlow’s $\bar u = x_1 - x_0$ identity is more fragile than it looks too; it relies on linear interpolation paths, and the moment you want curved schedules (which matter for sample quality at scale) the algebra stops and you are back to the integral form. And the benchmarks here are all ImageNet at 64, 256, and 512. A real one-step video model doesn’t exist yet.
My guess is the next jump is either an architecture redesign that bakes in the boundary and composition constraints, or a clean way to do guidance at one step. The compositional principle feels right. What is missing is the engineering around it.
What I find satisfying is that all five are answering one question: how do you make a long jump equal its composed shorter jumps without an integrator? The curricula, the EMA copies, the JVPs are just different ways of paying for that constraint.
References
- Lipman et al., Flow Matching for Generative Modeling, 2022.
- Song et al., Consistency Models, 2023.
- Song & Dhariwal, Improved Consistency Training for Consistency Models, 2023.
- Kim et al., Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion, ICLR 2024.
- Frans et al., One Step Diffusion via Shortcut Models, ICLR 2025.
- Geng et al., MeanFlow: Unified Average-Velocity Learning for Flow-Based Generative Models, 2025.
- Zhang et al., α-Flow: Unifying Flow Matching and Consistency Models, 2025.
- Sabour et al., Align Your Flow: Scaling Continuous-Time Flow Map Distillation, 2025.
- Lu & Song, Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models, 2024.
- Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models, NeurIPS 2022.
- Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021.
-
This marginal-equivalence is the result that opened up the entire deterministic-sampler line of work in diffusion (DDIM-style integrators, exact likelihood evaluation, every method in this post). The claim: the PF-ODE shares the same time-marginal distributions $p_t(x)$ as the stochastic diffusion process for every $t$, even though one is deterministic and the other is not. The PF-ODE drift is $\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}g(t)^2 \nabla_x \log p_t(x)$, with $f$ and $g$ the SDE’s drift and diffusion coefficients. The proof is direct: both processes induce the same Fokker–Planck equation for $p_t(x)$, so any solution to one is a solution to the other at the marginal level (Song et al. 2021, Appendix D.1). Trajectories differ, marginals do not. ↩
-
Any source distribution works in the flow-matching framework; Gaussian is convenient because the velocity calculation collapses cleanly, not because it is required. Recent work on optimal-transport flow matching exploits this freedom directly. ↩
-
This is a deliberate inheritance choice, not a derivation. EDM uses these schedules to keep the network’s input and output magnitudes well-conditioned across noise levels (specifically $c_\text{skip}(t) = \sigma_\text{data}^2 / (\sigma_\text{data}^2 + t^2)$ and $c_\text{out}(t) = t \cdot \sigma_\text{data} / \sqrt{\sigma_\text{data}^2 + t^2}$ in the EDM noise parameterisation). Borrowing them lets consistency models drop into existing diffusion architectures with no structural changes, just a different head and loss. ↩
-
Where the PDE comes from: differentiate $f(z_t, t) = f(z_{t-\Delta}, t-\Delta)$ with respect to $\Delta$ at $\Delta = 0$. The right side gives $-\partial_t f - \partial_z f \cdot (dz/dt)$. Along the PF-ODE, $dz/dt$ is the velocity $v(z_t, t)$. Setting the derivative to zero (so consistency holds for every small $\Delta$, not just at sampled pairs) gives $\partial_t f + v(z_t, t) \cdot \partial_z f = 0$. This is the transport equation for $f$ along the flow, the method-of-characteristics statement that $f$ is constant along PF-ODE trajectories. ↩
-
The JAX autodiff cookbook is a good primer on JVPs and forward-mode autodiff if you want a refresher. ↩
-
Briefly: pseudo-Huber is a smooth approximation to the Huber loss, which behaves like $L_2$ near zero and like $L_1$ further out; it cuts the variance contribution from a few large errors that otherwise destabilise consistency training. The lognormal noise schedule concentrates training samples around noise levels where the loss is most informative, instead of uniformly over $t$. Progressive step doubling runs the discretisation curriculum on a $\log_2 N$ schedule, doubling $N$ at preset training milestones rather than tuning a continuous ramp. iCT also removes the EMA on the target network (setting its decay to zero, so the target is a stop-gradient copy of the current weights rather than a slowly-tracking lag), which is a more important change than its placement in the trick list suggests. ↩
-
To be precise about the existential form: the theorem says that for any $\delta > 0$, there exists a consistency model $f$ with $\mathbb{E}\lVert f(z_t,t) - f^{\ast}(z_t,t)\rVert_2^2 < \delta$ uniformly in $t$, and some $N$ beyond which extra sampling steps make the generated distribution worse in $W_2$. It is therefore not a statement about every imperfect CM, but about the existence of arbitrarily-close-to-optimal ones with this pathology, which is enough to make the theorem load-bearing in the post’s argument: the failure mode is not a “you trained badly” artifact. The proof and the Gaussian assumption together yield a closed-form analysis; whether the result extends to non-Gaussian data is not formally settled. ↩
-
Written out, $\frac{d\bar{u}}{dt} = \frac{\partial \bar{u}}{\partial z}\cdot v(z_t, t) + \frac{\partial \bar{u}}{\partial t}$. In practice the two terms are not computed separately: forward-mode autodiff evaluates the whole total derivative in one pass by pushing the tangent $(v, 1)$ through the network, which is exactly what
jvpdoes. ↩