One-step generative models

Diffusion and flow models denoise step by step, usually 50 or more steps. Each step is a full forward pass through a billion-parameter network. A single SDXL image takes several seconds on an A100 this way, even though the network itself runs in tens of milliseconds. Getting to few-step generation means cutting down the number of denoising steps. The most promising way to do that right now is to learn the jump between noise levels directly, a flow map, instead of learning the local velocity and integrating it. MeanFlow is the cleanest version of this idea I’ve seen. I wanted to write down how it works, where the math comes from, and why it’s so hard to train.

The post leans on a small set of symbols, collected here for reference. Each is also defined where it first appears, so you can skip this and refer back.

Symbol	Meaning
$x_0,\ x_1$	a clean data sample and a Gaussian noise sample, the two ends of the trajectory
$t$	how noisy the sample is, from $t=0$ (clean data) to $t=1$ (pure noise)
$s,\ r$	other time values, the destination of a jump (a cleaner level than $t$)
$z_t$	the partly-noised sample at time $t$, somewhere between $x_0$ and $x_1$ (likewise $z_s$, $z_r$ at other times)
$v(z_t, t)$	velocity: the direction and speed to move at $z_t$ to follow the trajectory
$\bar{u}(z_t, r, t)$	average velocity from $t$ to $r$: the displacement divided by the elapsed time
$\theta$	the trained network's weights; $f_\theta$ is a network with weights $\theta$, and $\theta^-$ is a slow-moving copy of them

Generation as transport

Generating a sample (say, an image of a dog) means moving probability mass from a noise distribution to the data distribution. This transport can be stochastic (a forward and reverse diffusion SDE) or deterministic (an ODE); every method in this post lives in the deterministic regime, on a path called the probability flow ODE (PF-ODE) ^[11].¹ Parameterise it by time $t$ running from $t=1$ (pure Gaussian noise) to $t=0$ (clean data). Write $z_t$ for the point on the trajectory at time $t$, and $x_0$ for the clean sample at $t=0$. Because it is deterministic, the path can be run forwards or backwards exactly.

Think of a soap bubble. Press it slowly and the surface deforms; every point on the film follows a smooth path. Release the pressure and it snaps back along the exact same path. The PF-ODE is that elastic surface in probability space, and $z_t$ is one spot on the surface at time $t$.

Standard diffusion and flow models learn to estimate the PF-ODE velocity locally, then numerically solve the learned ODE step by step from noise to data. At each $z_t$ the network gives the speed and direction to move to keep heading toward the dog (or whatever target the trajectory points to). Apply that, take a step, repeat.

A noise distribution flowing into a data distribution

The PF-ODE moves a probability distribution from pure noise at $t=1$ into the data distribution at $t=0$. The five snapshots show the cloud progressively contracting and structuring itself as time decreases. A single sample $z_t$ rides along this flow; every one-step method in the post is trying to skip directly from one snapshot to another.

The core problem with step-by-step integration is that the local velocity at $z_t$ says nothing about where the trajectory ends up globally. The flow has to be followed closely, one small step at a time, or it drifts off course and ends up somewhere wrong. Following it that closely is what makes it expensive.

Flow matching

Flow matching ^[1] is the foundation every one-step method builds on or borrows from. It learns a continuous-time flow that moves probability mass from one distribution to another. The source and destination can be any two distributions; unlike diffusion, which fixes the noisy end to Gaussian noise by construction, flow matching has no such constraint. The simplest useful case uses Gaussian noise as the source, which gives straight-line paths between noise and data:²

$$z_t \;=\; (1 - t)\, x_0 \;+\; t\, x_1$$

$x_0$ is clean data, $x_1 \sim \mathcal{N}(0,I)$, $t \in [0,1]$. At $t=0$ the sample is data; at $t=1$ it is noise.

The velocity along this path is constant: $v = \frac{dz_t}{dt} = x_1 - x_0$. It can be read directly off a training pair $(x_0, x_1)$, with no integration or simulation. A neural network $v_\theta(z_t, t)$ is then trained to predict it at every $(z_t, t)$ by minimising the regression loss $\lVert v_\theta(z_t, t) - (x_1 - x_0) \rVert^2$, which is ordinary supervised learning against a known target.

Inference is still slow, though. Each individual path $x_0 \leftrightarrow x_1$ is a straight line, but the marginal velocity field is not. At any given noisy image $z_t$, many different clean images $x_0$ are plausible, and each candidate has its own straight-line velocity pointing in a slightly different direction. The network has to output the probability-weighted average of all those directions, which traces a curved path through image space. Following a curved path with only local velocity information takes many small steps.

Scrub the demo below to watch this happen: at $t=1$ the field points toward the centroid (no cluster has been chosen), and as $t$ decreases the weights concentrate and particles fan out toward different clusters. Try the candidates buttons (1, 3, 5) to see how a single cluster gives a uniform field while multiple clusters force curvature.

Marginal velocity field with k clusters

t = 1.00

candidates

1 destination: field is uniform everywhere. A single step from any noise point lands exactly at x₀.

The marginal velocity field $v(z,t)$ (amber arrows) and its live weight decomposition (bottom panel). Each cluster on the right represents a region of data space: cats, dogs, cars. At $t=1$ (pure noise) the Gaussian weights $w_i \propto \exp(-\|z - z_t^{(i)}\|^2 / 2\sigma^2)$ over all clusters are nearly equal: the particle has no information about which cluster it will become, so the field averages all their directions and points toward the centroid. As $t$ decreases the weights concentrate: the particle's position becomes informative about which cluster it is heading to, and the field progressively commits. With one cluster the field is uniform and one step is exact. With multiple clusters, a single large step follows the initial average direction and lands between all clusters. The red ghost shows exactly where.

So flow matching has a simple training objective and slow inference, because the marginal field it learns is curved and has to be followed in small steps. The rest of this post is about recovering fast inference without giving up that training simplicity. There are two approaches:

Jump to the endpoint directly. Learn a function that maps any trajectory point straight to $x_0$.
Jump to any point, not just the endpoint. Learn a two-time function that goes from any $t$ to any $s < t$ in one step. This is the flow map idea, and the more general goal most of this post is about.

Consistency models

Consistency models ^[2] were the first method to generate competitive samples in a single step. Instead of learning a velocity and integrating it, they learn the endpoint map directly, a function $f_\theta$ that sends any point on a trajectory to that trajectory’s clean endpoint $x_0$:

$$f_\theta(z_t,\, t) \;=\; x_0 \qquad \text{for all } t \text{ on the same trajectory}$$

Applied once from pure noise, this lands directly on a clean image. Drawn along one trajectory:

same trajectory:

noise side                         clean side
z_1  ─────── z_t ─────── z_s ─────── x_0

f_θ(z_1, 1) → x_0
f_θ(z_t, t) → x_0
f_θ(z_s, s) → x_0
f_θ(x_0, 0) → x_0

For this to work, the function needs two properties. The boundary condition: at $t = 0$ (or, more precisely, a small cutoff $\varepsilon$ near zero) the function must be the identity, $f_\theta(x_0, \varepsilon) = x_0$. A clean image maps to itself. Without this anchor the network can satisfy the rest of the loss by collapsing to a constant.

The consistency condition: any two points on the same PF-ODE trajectory must map to the same $x_0$. Two different noisy versions of one clean image, fed through the network, should produce identical outputs. Without it the function is only pinned down locally and never becomes globally coherent. The figure below shows this: every point along one trajectory maps to the same destination.

Figure 1 from Song et al. (2023). A row of images along the probability flow ODE from clean data (a dog) at t=0 progressively to pure noise at t=T. Red arrows below show the consistency function f_theta mapping multiple noisy intermediate points (x_t, x_t', x_T) all back to the same clean endpoint x_0. — The consistency function $f_\theta$ takes any point on the PF-ODE trajectory and maps it back to the same clean endpoint $x_0$. The condition is consistency across the whole trajectory, not just at individual points.

Building $f_\theta$ so the boundary identity actually holds takes some care. The naive way is piecewise: set $f_\theta(x, t) = x$ at $t = \varepsilon$ and $f_\theta(x, t) = F_\theta(x, t)$ otherwise, with $F_\theta$ a free neural network. That works for the discrete-time loss but breaks under continuous-time training, since the function is no longer differentiable at $\varepsilon$.

The solution, which has stuck since, is to wire the boundary identity into the architecture with a skip / output split:

$$f_\theta(x, t) \;=\; c_\text{skip}(t)\, x \;+\; c_\text{out}(t)\, F_\theta(x, t)$$

Two scalar schedules $c_\text{skip}(t)$ and $c_\text{out}(t)$ are differentiable functions of $t$ designed so that $c_\text{skip}(\varepsilon) = 1$ and $c_\text{out}(\varepsilon) = 0$.

At $t = \varepsilon$ the formula collapses to $f_\theta(x, \varepsilon) = 1 \cdot x + 0 \cdot F_\theta = x$, so the boundary condition holds by construction and the loss never has to enforce it. And since $c_\text{skip}$, $c_\text{out}$, and $F_\theta$ are all differentiable in $t$, so is $f_\theta$, which is exactly what continuous-time training needs.

The specific functional forms for $c_\text{skip}$ and $c_\text{out}$ are inherited directly from the EDM (elucidating diffusion models) preconditioning ^[10].³

Schedule weights along the PF-ODE

The two schedules cross as $t \to \varepsilon$: the skip weight $c_{\mathrm{skip}}(t)$ rises toward one while the trunk weight $c_{\mathrm{out}}(t)$ falls to zero, leaving $f_\theta \approx z_t$ at the clean end.

Enforcing consistency via self-distillation

Which trajectory a given $z_t$ belongs to is never directly observed, so the pairs of points that should agree cannot be enumerated directly. The workaround is to take two adjacent points on the same trajectory, $z_t$ and $z_{t-\Delta}$ a small step apart, and require their predictions to match:

$$\mathcal{L}_\text{CD} \;=\; \mathbb{E}\,\bigl\lVert f_\theta(z_t,\,t) \;-\; \operatorname{sg}\!\bigl(f_{\theta^-}(z_{t-\Delta},\,t-\Delta)\bigr) \bigr\rVert^2$$

sg = stop-gradient. $\theta^-$ = EMA copy of $\theta$, updated slowly as $\theta^- \leftarrow m\,\theta^- + (1-m)\,\theta$.

The EMA copy $\theta^-$ updates slowly, typically $m \approx 0.99$, so the target moves at about 1% of the main network’s speed. If both sides of the loss moved together, they could chase each other into the constant solution from before; the stop-gradient breaks that symmetry, so gradients flow only through the left side and $\theta$ chases a target that drifts slowly underneath it. Same target network trick that stabilised DQN.⁴

Getting the adjacent point $z_{t-\Delta}$ requires one step of the PF-ODE, and this is where the teacher/student framing becomes explicit. In consistency distillation (CD), a pretrained diffusion model is the teacher: it provides reliable one-step ODE moves onto the true trajectory, and the student learns to jump straight to $x_0$ from any point on them. In consistency training (CT) there is no teacher; the network produces its own one-step moves, which adds noise and makes training harder to stabilise. CD converges faster and produces better results; CT drops the teacher dependency but needs more careful engineering.

A note on language, since three different things in this post get called “teacher.” Data supervision is clean targets read straight from training pairs, like flow matching’s $v = x_1 - x_0$. A pretrained teacher is a separately trained external network, the kind used in CD. The EMA copy $\theta^-$ is the network’s own slow-moving lag of itself. That is exactly how CT and CD differ: CT has only the EMA copy, CD has both. From here on I’ll say “EMA copy” and “pretrained teacher” to keep them apart, since later methods lean on one or the other.

The discretisation curriculum

The choice of $\Delta$ hides a tradeoff. Divide the time axis into $N$ steps, with adjacent training pairs one step apart, so $\Delta = 1/N$.

Small $N$ means a large gap. The two points are far apart on the trajectory, so the signal is strong, but a large ODE step carries large discretisation error and $z_{t-\Delta}$ only approximately lands on the true trajectory. The network is then trained to match a target that is itself a bit wrong.

Large $N$ means a small gap. A tiny ODE step is nearly exact, so the targets are accurate, but the two points are so close that their predictions already agree and the gradient is too small to learn from.

Neither extreme works. The remedy is a curriculum: start small (strong signal, rough targets), then grow $N$ (weak signal, accurate targets), so the network learns a coarse consistency function and refines it. Slide $N$ in the demo below: at $N=1$ the Euler step from $z_t$ falls far off the true curve (large red gap, strong gradient); at $N=8$ it tracks the curve but barely moves the network.

Discretisation tradeoff: signal vs target accuracy

steps N N = 1

N=1: one step covers the whole trajectory. The tangent step at z_t lands far from the true curve. Large training signal, noisy target.

The PF-ODE trajectory (grey) curves because the marginal velocity field curves. At each marked time step, the tangent arrow shows the local velocity. A single Euler step along that tangent departs from the true curve; the error is the red gap. More steps reduce the gap but shrink the per-step gradient.

Discrete-time vs continuous-time

What I described above is the discrete-time formulation: pick a grid of $N$ noise levels, define adjacent pairs on that grid, and run the curriculum on $N$. The grid exists only because the consistency condition cannot be enforced over a continuum, only at sampled pairs of points, and the curriculum on $N$ is there to manage the bias-variance tradeoff the grid introduces.

The continuous-time formulation removes the grid. Differentiating the consistency condition $f(z_t, t) = f(z_{t-\Delta}, t-\Delta)$ as $\Delta \to 0$ gives a PDE-style identity, $\partial_t f + v(z_t, t) \cdot \partial_z f = 0$ along the PF-ODE,⁵ which the loss enforces at sampled $(z_t, t)$ pairs with no adjacent point and no grid to schedule. sCT and sCD, the simplified continuous-time variants of CT and CD ^[9], use this and produce sharper results, since the finite-$\Delta$ bias is gone. The cost is a Jacobian-vector product (JVP) to compute $\partial_z f \cdot v$: forward-mode autodiff gives it in one modified forward pass without ever forming the full Jacobian.⁶ MeanFlow reuses the same machinery later.

Both formulations are hard to train in their basic form. iCT (improved consistency training) ^[3] closed much of the gap to diffusion with a set of stability fixes: pseudo-Huber losses, a lognormal noise schedule, and progressive discretisation step doubling.⁷ All of that tuning is needed because of how the training target is defined.

The training target here is always behavioural: it constrains what the network outputs at adjacent pairs of points, not what the underlying field should be. There is no ground truth for $f(z_t, t)$ independent of the network. The optimal function is defined only implicitly, through the consistency and boundary conditions, and the only way to learn it is to have the network agree with itself across adjacent pairs. That is inherently noisy and sensitive to hyperparameters.

A deeper limitation: consistency models are stuck on a single destination, the endpoint $x_0$. The signature is $f(z_t, t) = x_0$: given a point and its time, the network predicts the endpoint and nothing else. There is no way to ask it to jump to an intermediate point, so multi-step generation means running the network repeatedly and renoising between evaluations, a workaround that does not actually use the trajectory structure. The natural fix is to let the jump function land anywhere, not just $x_0$.

Consistency trajectory models: any-to-any jumps

CTM ^[4] generalises consistency models. Where a consistency function always jumps to the end of the PF-ODE trajectory, CTM lets it jump to any point along the way. This is the two-time function:

$$G_\theta(z_t,\, t,\, s) \;=\; z_s$$

From any point $z_t$ at time $t$, jump to the state $z_s$ at time $s$. Consistency models are the special case $s=0$.

With this object a sampler can take large or small steps, land at any intermediate point on the trajectory, and compose multiple jumps to refine a generation. For this to be well-posed, the function has to satisfy the semigroup property. A jump from $t$ to some intermediate $u$ followed by a jump from $u$ to $s$ should give the same result as a direct jump from $t$ to $s$:

$$G_\theta(z_t,\, t,\, s) \;=\; G_\theta\!\bigl(G_\theta(z_t,\, t,\, u),\; u,\; s\bigr) \qquad \text{for any } u \in (s,t)$$

One large jump = two composed smaller jumps. This is the composition rule at the heart of every flow-map method.

Drag the split point $u$ in the figure below to see this in action: the direct jump (top arc) and the two-leg composition (bottom arcs) always end at the same destination, wherever the split falls.

Semigroup property: one jump = two composed jumps

split point u = 0.40

Drag the split point. Both routes (direct and two-step) always land at the same destination.

The semigroup property: one jump from $t$ to $r$ equals two composed jumps through any intermediate $u$. Every flow-map method enforces exactly this constraint.

Training enforces this by sampling triples $(r, s, t)$ with $r < s < t$ and comparing the direct jump $G_\theta(z_t, t, r)$ against the composed two-step jump:

$$\mathcal{L}_\text{CTM} \;=\; \mathbb{E}\,\bigl\lVert G_\theta(z_t,\,t,\,r) \;-\; \operatorname{sg}\!\bigl(G_{\theta^-}\!\bigl(G_{\theta^-}(z_t,\,t,\,s),\;s,\;r\bigr)\bigr) \bigr\rVert^2$$

Same stop-gradient and EMA trick as consistency models. $\theta^-$ is used for both the inner and outer jump on the target side.

The flexible jump function makes multi-step generation more natural than in consistency models: chain calls with progressively smaller target times, no renoising needed. At the time of publication, CTM held the best single-step FID numbers (see the results table below).

Like consistency models, CTM still trains against a self-referential target. $G_{\theta^-}$ is the network evaluated at a slightly lagged copy of itself; there is no ground-truth two-time map independent of the network, and the only supervision comes from the model agreeing with itself across different decompositions. The richer semigroup structure makes this more stable than the consistency-model case, but the self-reference remains. CTM is also complicated in practice: random triples $(r, s, t)$, two separate network evaluations on the target side, and several components to coordinate.

Shortcut models

Shortcut models ^[5] trade CTM’s continuous triples for a discrete set of jump sizes and get a much simpler procedure in exchange. Where CTM conditions on the two absolute times $(t, s)$, a shortcut model conditions on the current noise level $t$ and a step size $d = t - s$, so it predicts where a jump of size $d$ from $z_t$ lands.

$$v_\theta(z_t,\, t,\, d) \;\approx\; \frac{z_t - z_{t-d}}{d}$$

Predict the average displacement per unit time over a step of size $d$. At $d \to 0$ this recovers instantaneous velocity.

A plain flow matching network only knows “I am at noise level $t$.” A shortcut model also knows “and I want to travel a distance $d$ in one step.” That extra input lets it calibrate the prediction to the jump size, so the same network can take a big step or a small one at different points during generation.

The bootstrapping training procedure

The training procedure has to work around one problem: the ground-truth $z_{t-d}$ cannot be computed directly without running the full ODE. So the target is built from smaller steps the network can already make: two half-steps from the (stop-gradient) EMA copy are composed into a single full-step target the student matches. The picture first, then the equations.

Shortcut bootstrapping: half-steps teach full steps

Read right-to-left in time. Starting from $z_t$, the EMA network takes one half-step (amber) to reach $z_{t-d/2}$, then another half-step (sage) to land at $z_{t-d}$. The composition of those two half-steps becomes the target. The student (dashed blue) is trained to match it in a single jump of size $d$.

In equations, this is the semigroup property enforced discretely:

$$v_\theta(z_t,\,t,\,d) \;=\; \operatorname{compose}\!\bigl(v_{\theta^-}(z_t,\,t,\,d/2),\;\; v_{\theta^-}(z_{t-d/2},\,t-d/2,\,d/2)\bigr)$$

One large step of size $d$ = two composed half-steps of size $d/2$. $\theta^-$ is the EMA copy, the same stop-gradient stabiliser that appears in consistency models.

Training bootstraps upward from the smallest steps, where the half-step approximation is most accurate: one-step jumps first, then two-step, then four-step, each size built from the one below it. At inference any step count works, one for speed or many for quality.

Frans et al. draw the same construction with the actual loss notation overlaid:

Figure 3 from Frans et al. (2024). Overview of shortcut model training: at d≈0 the objective reduces to flow matching against the empirical velocity x1−x0; targets for larger d are constructed by concatenating two d/2 shortcuts, with the network conditioning on the step size d. — Same idea as the schematic above, drawn the way the paper presents it: at $d \to 0$ the loss matches the empirical flow-matching velocity $x_1 - x_0$; for larger $d$, the target is built by composing two half-step predictions from the EMA model.

Shortcut models enforce composition by directly comparing network outputs: no differentiation through the network, no JVP, just a fast per-step update. The tradeoff is that the discrete half-step approximation introduces small errors that compound across many composed steps, and like CTM and consistency models before it, the training target is still self-referential. The next two methods both reduce that self-reference, in opposite ways. Align Your Flow brings in external ground truth from a pretrained teacher; MeanFlow finds internal ground truth, an exact identity that lets the network supervise itself against quantities computable directly from data.

Align Your Flow: distilling the jump function

CTM and Shortcut both train the flow map $G_\theta(z_t, t, s) = z_s$ from scratch, which is what forces the curriculum schedules and EMA copies on the target side. Align Your Flow ^[8] avoids that by distilling the map from a pretrained diffusion teacher, whose ODE trajectories give the ground truth the from-scratch methods never had.

Overview of Flow Maps from Align Your Flow (Sabour et al., 2025). Three panels show Consistency Model (s=0), Flow Map (any s,t), and Flow Matching (s→t), with their respective training objectives below. — Flow maps generalise both consistency models and flow matching by connecting any two noise levels $(s, t)$ in a single step. Setting $s=0$ recovers a consistency model; letting $s \to t$ recovers standard flow matching.

This fixes a blind spot in CTM. CTM trains the jump function so composed short jumps reproduce long ones, but it only ever checks the jumps against each other, never against a correct answer. The teacher supplies that answer: the student’s jumps are trained to trace real ODE trajectories, so internal consistency now rests on an external anchor instead of standing alone.

The paper also proves consistency models eventually get worse with more steps. Theorem 3.1: for an isotropic Gaussian, there exist consistency models arbitrarily close to optimal in $L_2$ for which adding sampling steps beyond some $N$ increases the Wasserstein-2 distance to the true distribution.⁸ Fig. 5 of the paper shows it empirically: multi-step CM sampling peaks around 2 steps, then degrades. The mechanism is renoising. CMs jump to clean and reinject Gaussian noise to get back onto the trajectory, and over many steps that noise drifts off the teacher’s PF-ODE, so errors compound.

Flow maps avoid this failure by construction, because they step directly between noise levels and never renoise. The paper does not prove they improve monotonically, but empirically they keep getting better with more steps, exactly where CMs fall apart.

The student enforces consistency against the teacher with two losses, borrowing the fluid-dynamics distinction between Eulerian (fixed observer, watch the field go by) and Lagrangian (move along with the particle) frames. They differ in which time variable they perturb.

Objective	What's varied	Why it works	Empirical role
EMD (Eulerian Map Distillation)	Endpoint $s$ fixed, perturb starting time $t$; check that $G_\theta(z_t, t, s)$ is invariant as $t$ moves along the teacher trajectory.	Generalises the continuous-time consistency and flow matching losses as the two limits of $s$; the structurally natural object to optimise.	Primary loss in all main results.
LMD (Lagrangian Map Distillation)	Starting point $t$ fixed, perturb endpoint $s$; check that $G_\theta(z_t, t, s)$ moves correctly as $s$ slides along the trajectory it predicts.	Uses the teacher's instantaneous velocity at the predicted point, so it stays faithful to the flow geometry the teacher defines.	Used as a stabiliser; on its own produces over-smoothed samples on real images, per the paper's ablations.

To replace classifier-free guidance during distillation, AYF uses autoguidance: the teacher is mixed with a weaker checkpoint of itself, $v_\phi^{\text{guided}} = \lambda v_\phi + (1 - \lambda) v_\phi^{\text{weak}}$ with $\lambda$ sampled uniformly from $[1, 3]$. Since $\lambda > 1$, this extrapolates past the teacher, away from what the weaker checkpoint would have predicted, without CFG’s overshooting failure mode.

Empirically, a small AYF student beats much larger distillation baselines at fewer network function evaluations (NFEs). The efficiency comes from the teacher anchor: unlike CTM, the student does not spend capacity reconciling self-generated targets at high noise levels, where those targets are least reliable. Full numbers are in the table at the end.

MeanFlow: ground truth for the jump function

MeanFlow ^[6] finds a quantity the network can predict whose true value is computable directly from data, with no teacher at all: the average velocity.

Average velocity: a ground-truth two-time quantity

“Average velocity” here means what it meant in physics class: total displacement divided by elapsed time. Go from $z_t$ to $z_r$ over an interval of length $t - r$, and the average velocity is the displacement divided by that interval:

$$\bar{u}(z_t,\, r,\, t) \;=\; \frac{z_t - z_r}{t - r}$$

$z_r$ is the point reached by following the PF-ODE from $z_t$ back to time $r$. ($r$ here plays the same role as the target time $s$ did for CTM and AYF; I keep the MeanFlow paper's letter.)

This looks unhelpful at first, since $z_r$ is exactly the thing we cannot compute without integrating the ODE. But the linear interpolation paths make the numerator easy. Because $z_t = (1-t)x_0 + tx_1$, the difference $z_t - z_r$ simplifies. Subtract the two interpolations:

$$z_t - z_r \;=\; \bigl[(1-t)x_0 + t\,x_1\bigr] - \bigl[(1-r)x_0 + r\,x_1\bigr] \;=\; (t - r)(x_1 - x_0)$$

The $x_0$ and $x_1$ coefficients combine into a single $(t-r)$ factor.

Divide by $(t-r)$ and the time variables cancel completely:

$$\bar{u}(z_t,\, r,\, t) \;=\; \frac{z_t - z_r}{t - r} \;=\; x_1 - x_0$$

The average velocity over any interval is just $x_1 - x_0$, with the $r$ and $t$ dependence gone and no integration required.

It is a fixed quantity for each training pair, read straight off the data with no network evaluation or approximation. This is the same quantity a shortcut model approximated with two half-steps, except MeanFlow has it in closed form, with no composition error. And it makes one-step generation exact in principle: start at pure noise, subtract the average velocity over the full interval, and the result is $x_0$ in a single move, with no integration error to accumulate.

$$x_0 \;=\; z_1 \;-\; \bar{u}_\theta(z_1,\, 0,\, 1)$$

One network call. The whole reason MeanFlow exists.

There is a catch. This identity is only directly usable on the full interval $[0, 1]$, where ground-truth $(x_0, x_1)$ pairs are available. For an arbitrary intermediate triple $(z_t, r, t)$ during training, $z_r$ is unknown, and computing it means running the ODE, the slow integration we set out to avoid.

The MeanFlow identity

MeanFlow’s way around this is to rewrite the average velocity so it never mentions $z_r$. The end result, below, expresses $\bar u$ using only the instantaneous velocity at $z_t$ and a derivative of the network’s own output, both available without integrating anything. Start from the definition, written as an integral:

$$(t - r)\cdot \bar{u}(z_t,\, r,\, t) \;=\; \int_r^t v(z_\tau,\, \tau)\, d\tau$$

Average velocity × time = total displacement = integral of instantaneous velocity. Think: distance = speed × time.

Now differentiate both sides with respect to $t$. The right side uses the fundamental theorem of calculus; the left side uses the product rule:

$$\bar{u}(z_t,\, r,\, t) \;=\; v(z_t,\, t) \;-\; (t - r)\cdot \frac{d\bar{u}}{dt}$$

The MeanFlow identity. $v(z_t, t)$ is the instantaneous flow matching velocity (ground truth from data). $d\bar{u}/dt$ is the total time derivative of the network output.

This gives a target for $\bar{u}$ with no integrals and no ODE simulation. The first term, $v(z_t, t)$, is the same data-supervised velocity flow matching uses. The second, $d\bar{u}/dt$, is the network differentiating its own output, so the target is part data supervision and part self-reference. The identity is exact, but that self-referential half is what has to be stabilised, and it is where the training difficulty comes from.

Computing $d\bar{u}/dt$: the Jacobian-vector product

The term $d\bar{u}/dt$ is a total derivative: how the network output changes as $t$ increases, through both the explicit dependence on $t$ and the implicit dependence through $z_t$, which itself moves along the flow. Expanding it by the chain rule gives a velocity-weighted spatial derivative plus an explicit time derivative,⁹ and the first part is a Jacobian-vector product (JVP): the Jacobian of the network output with respect to $z$, dotted with the velocity $v(z_t, t)$. Forward-mode autodiff computes the whole thing in a single modified forward pass (torch.func.jvp in PyTorch, jax.jvp in JAX), so the overhead is a fraction of one forward pass rather than the full second pass a teacher network would cost.

The full training loss applies stop-gradient to the entire target to avoid second-order gradients:

$$\mathcal{L}_\text{MF} \;=\; \mathbb{E}\,\bigl\lVert \bar{u}_\theta(z_t,\,r,\,t) \;-\; \operatorname{sg}\!\Bigl[v_\text{FM}(z_t,t) \;-\; (t-r)\cdot\tfrac{d\bar{u}_\theta}{dt}\Bigr] \bigr\rVert^2$$

Stop-gradient prevents gradients from flowing through the target. $v_\text{FM}$ is the flow matching ground-truth velocity.

Why this loss is hard to train

α-Flow ^[7] takes apart the MeanFlow loss and shows that an exact identity does not guarantee stable optimisation: the loss splits into two components, TFM and TC, that fight each other early in training.

When $r = t$, the interval collapses to a single point. The average velocity over a zero-length interval is just the instantaneous velocity, so the MeanFlow identity reduces to $\bar{u} = v$ and the loss becomes ordinary flow matching. MeanFlow spends a large fraction of training in exactly this degenerate case, around three-quarters of samples in the paper’s main configuration. The reason shows up once the loss is split in two:

$$\mathcal{L}_\text{MF} \;=\; \mathcal{L}_\text{TFM} \;+\; \mathcal{L}_\text{TC}$$

TFM = trajectory flow matching (data-supervised). TC = trajectory consistency (JVP-based).

TFM is the flow matching component. It pushes the network to predict the instantaneous velocity field, supervised directly by data and stable to optimise. The large-fraction $r=t$ sampling ensures TFM dominates early training.

TC is the consistency component. It uses the JVP to keep predictions composing correctly across intervals, which is what gives MeanFlow its structure beyond plain flow matching. The JVP is also the source of the instability.

The TC gradient runs the Jacobian of the network output with respect to $z_t$ against the velocity vector. At high $t$ (near pure noise) the input carries almost no semantic signal, and early in training the weights are unstructured, so that Jacobian is essentially random: large in magnitude, arbitrary in direction. Multiplying a random matrix by the velocity vector gives a JVP that points nowhere useful.

So at high $t$ the TC gradients are large, random vectors that actively conflict with the TFM gradients. α-Flow documents this conflict and uses it to motivate a curriculum: a parameter $\lambda \in [0,1]$ interpolates between pure TFM ($\lambda=0$, just flow matching, completely stable) and full MeanFlow ($\lambda=1$). Training starts at $\lambda=0$ and increases as the velocity field converges, by which time the Jacobian at high $t$ begins to encode which direction the trajectory is heading, and the JVP carries real signal.

It’s the same coarse-to-fine principle as the discretisation curriculum in consistency models, applied to a continuous parameter instead of a step count: stabilise the data-supervised component first, then turn on the self-referential one only after the velocity field has converged enough for the JVP to mean something.

Where this leaves us

Here are all of them in one table.

Method	Composition	Training target	Inference	Per-step cost
Flow matching	none	data-supervised	multi-step	regression
Consistency models	jump to $x_0$	self-referential (EMA)	one-step	regression
CTM / Shortcut	any-pair / discrete sizes	self-referential (EMA)	one- or few-step	regression
Align Your Flow	any-pair via distillation	teacher-anchored	one- or few-step	regression + teacher
MeanFlow	continuous (semigroup identity)	data-supervised + self-referential mix	one-step	regression + JVP

MeanFlow is the only row with a ground-truth target and one-step inference; it pays with the JVP and the training conflict from the previous section.

A local velocity is cheap to measure; the whole jump is not, and nothing measures it directly. So the network learns against a lagged copy of itself, which only works once the data-supervised part has converged, hence the curricula. The exception is having a good pretrained model to distil from, which replaces the lagged copy with a real target.

The benchmark numbers have largely caught up to multi-step diffusion. CIFAR-10 at 1 NFE is effectively saturated, and on ImageNet the teacher-anchored models pull ahead: AYF at 2 NFE reaches the best FID listed, and a 280M-parameter AYF-S beats the 1.5B-parameter sCD-XXL on IN-512.

Method	NFE	Benchmark	FID ↓	Training
Consistency Models ^[2]	1	CIFAR-10	3.55	distillation / from scratch
iCT ^[3]	1	CIFAR-10 / IN-64	2.51 / 3.25	from scratch
CTM ^[4]	1	CIFAR-10 / IN-64	1.73 / 1.92	distillation + adversarial
MeanFlow ^[6]	1	IN-256	3.43	from scratch
α-Flow ^[7]	1 / 2	IN-256	2.58 / 2.15	from scratch (DiT)
Align Your Flow ^[8]	1	IN-64	2.98	distillation (EMD) + optional adversarial
	2	IN-64	1.25
	4	IN-512 (280M)	1.70 0.24s

IN = ImageNet. NFE = network function evaluations. AYF-S at 4 NFE (FID 1.70, 0.24s) outperforms sCD-XXL at 2 NFE (FID 1.88, 0.50s) using 5× fewer parameters.

This is one branch of the one-step literature. Parallel lines of work — flow rectification, distribution matching distillation, adversarial distillation — currently dominate at SDXL-scale text-to-image, and I don’t cover them here.

A few things feel unresolved to me. Guidance is the obvious one. CFG is what makes large-scale conditional diffusion deployable, and none of the one-step methods have a clean equivalent. AYF’s autoguidance is the best answer so far, but it needs a second trained model and only works in distillation. The architectures are also borrowed: every model here is a diffusion U-Net or DiT being repurposed, with the skip/output split from EDM and the two-time conditioning bolted on as an extra input embedding. I haven’t seen anyone ask what a network designed for the one-step objective from scratch would look like. MeanFlow’s $\bar u = x_1 - x_0$ identity is more fragile than it looks too; it relies on linear interpolation paths, and the moment you want curved schedules (which matter for sample quality at scale) the algebra stops and you are back to the integral form. And the benchmarks here are all ImageNet at 64, 256, and 512. A real one-step video model doesn’t exist yet.

My guess is the next jump is either an architecture redesign that bakes in the boundary and composition constraints, or a clean way to do guidance at one step. The compositional principle feels right. What is missing is the engineering around it.

What I find satisfying is that all five are answering one question: how do you make a long jump equal its composed shorter jumps without an integrator? The curricula, the EMA copies, the JVPs are just different ways of paying for that constraint.

References

Lipman et al., Flow Matching for Generative Modeling, 2022.
Song et al., Consistency Models, 2023.
Song & Dhariwal, Improved Consistency Training for Consistency Models, 2023.
Kim et al., Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion, ICLR 2024.
Frans et al., One Step Diffusion via Shortcut Models, ICLR 2025.
Geng et al., MeanFlow: Unified Average-Velocity Learning for Flow-Based Generative Models, 2025.
Zhang et al., α-Flow: Unifying Flow Matching and Consistency Models, 2025.
Sabour et al., Align Your Flow: Scaling Continuous-Time Flow Map Distillation, 2025.
Lu & Song, Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models, 2024.
Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models, NeurIPS 2022.
Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021.

This marginal-equivalence is the result that opened up the entire deterministic-sampler line of work in diffusion (DDIM-style integrators, exact likelihood evaluation, every method in this post). The claim: the PF-ODE shares the same time-marginal distributions $p_t(x)$ as the stochastic diffusion process for every $t$, even though one is deterministic and the other is not. The PF-ODE drift is $\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}g(t)^2 \nabla_x \log p_t(x)$, with $f$ and $g$ the SDE’s drift and diffusion coefficients. The proof is direct: both processes induce the same Fokker–Planck equation for $p_t(x)$, so any solution to one is a solution to the other at the marginal level (Song et al. 2021, Appendix D.1). Trajectories differ, marginals do not. ↩
Any source distribution works in the flow-matching framework; Gaussian is convenient because the velocity calculation collapses cleanly, not because it is required. Recent work on optimal-transport flow matching exploits this freedom directly. ↩
This is a deliberate inheritance choice, not a derivation. EDM uses these schedules to keep the network’s input and output magnitudes well-conditioned across noise levels (specifically $c_\text{skip}(t) = \sigma_\text{data}^2 / (\sigma_\text{data}^2 + t^2)$ and $c_\text{out}(t) = t \cdot \sigma_\text{data} / \sqrt{\sigma_\text{data}^2 + t^2}$ in the EDM noise parameterisation). Borrowing them lets consistency models drop into existing diffusion architectures with no structural changes, just a different head and loss. ↩
In DQN, a Q-network regressing toward a target computed by itself chases a moving goalpost and can diverge; freezing a lagged copy as the target long enough to aim at fixed the same failure mode there. ↩
Where the PDE comes from: differentiate $f(z_t, t) = f(z_{t-\Delta}, t-\Delta)$ with respect to $\Delta$ at $\Delta = 0$. The right side gives $-\partial_t f - \partial_z f \cdot (dz/dt)$. Along the PF-ODE, $dz/dt$ is the velocity $v(z_t, t)$. Setting the derivative to zero (so consistency holds for every small $\Delta$, not just at sampled pairs) gives $\partial_t f + v(z_t, t) \cdot \partial_z f = 0$. This is the transport equation for $f$ along the flow, the method-of-characteristics statement that $f$ is constant along PF-ODE trajectories. ↩
The JAX autodiff cookbook is a good primer on JVPs and forward-mode autodiff if you want a refresher. ↩
Briefly: pseudo-Huber is a smooth approximation to the Huber loss, which behaves like $L_2$ near zero and like $L_1$ further out; it cuts the variance contribution from a few large errors that otherwise destabilise consistency training. The lognormal noise schedule concentrates training samples around noise levels where the loss is most informative, instead of uniformly over $t$. Progressive step doubling runs the discretisation curriculum on a $\log_2 N$ schedule, doubling $N$ at preset training milestones rather than tuning a continuous ramp. iCT also removes the EMA on the target network (setting its decay to zero, so the target is a stop-gradient copy of the current weights rather than a slowly-tracking lag), which is a more important change than its placement in the trick list suggests. ↩
To be precise about the existential form: the theorem says that for any $\delta > 0$, there exists a consistency model $f$ with $\mathbb{E}\lVert f(z_t,t) - f^{\ast}(z_t,t)\rVert_2^2 < \delta$ uniformly in $t$, and some $N$ beyond which extra sampling steps make the generated distribution worse in $W_2$. It is therefore not a statement about every imperfect CM, but about the existence of arbitrarily-close-to-optimal ones with this pathology, which is enough to make the theorem load-bearing in the post’s argument: the failure mode is not a “you trained badly” artifact. The proof and the Gaussian assumption together yield a closed-form analysis; whether the result extends to non-Gaussian data is not formally settled. ↩
Written out, $\frac{d\bar{u}}{dt} = \frac{\partial \bar{u}}{\partial z}\cdot v(z_t, t) + \frac{\partial \bar{u}}{\partial t}$. In practice the two terms are not computed separately: forward-mode autodiff evaluates the whole total derivative in one pass by pushing the tangent $(v, 1)$ through the network, which is exactly what jvp does. ↩