One-step generative models

April 23, 2026 · machine-learning, generative-models, diffusion

I have been spending a lot of time on one-step generative models, specifically MeanFlow and the broader family it belongs to. This is my attempt to build an honest mental model of how these methods work, where the math comes from, and how each one is a response to the limitations of the one before it.

The context: diffusion and flow models can now produce images, audio, and video that are nearly indistinguishable from real data, but generating a single sample requires hundreds of sequential network evaluations. That is fine for offline synthesis. It is nearly unusable for anything interactive or real-time. The last two years have seen a serious push to fix this, and the results are surprisingly good.

Starting from the goal and working backwards: what does a network need to learn to generate in one step?


The trajectory picture

Every method in this family starts from the same setup. You define a trajectory through image space: a path from pure noise at time $t=1$ to clean data at time $t=0$. This is the probability flow ODE (PF-ODE), a deterministic path with the same marginal distributions at each time $t$ as the stochastic diffusion process but without the randomness. You can run it forwards or backwards exactly. Think of probability mass as a fluid, like shampoo bubbles slowly changing shape. The PF-ODE is the velocity field that moves each bubble smoothly, without tearing it apart; total probability mass is conserved, only the shape of the distribution changes. A point $z_t$ on a trajectory is one particle of that fluid at time $t$.

The closer $t$ is to 1, the noisier the particle. The bubble is still spread out, not yet shaped like a real image. The PF-ODE tells you how fast and in which direction to move to stay on the trajectory. Standard diffusion and flow models learn to estimate this velocity locally and integrate it step by step from noise to data.
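
Step-by-step integration is easy to state concretely. Here is a minimal sketch of Euler integration of a PF-ODE, with a hand-written one-dimensional field $v(z,t)=z$ standing in for the learned network (the specific field is my own toy choice, picked only because it has a known exact solution):

```python
import math

def euler_sample(v, z1, n_steps):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data).

    v plays the role of the learned velocity network; each Euler step moves
    z by -v * dt because generation travels toward decreasing t.
    """
    z, dt = z1, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - v(z, t) * dt
    return z

# Toy field v(z, t) = z has exact solution z(0) = z(1) * exp(-1) = 0.3679...
print(euler_sample(lambda z, t: z, 1.0, 1000))  # close to exp(-1)
print(euler_sample(lambda z, t: z, 1.0, 1))     # 0.0: one big step drifts badly
```

The point is only that accuracy comes from many small steps; everything below is about removing that loop.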

[Figure: a curved PF-ODE trajectory from a diffuse noise cloud at $t=1$ to a structured data cluster at $t=0$, with intermediate points at $t=0.8$, $0.5$, $0.2$.]
Every method in this family lives on a trajectory like this. The question is not how the trajectory is defined; it is how much of it you have to traverse at inference time.

The core problem with step-by-step integration: the local velocity at $z_t$ tells you nothing about where the trajectory ends up globally. You have to follow it closely, one small step at a time, or you drift off course and end up somewhere wrong. This is expensive.

Two strategies for escaping this:

  1. Jump to the endpoint directly. Learn a function that maps any trajectory point to $x_0$ in one shot. This is the consistency model idea.
  2. Jump to any point, not just the endpoint. Learn a two-time function that can jump from any $t$ to any $s < t$ in one step. This is the flow map idea, and it is what CTM, shortcut models, and MeanFlow all do.

Flow matching: the foundation

Before getting to one-step methods, it helps to understand flow matching, because all the one-step methods either build on it or borrow its training structure. Flow matching [1] defines simple straight-line paths between noise and data:

$$z_t \;=\; (1 - t)\, x_0 \;+\; t\, x_1$$
$x_0$ is clean data, $x_1$ is Gaussian noise, $t \in [0,1]$. At $t=0$ you have data; at $t=1$ you have noise.

The velocity along this path is constant: $v = \frac{dz_t}{dt} = x_1 - x_0$. This is directly computable from training pairs: no score estimation, no self-referential structure. You train a network to predict this velocity at every $(z_t, t)$, which is just supervised regression on a clean ground-truth target.
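
As a concreteness check, here is what one such training example looks like in a 1-D toy (the two-point data distribution is my own assumption, not anything from the papers):

```python
import random

def fm_training_example():
    """One flow matching regression example (1-D toy): sample a data point,
    a noise point, and a time; the network input is (z_t, t) and the
    regression target is the constant conditional velocity x1 - x0."""
    x0 = random.choice([-1.0, 1.0])   # toy two-point data distribution
    x1 = random.gauss(0.0, 1.0)       # Gaussian noise sample
    t = random.random()
    z_t = (1 - t) * x0 + t * x1       # point on the straight path
    target = x1 - x0                  # ground-truth velocity, no network needed
    return (z_t, t), target
```

The training loop is then ordinary regression: minimise $\lVert v_\theta(z_t, t) - (x_1 - x_0)\rVert^2$ over such samples.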

Why is inference still slow? Even though each individual path $x_0 \leftrightarrow x_1$ is a straight line, the marginal velocity field is not. At any given noisy image $z_t$, many different clean images $x_0$ are plausible, not just one. Each candidate $x_0$ has its own straight-line velocity pointing in a slightly different direction. The network has to output the probability-weighted average of all those directions, which traces a curved path through image space. Following a curved path with only local velocity information requires many small steps.
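
A tiny worked example of that averaging, with two candidate clusters at $\pm 1$ (my own toy numbers). At $t=1$ both candidates are equally weighted, so a single unit-size step along the averaged direction lands at the centroid, between the clusters:

```python
# Two data clusters at x0 = -1 and x0 = +1, equally likely (toy numbers).
# On the linear path z_t = (1 - t) x0 + t x1 the conditional velocity is
# v = x1 - x0; generation moves backwards in t: z_{t-dt} = z_t - v dt.
candidates = [-1.0, +1.0]
x1 = 0.3                       # an arbitrary pure-noise sample at t = 1

# At t = 1 the particle carries no information about its cluster, so the
# marginal field is the equal-weight average over both candidate directions.
v_bar = sum(x1 - x0 for x0 in candidates) / len(candidates)

# One unit-size step along the average direction from pure noise:
one_big_step = x1 - 1.0 * v_bar
# Lands at the centroid (0.0), exactly between the two clusters.
```

Closer to $t=0$ the weights concentrate on one candidate, the field commits, and small steps can follow the curve; that is exactly why many steps are needed.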

[Interactive figure: the marginal velocity field as $t$ and the number of candidate destinations vary. With a single destination the field is uniform everywhere, and a single step from any noise point lands exactly at $x_0$.]

The marginal velocity field $\bar{v}(z,t)$ (amber arrows) and its live weight decomposition (bottom panel). Each cluster on the right represents a region of data space — cats, dogs, cars. At $t=1$ (pure noise) the Gaussian weights $w_i \propto \exp(-\|z - z_t^{(i)}\|^2 / 2\sigma^2)$ over all clusters are nearly equal: the particle has no information about which cluster it will become, so the field averages all their directions and points toward the centroid. As $t$ decreases the weights concentrate: the particle's position becomes informative about which cluster it is heading to, and the field progressively commits. With one cluster the field is uniform and one step is exact. With multiple clusters, a single large step follows the initial average direction and lands between all clusters — the red ghost shows exactly where.

Flow matching gives you clean, stable training but slow inference. Every method we discuss next is trying to fix the inference speed without giving up that training clarity.


Consistency models

Consistency models [2] are the first serious attack on the inference problem. Rather than learn the velocity and integrate it, learn a function that maps any point on the trajectory directly to the clean endpoint $x_0$:

$$f_\theta(z_t,\, t) \;=\; x_0 \qquad \text{for all } t \text{ on the same trajectory}$$

Apply this once from pure noise, and you get a clean image. One network call, done.

For this to actually work, the function needs two properties. First, the boundary condition: at $t = 0$ (or more precisely, a small cutoff $\varepsilon$ near zero), the function must be the identity, $f(x_0, \varepsilon) = x_0$. A completely clean image maps to itself. This is enforced architecturally, typically by a skip connection that activates near zero. Without it, the network could satisfy the rest of the loss by outputting a constant; the boundary condition pins one end of the function to something meaningful.
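
Concretely, the boundary condition is usually baked in with time-dependent skip coefficients, $f_\theta(x,t) = c_\text{skip}(t)\,x + c_\text{out}(t)\,F_\theta(x,t)$ with $c_\text{skip}(0)=1$ and $c_\text{out}(0)=0$. The coefficient shapes below are one simple choice in the spirit of the EDM-style parameterisation, not the exact ones from any particular paper:

```python
def consistency_output(F, x, t):
    """Boundary-respecting parameterisation: f(x, t) = c_skip(t) x + c_out(t) F(x, t).

    c_skip(0) = 1 and c_out(0) = 0 force f(x, 0) = x no matter what the free
    network F outputs. The exact coefficient shapes vary by paper; these are
    illustrative, with an assumed scale hyperparameter sigma.
    """
    sigma = 0.5
    c_skip = sigma**2 / (t**2 + sigma**2)
    c_out = t * sigma / (t**2 + sigma**2) ** 0.5
    return c_skip * x + c_out * F(x, t)
```

At $t=0$ the output is the input, so the constant-output failure mode is ruled out at the boundary by construction, not by the loss.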

Second, the consistency condition: any two points on the same PF-ODE trajectory must map to the same $x_0$. If two different noisy versions of the same clean image both pass through the network, they should produce identical outputs. This is the key constraint that makes the function globally coherent rather than just locally trained.

[Figure: a curved PF-ODE trajectory with points at $t=1$, $0.7$, $0.4$, $0.1$; dashed arrows from each point all converge on the same endpoint $x_0$, illustrating $f(z_t, t) = x_0$ for every $t$ on the trajectory.]
The consistency condition says all these dashed arrows must land at exactly the same $x_0$. The function is consistent across the whole trajectory, not just at individual points.

Enforcing consistency via self-distillation

Here is the problem: you never directly observe which trajectory any $z_t$ belongs to. You cannot enumerate all the $(z_t, z_s)$ pairs that should agree. What you can do is take two adjacent points on the same trajectory, $z_t$ and $z_{t-\Delta}$ separated by a small step, and ask that their predictions agree:

$$\mathcal{L}_\text{CD} \;=\; \mathbb{E}\,\bigl\lVert f_\theta(z_t,\,t) \;-\; \operatorname{sg}\!\bigl(f_{\theta^-}(z_{t-\Delta},\,t-\Delta)\bigr) \bigr\rVert^2$$
sg = stop-gradient. $\theta^-$ = EMA copy of $\theta$, updated slowly as $\theta^- \leftarrow m\,\theta^- + (1-m)\,\theta$.

The EMA copy $\theta^-$ is updated slowly after each training step, typically $m \approx 0.99$, so the target moves at roughly 1% of the speed of the main network. This keeps the target stable enough to learn against. Without it, both sides of the loss update simultaneously and they can easily converge to the same wrong answer: outputting a constant everywhere, which technically satisfies the loss but is completely useless for generation.

The stop-gradient on the target side breaks this symmetry. Gradients only flow through the left side of the loss, so only $\theta$ is updated to chase the target. The target then drifts slowly via the EMA rule. This is the same target network trick from deep RL; it is what makes self-distillation stable.
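
The EMA rule itself is one line. A sketch on a toy two-parameter "network" (purely illustrative), showing how slowly the target covers ground at $m = 0.99$:

```python
def ema_update(theta_ema, theta, m=0.99):
    """theta_ema <- m * theta_ema + (1 - m) * theta, elementwise."""
    return [m * te + (1.0 - m) * t for te, t in zip(theta_ema, theta)]

theta = [1.0, -2.0]           # online weights, held fixed here for clarity
target = [0.0, 0.0]           # EMA target weights
for _ in range(100):
    target = ema_update(target, theta)
# After 100 updates the target has closed only 1 - 0.99**100, about 63%,
# of the gap to the online weights: slow enough to learn against.
```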

To get the adjacent point $z_{t-\Delta}$ on the same trajectory as $z_t$, you need to take one step of the PF-ODE. In consistency distillation (CD), you use a pretrained diffusion model to do this and the teacher provides reliable ODE steps. In consistency training (CT), you estimate the score from scratch, which introduces additional noise and is harder to stabilise.

The discretisation curriculum and why it is not optional

One subtlety about consistency model training that matters. You divide the time axis into $N$ discrete steps. Adjacent training pairs are always one step apart, so the gap $\Delta = T/N$.

If $N$ is small, the gap is large. The two adjacent points are far apart on the PF-ODE trajectory. The training signal is strong (there is a lot of distance between the two predictions to align) but the targets are noisy. Taking a large step along the PF-ODE introduces large discretisation error, so $z_{t-\Delta}$ is only approximately on the right trajectory. You are training the network to agree with a somewhat wrong target.

If $N$ is large, the gap is small. The targets are very accurate (a tiny ODE step is nearly exact) but the training signal is weak. The two adjacent points are so close that their predictions are already similar. The loss gradient is tiny and training makes almost no progress.

Neither extreme works. The fix is a curriculum: start with small $N$ (coarse, strong signal, rough targets), then progressively increase $N$ (fine, weak signal, accurate targets). The network first learns a rough consistency function, then refines it.
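
The curriculum can be as simple as doubling $N$ at fixed checkpoints. A sketch in the spirit of iCT's step doubling; the constants (10 to 1280, evenly spaced doublings) are illustrative, not the paper's exact schedule:

```python
import math

def n_steps(k, total_iters, n_start=10, n_end=1280):
    """Discretisation curriculum: N doubles at evenly spaced checkpoints,
    from n_start early in training to n_end at the end. Illustrative
    constants, not iCT's exact schedule."""
    doublings = int(math.log2(n_end / n_start))          # 7 doublings here
    stage = min(doublings, k * (doublings + 1) // total_iters)
    return n_start * 2**stage
```

Early iterations use coarse steps (strong signal, rough targets); late iterations use fine steps (weak signal, accurate targets), matching the coarse-to-fine story above.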

[Interactive figure: Euler steps along the curved PF-ODE trajectory for varying $N$. At $N=1$, one tangent step covers the whole trajectory and lands far from the true curve: large training signal, noisy target.]

The PF-ODE trajectory (grey) curves because the marginal velocity field curves. At each marked time step, the tangent arrow shows the local velocity. A single Euler step along that tangent departs from the true curve; the error is the red gap. More steps reduce the gap but shrink the per-step gradient.

Limitations

Consistency models prove the point: you can generate decent images in one step. That was not obvious before 2023. The original paper achieved FID 3.55 on CIFAR-10 and 6.20 on ImageNet 64×64 in a single step. iCT [3] improved this substantially with pseudo-Huber losses, a lognormal noise schedule, and progressive discretisation step doubling, reaching FID 2.51 (CIFAR-10) and 3.25 (ImageNet 64×64) in a single step, a 3-4× improvement over the original. But even these required considerable engineering effort just to be reliable.

The training target is always behavioural: it constrains what the network outputs at adjacent pairs of points, not what the underlying field should be. There is no ground truth for $f(z_t, t)$ that exists independently of the network. The optimal function is defined only implicitly, via the consistency condition and boundary condition, and can only be learned by having the network agree with itself across adjacent pairs. This is inherently noisy and sensitive to hyperparameters.

The deeper limitation: consistency models are stuck. They can only jump to one destination, the endpoint $x_0$. The function signature is $f(z_t, t) = x_0$; you tell it where you are and what time it is, and it predicts the endpoint. You cannot ask it to jump to an intermediate point. Multi-step generation therefore requires running the network multiple times and renoising between each evaluation, a clunky workaround that does not actually use the trajectory structure.
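
For completeness, the renoising loop looks like this. The re-noising rule below follows this post's linear path $z_t = (1-t)x_0 + t x_1$ rather than the variance-exploding schedule of the original paper, and `f` stands in for a trained consistency model:

```python
import random

def multistep_cm_sample(f, z1, ts=(1.0, 0.5, 0.2)):
    """Multi-step consistency sampling: jump to x0, re-noise to the next
    (lower) time, jump again. f(z, t) -> x0 is the consistency model;
    the linear re-noising rule matches this post's flow-matching path,
    not any specific paper's noise schedule."""
    z = z1
    for i, t in enumerate(ts):
        x0 = f(z, t)                       # jump straight to the endpoint
        if i + 1 < len(ts):
            s = ts[i + 1]
            eps = random.gauss(0.0, 1.0)   # fresh noise each round
            z = (1.0 - s) * x0 + s * eps   # re-noise to the smaller time s
    return x0
```

Note the fresh noise at each round: the intermediate points are not on any single trajectory, which is why this is a workaround rather than a use of the trajectory structure.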


Consistency trajectory models: any-to-any jumps

CTM [4] generalises consistency models in one clean move. Recall the PF-ODE trajectory we defined: a path parameterised by time, running from noise at $t=1$ to data at $t=0$. Consistency models always jump to the end of that path. CTM removes that restriction and learns a function that can jump to any point along the PF-ODE, not just the endpoint:

$$G_\theta(x_t,\, t,\, s) \;=\; x_s$$
From any point $x_t$ at time $t$, jump to the state $x_s$ at time $s$. Consistency models are the special case $s=0$.

This is the two-time function: a completely flexible jump operator. You can take large steps or small steps, jump to any intermediate point on the trajectory, and compose multiple jumps to refine a generation. Consistency models are just the special case where you always set $s = 0$.

The key constraint that makes this well-posed is the semigroup property. If you jump from $t$ to some intermediate $u$, and then jump from $u$ to $s$, you should get the same result as jumping directly from $t$ to $s$:

$$G_\theta(x_t,\, t,\, s) \;=\; G_\theta\!\bigl(G_\theta(x_t,\, t,\, u),\; u,\; s\bigr) \qquad \text{for any } u \in (s,t)$$
One large jump = two composed smaller jumps. This is the composition rule at the heart of every flow-map method.
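
On a single straight-line conditional path the exact flow map is available in closed form, which makes the semigroup property easy to verify numerically. The learned marginal map is much more complicated than this; the sketch only illustrates the composition rule itself:

```python
x0, x1 = 2.0, -1.0                 # one (data, noise) training pair

def G(z, t, s):
    """Exact flow map along the straight conditional path: the velocity is
    the constant x1 - x0, so jumping from time t to time s just adds
    (s - t) * (x1 - x0). Single-trajectory case, for illustration only."""
    return z + (s - t) * (x1 - x0)

z_t = (1 - 0.9) * x0 + 0.9 * x1    # a point on the path at t = 0.9
direct = G(z_t, 0.9, 0.1)
composed = G(G(z_t, 0.9, 0.4), 0.4, 0.1)
assert abs(direct - composed) < 1e-9     # semigroup property holds
```

Both routes land at $z_{0.1} = 0.9\,x_0 + 0.1\,x_1 = 1.7$, whichever intermediate time you split at.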
[Interactive figure: drag the split point $u$; both routes (direct and two-step) always land at the same destination.]

The semigroup property: one jump from $t$ to $s$ equals two composed jumps through any intermediate $u$. Every flow-map method enforces exactly this constraint.

Training enforces this by sampling triples $(r, s, t)$ with $r < s < t$ and comparing the direct jump $G(x_t, t, r)$ against the composed two-step jump:

$$\mathcal{L}_\text{CTM} \;=\; \mathbb{E}\,\bigl\lVert G_\theta(x_t,\,t,\,r) \;-\; \operatorname{sg}\!\bigl(G_{\theta^-}\!\bigl(G_{\theta^-}(x_t,\,t,\,s),\;s,\;r\bigr)\bigr) \bigr\rVert^2$$
Same stop-gradient and EMA trick as consistency models. $\theta^-$ is used for both the inner and outer jump on the target side.

CTM achieves FID 1.73 on CIFAR-10 and 1.92 on ImageNet 64×64 in a single step, the best single-step results at the time by a significant margin. The flexible jump function also makes multi-step generation more natural: you just chain calls with progressively smaller target times, no renoising needed.

The limitation it inherits from consistency models: the training target is still self-referential. $G_{\theta^-}$ is the network evaluated at a slightly lagged version of itself. There is no ground-truth two-time map that exists independently of the network; the only supervision comes from the model agreeing with itself across different decompositions. This makes training more stable than consistency models (because the semigroup structure is richer), but the fundamental self-referential nature remains. CTM also required adversarial training as a core component to reach its best numbers, which limits how cleanly it scales.


Align Your Flow: distilling the jump function

Align Your Flow [8] works with exactly the same object as CTM: a two-time network $f_\theta(x_t, t, s) = x_s$ that jumps from any noise level $t$ to any cleaner level $s$ in one forward pass. The difference is not the architecture or the mathematical object being learned; it is the training regime. Instead of training from scratch with self-referential targets, AYF distills from a pretrained teacher: the consistency loss is replaced by an objective that asks the student’s jump function to agree with one-step moves of the teacher’s probability-flow ODE.

[Figure: overview of flow maps from Align Your Flow (Sabour et al., 2025). Three panels show a consistency model ($s=0$), a flow map (any $s,t$), and flow matching ($s \to t$), with their respective training objectives below.]
Figure 2 from Sabour et al. (2025). Flow maps generalise both consistency models and flow matching by connecting any two noise levels $(s, t)$ in a single step. Setting $s=0$ recovers a consistency model; letting $s \to t$ recovers standard flow matching. (Image: arXiv 2506.14603, reproduced for educational commentary.)

This framing resolves something that was implicit in CTM but never fully confronted. CTM trains the jump function so that composed shorter jumps reproduce longer ones, but it never asks whether the jump function is actually correct, only whether it is internally consistent. A pretrained teacher changes that: the teacher’s ODE trajectories are ground truth, and the student’s jumps are trained to trace them. Internal consistency is still enforced, but now there is an external anchor.

The paper also proves something sharp about consistency models: for any non-optimal consistency model, there exists a step count beyond which adding more steps monotonically worsens FID. This is not a failure of implementation; it is a structural consequence of the self-referential training target accumulating errors under composition. Flow maps trained against a teacher do not have this property. Adding steps always helps or is neutral, because each additional step is a further correction toward the teacher’s trajectory rather than a compounding of self-generated error.

In practice this matters a lot. AYF-S, a 280M-parameter model, reaches FID 1.25 at 2 NFE on ImageNet 64×64, and FID 1.70 at 4 NFE on ImageNet 512×512 in 0.24 seconds. The comparable distillation baseline (sCD-XXL, 1.5B parameters, 2 NFE) achieves FID 1.88 in 0.50 seconds: larger model, more compute, worse result. The efficiency gain comes directly from the teacher anchor: the student does not have to waste capacity on self-consistency at high noise levels where the self-referential signal is noisy and unreliable.


Shortcut models

CTM’s training samples random triples $(r, s, t)$, which requires managing a flexible two-time input and coordinating two separate network evaluations for the composed target. Shortcut models [5] ask: what is the simplest possible way to enforce the semigroup property?

The answer: condition the network on both the current noise level $t$ and the desired step size $d$. The network learns to predict where you will end up after a jump of size $d$ from $z_t$. The step size is an input, not a fixed constant.

$$v_\theta(z_t,\, t,\, d) \;\approx\; \frac{z_t - z_{t-d}}{d}$$
Predict the average displacement per unit time over a step of size $d$. At $d \to 0$ this recovers instantaneous velocity.

Think of it like this. A regular flow matching network only knows “I am at noise level $t$.” A shortcut model knows “I am at noise level $t$, and I want to travel a distance of $d$ in one step.” With that extra information, it can calibrate its prediction to the correct jump size and can be asked to take different step sizes at different points during generation.

The bootstrapping training procedure

How do you train this? You cannot compute the ground-truth $z_{t-d}$ directly, because it would require running the full ODE. But you can enforce the semigroup property discretely:

$$v_\theta(z_t,\,t,\,d) \;=\; \tfrac{1}{2}\Bigl(v_{\theta^-}\bigl(z_t,\,t,\,\tfrac{d}{2}\bigr) \;+\; v_{\theta^-}\bigl(z_{t-d/2},\,t-\tfrac{d}{2},\,\tfrac{d}{2}\bigr)\Bigr)$$
One large step of size $d$ = two composed half-steps of size $d/2$. For this average-velocity parameterisation, composing two equal half-steps means averaging their predicted velocities.

The right side is computable: take a half-step along the PF-ODE using the EMA network to reach $z_{t-d/2}$, then take another half-step from there. Compose the two predictions into a target for the full step. The EMA network on the right side plays the same stabilising role as in consistency models; it moves slowly so the target does not chase itself.

In practice, training proceeds by starting with small steps (where the half-step approximation is accurate) and progressively training larger steps using the smaller steps as building blocks. The network learns to predict one-step jumps first, then two-step, then four-step, and so on, bootstrapping upward. At inference, you can choose any step count: one step for speed, many steps for quality.
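
In code, building the bootstrapped target is a few lines: take two EMA half-steps and average their predicted velocities (averaging is, to my understanding, how the shortcut paper composes two equal half-steps; `v_ema` here stands in for the EMA network):

```python
def shortcut_target(v_ema, z_t, t, d):
    """Build the self-distillation target for a full step of size d from
    two EMA half-steps: the mean of the two half-step velocities is the
    average velocity over the whole step. Sketch only; v_ema(z, t, d)
    stands in for the EMA copy of the step-conditioned network."""
    v1 = v_ema(z_t, t, d / 2)
    z_mid = z_t - (d / 2) * v1             # Euler half-step toward the data
    v2 = v_ema(z_mid, t - d / 2, d / 2)
    return (v1 + v2) / 2                   # target for v_theta(z_t, t, d)
```

The target is then wrapped in a stop-gradient and regressed against, exactly as in the consistency losses above.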

[Figure: three points on a noise timeline, $z_t$ (start), $z_{t-d/2}$ (halfway), $z_{t-d}$ (landing point). The EMA network takes two half-steps; the student takes the full step in one shot. Half-steps teach full steps.]
The EMA network (solid arcs) takes two half-steps to produce a composed target. The student network (dashed arc) is trained to match this in a single step. Same stop-gradient / EMA pattern as consistency models, applied to the composition property.

Shortcut models avoid the JVP computation that MeanFlow requires. The composition is enforced by direct comparison of network outputs, not by differentiating through the network. This is simpler and faster per training step. The tradeoff: the discrete half-step approximation introduces small errors that compound when you compose many steps. The continuous formulation of MeanFlow avoids this, but pays the JVP cost to do so.


MeanFlow: ground truth for the jump function

Now we have the setup to understand what makes MeanFlow [6] special. Every method so far has the same fundamental limitation: the training target is self-referential. You are always asking the network to agree with itself, at adjacent points, or at composed decompositions. There is no independent ground truth for what the jump function should output.

MeanFlow’s answer is the average velocity, a quantity that has a ground-truth value computable directly from data, with no self-reference required.

Average velocity: a ground-truth two-time quantity

Define the average velocity over an interval $[r, t]$ starting from state $z_t$ as simply displacement divided by time:

$$\bar{u}(z_t,\, r,\, t) \;=\; \frac{z_t - z_r}{t - r}$$
$z_r$ is where you would land if you followed the PF-ODE from $z_t$ back to time $r$. Average velocity = distance ÷ time, the same definition as in physics.

For the linear interpolation paths of flow matching, this simplifies completely. Since $z_t = (1-t)x_0 + tx_1$ and $z_r = (1-r)x_0 + rx_1$:

$$\bar{u}(z_t,\, r,\, t) \;=\; \frac{z_t - z_r}{t - r} \;=\; x_1 - x_0$$
Independent of $r$ and $t$: the average velocity over any interval is just $x_1 - x_0$, computable directly from training data.

The average velocity does not depend on which interval you are looking at. It is a fixed quantity for each training pair $(x_0, x_1)$, readable directly from data. No network evaluation, no self-reference, no approximation. One-step generation falls out immediately:

$$x_0 \;=\; z_1 \;-\; \bar{u}_\theta(z_1,\, 0,\, 1)$$
Start from pure noise $z_1$ at $t=1$, subtract the average velocity, arrive at $x_0$. One network call.
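
Both facts, interval-independence and exact one-step recovery, are easy to check numerically for a single $(x_0, x_1)$ pair. Note this is the conditional case; the marginal average velocity the network must learn is not constant:

```python
x0, x1 = 2.0, -1.0                        # one (data, noise) pair

def z(t):                                 # linear flow-matching path
    return (1 - t) * x0 + t * x1

def u_bar(t, r):                          # average velocity over [r, t]
    return (z(t) - z(r)) / (t - r)

# The average velocity is x1 - x0 on every interval of this path...
for (t, r) in [(1.0, 0.0), (0.9, 0.3), (0.5, 0.4)]:
    assert abs(u_bar(t, r) - (x1 - x0)) < 1e-9

# ...so one step from pure noise recovers the data exactly.
assert abs((z(1.0) - u_bar(1.0, 0.0)) - x0) < 1e-9
```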

The challenge is training a network to predict $\bar{u}$ correctly for all $(z_t, r, t)$ triples, not just the full-interval case. The definition gives a ground-truth target, but computing it directly during training is intractable.

The MeanFlow identity: turning the definition into a training signal

Computing $z_r$ requires running the PF-ODE from $z_t$ to time $r$, exactly the slow integration we are trying to avoid. MeanFlow gets around this by deriving an equivalent form that does not require $z_r$ at all. Start from the definition rewritten as an integral:

$$(t - r)\cdot \bar{u}(z_t,\, r,\, t) \;=\; \int_r^t v(z_\tau,\, \tau)\, d\tau$$
Average velocity × time = total displacement = integral of instantaneous velocity. Think: distance = speed × time.

Now differentiate both sides with respect to $t$. The right side uses the fundamental theorem of calculus; the left side uses the product rule:

$$\bar{u}(z_t,\, r,\, t) \;=\; v(z_t,\, t) \;-\; (t - r)\cdot \frac{d\bar{u}}{dt}$$
The MeanFlow identity. $v(z_t, t)$ is the instantaneous flow matching velocity (ground truth from data). $d\bar{u}/dt$ is the total time derivative of the network output.

This identity gives a target for $\bar{u}$ that involves no integrals and no ODE simulation. The right side has two pieces: $v(z_t, t)$ is the standard flow matching velocity (the same clean ground-truth target that flow matching uses), and $d\bar{u}/dt$ is the derivative of the network’s own output with respect to the time conditioning. The identity is exact, not an approximation.

Computing $d\bar{u}/dt$: the Jacobian-vector product

The term $d\bar{u}/dt$ is a total derivative: it measures how the network output changes as $t$ increases, accounting for two effects simultaneously, the explicit dependence on $t$ as a conditioning input and the implicit dependence through $z_t$ (which moves along the flow as $t$ changes). Expanding via the chain rule:

$$\frac{d\bar{u}}{dt} \;=\; \frac{\partial \bar{u}}{\partial z}\cdot v(z_t,\,t) \;+\; \frac{\partial \bar{u}}{\partial t}$$
First term: Jacobian of $\bar{u}$ w.r.t. its input $z$, multiplied by the velocity vector (a JVP). Second term: explicit partial derivative w.r.t. $t$.

The first term is a Jacobian-vector product (JVP): the Jacobian of the network output with respect to its input $z$, dotted with the velocity vector $v(z_t, t)$. This is computed via forward-mode automatic differentiation, a single modified forward pass through the network. In PyTorch: torch.func.jvp. In JAX: jax.jvp. The overhead is roughly 20% compared to standard flow matching training, which is modest; far less than running a full second forward pass for a teacher network.
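
A toy check of this chain-rule decomposition, with a closed-form scalar $u(z,t) = t z^2$ standing in for the network. In real MeanFlow training the first term would come from a single forward-mode call such as torch.func.jvp, with tangent $v$ on the $z$ input and $1$ on the $t$ input; here the partials are written by hand:

```python
# Toy check of du/dt = (du/dz) * v + du/dt_explicit along a flow line.
# u(z, t) = t * z**2 stands in for the network output; z moves along
# z(t) = z0 + t * v, so dz/dt = v plays the role of the velocity field.
z0, v = 1.5, -3.0

def u(z, t):
    return t * z * z

def z_of_t(t):
    return z0 + t * v

t = 0.7
zt = z_of_t(t)
total = 2 * t * zt * v + zt * zt          # (du/dz) * v  +  du/dt_explicit

h = 1e-6                                  # finite-difference cross-check
fd = (u(z_of_t(t + h), t + h) - u(z_of_t(t - h), t - h)) / (2 * h)
assert abs(total - fd) < 1e-4
```

The two effects in the prose map directly onto the two terms: the first tracks the input drifting along the flow, the second the explicit time conditioning.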

The full training loss applies stop-gradient to the entire target to avoid second-order gradients:

$$\mathcal{L}_\text{MF} \;=\; \mathbb{E}\,\bigl\lVert \bar{u}_\theta(z_t,\,r,\,t) \;-\; \operatorname{sg}\!\Bigl[v_\text{FM}(z_t,t) \;-\; (t-r)\cdot\tfrac{d\bar{u}_\theta}{dt}\Bigr] \bigr\rVert^2$$
Stop-gradient prevents gradients from flowing through the target. $v_\text{FM}$ is the flow matching ground-truth velocity.

The TFM/TC conflict in MeanFlow training

MeanFlow looks clean on paper: ground-truth target, exact identity, minimal overhead. In practice, the training has an interesting internal structure that causes problems.

The 75% border case

When $r = t$, the interval collapses to a single point, and the average velocity over a zero-length interval equals the instantaneous velocity. The MeanFlow identity reduces to $\bar{u} = v$, and the loss becomes exactly the standard flow matching loss. MeanFlow uses $r = t$ for 75% of training samples. Why spend three-quarters of training on the degenerate case that ignores the average velocity entirely?

The α-Flow paper [7] explains this by showing the MeanFlow loss decomposes into two components:

$$\mathcal{L}_\text{MF} \;=\; \mathcal{L}_\text{TFM} \;+\; \mathcal{L}_\text{TC}$$
TFM = trajectory flow matching (data-supervised). TC = trajectory consistency (JVP-based). The 75% border case is dominated by TFM.

TFM is the flow matching component. It pushes the network to correctly predict the instantaneous velocity field: purely data-supervised, stable, converges quickly. The 75% sampling ensures TFM dominates early training.

TC is the consistency enforcement component. It uses the JVP to ensure predictions compose correctly across intervals. This is the part that gives MeanFlow its structure beyond plain flow matching. But TC depends on a JVP computed through the network, which is noisy at high noise levels.

Why TC is noisy at high $t$

The TC gradient uses the JVP: the Jacobian of the network output with respect to input $z_t$, multiplied by the velocity vector. At high $t$ (near pure noise) the input carries almost no semantic signal. The network’s weights at this early stage are unstructured. The Jacobian of an unstructured network with respect to its input is essentially random: large in magnitude, arbitrary in direction. Multiplying this random matrix by the velocity vector produces a JVP that points nowhere useful.

The consequence: TC gradients at high $t$ early in training are large, random vectors. They actively conflict with TFM gradients. α-Flow [7] measured this directly: the cosine similarity between TFM and TC gradient vectors is strongly negative early in training. They are pulling the network in opposite directions.

The fix is the $\lambda$ curriculum. The parameter $\lambda \in [0,1]$ interpolates between pure TFM ($\lambda=0$) and full MeanFlow ($\lambda=1$). Start training with $\lambda=0$, which is just flow matching, completely stable. As training progresses and the velocity field converges, the network starts genuinely learning the PF-ODE structure: the Jacobian at high $t$ begins to encode which direction the trajectory is heading, and the JVP becomes a reliable signal rather than noise. Then increase $\lambda$ to bring TC online. By the time TC is fully active, the network has enough PF-ODE structure that the JVP actually means something.
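
The ramp itself can be any monotone schedule. A sketch with made-up constants; the α-Flow paper's actual schedule shape and numbers may differ:

```python
def lam(step, warmup=10_000, ramp=40_000):
    """Lambda curriculum sketch: pure flow matching (lambda = 0) during a
    warmup phase, then a linear ramp to full MeanFlow (lambda = 1).
    Constants are illustrative, not the paper's."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)
```

The per-step loss would then be $(1-\lambda)\,\mathcal{L}_\text{TFM} + \lambda\,\mathcal{L}_\text{MF}$ or similar, with $\lambda$ read off this schedule.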

Same coarse-to-fine principle as the discretisation curriculum in consistency models, applied to a continuous parameter. Stabilise the data-supervised component first; introduce the self-referential component after the foundation is solid.


A unified view

Every method we have discussed enforces the same composition rule (that a long jump should equal composed shorter jumps) at different levels of precision and with different tradeoffs.

[Interactive figure: the method family arranged along an axis from $\alpha=0$ (stable) to $\alpha=1$ (expressive), with per-method details on click.]
The full family, unified. Each method enforces the composition rule more strictly than the one to its left, at increasing cost. MeanFlow is the only one with both a ground-truth target and one-step inference, but pays with the JVP and the TFM/TC conflict.

α-Flow [7] formalises this: all four methods are special cases of one parameterised objective. The parameter $\alpha$ interpolates between pure flow matching ($\alpha=0$, no composition) and full MeanFlow ($\alpha=1$, continuous composition). The general principle behind every scheduling trick we have seen is to start at $\alpha=0$ and anneal toward 1: the discretisation curriculum in consistency models, the 75% border-case sampling in MeanFlow, the $\lambda$ ramp-up. All of these are different parameterisations of the same idea: learn the stable data-supervised component first, then progressively enforce the self-referential consistency component.


Results and current state

The one-step generation problem is solved in the sense that it works. These numbers did not exist two years ago, and the gap with multi-step diffusion continues to close.

| Method | NFE | Benchmark | FID ↓ | Training |
|---|---|---|---|---|
| Consistency Models [2] | 1 | CIFAR-10 | 3.55 | distillation / from scratch |
| iCT [3] | 1 | CIFAR-10 / IN-64 | 2.51 / 3.25 | from scratch |
| CTM [4] | 1 | CIFAR-10 / IN-64 | 1.73 / 1.92 | distillation + adversarial |
| MeanFlow [6] | 1 | IN-256 | 3.43 | from scratch |
| α-Flow [7] | 1 / 2 | IN-256 | 2.58 / 2.15 | from scratch (DiT) |
| Align Your Flow [8] | 1 | IN-64 | 2.98 | distillation (EMD) + optional adversarial |
| Align Your Flow [8] | 2 | IN-64 | 1.25 | distillation (EMD) + optional adversarial |
| Align Your Flow [8] | 4 | IN-512 (AYF-S, 280M) | 1.70 (0.24 s) | distillation (EMD) + optional adversarial |

IN = ImageNet. NFE = network function evaluations. AYF-S at 4 NFE (FID 1.70, 0.24s) outperforms sCD-XXL at 2 NFE (FID 1.88, 0.50s) using 5× fewer parameters.

What I find most satisfying about this whole family is that the composition rule is the single unifying principle, even though it can look like a different trick in each paper. Every method is a different answer to the same question: how do you enforce that a long jump equals composed shorter jumps, while keeping training tractable? Consistency models do it globally via self-distillation. CTM does it for any pair of times. Shortcut models do it discretely with step-size conditioning. MeanFlow does it continuously via calculus. Align Your Flow shows the same ideas transfer cleanly to distillation from pretrained teachers at scale. Once you see this, the curricula, the EMA teachers, the JVP, the 75% border-case sampling, the tangent warmup: all of it falls into place.


References

  1. Lipman et al., Flow Matching for Generative Modeling, 2022.
  2. Song et al., Consistency Models, 2023.
  3. Song & Dhariwal, Improved Consistency Training for Consistency Models, 2023.
  4. Kim et al., Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion, ICLR 2024.
  5. Frans et al., One Step Diffusion via Shortcut Models, ICLR 2025.
  6. Geng et al., MeanFlow: Unified Average-Velocity Learning for Flow-Based Generative Models, 2025.
  7. Zhang et al., α-Flow: Unifying Flow Matching and Consistency Models, 2025.
  8. Sabour et al., Align Your Flow: Scaling Continuous-Time Flow Map Distillation, 2025.