Flow Matching (from an applied robot learning perspective)

November 16, 2025

I wrote this because I find the concept of flow matching so satisfying, both theoretically and empirically. I will walk through its basic theoretical derivation and recent variants, before finally running them on a toy example.

First, it helps to understand what a generative model is in the context of robotics. It is easy to see what that means in other areas: in vision, a model “generates” images, and in language, a model “generates” text. In robotics, we want to “generate” actions that, if sent to a robot, would allow it to interact with the world in a purposeful way.

Flow matching is seen today as the state-of-the-art method for action generation. Compared to ACT, it can capture multi-modal behaviors rather than collapsing to a single mode. Unlike VQ-BeT, it operates directly in continuous action space. And relative to its closest predecessor, ADP, it offers faster and more stable inference.

Let's take a look at what typical flow-matching-based closed-loop inference looks like at a high level, and build backwards from there:

  1. Robot captures observations $o_i$ from its environment.

  2. The observation $o_i$ gets encoded into a compact latent vector through an encoder, $z_i = E_\phi(o_i)$.

  3. A random action $a_{i,0}$ is sampled from a predetermined distribution $p_0(a)$.

  4. The sampled action $a_{i,0}$ is evolved using a velocity field $v_\theta(t, a_{i,t}, z_i)$, integrated from $t=0$ to $t=1$.

  5. The final evolved action $a_{i,1}$ is sent to the robot for execution (sent open-loop if it's a chunk of actions).

  6. Go to step 1
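To make the loop above concrete, here is a minimal NumPy sketch of a single pass through steps 1 to 5. Everything in it is an assumption made for illustration: `encode` and `velocity` are hypothetical stand-ins for the encoder and the learned velocity field (a trained network would replace both), the prior is taken to be a standard Gaussian, and the integration uses 10 forward-Euler steps.

```python
import numpy as np

ACTION_DIM = 2
NUM_STEPS = 10  # number of Euler steps between t=0 and t=1 (an assumption)

def encode(observation):
    """Hypothetical stand-in for the encoder E_phi(o_i)."""
    return np.asarray(observation, dtype=float)

def velocity(t, a, z):
    """Hypothetical stand-in for the learned field v_theta(t, a, z).

    It pushes the action straight toward a target read off the latent z.
    A trained network would replace this function.
    """
    target = z[:ACTION_DIM]
    return (target - a) / (1.0 - t)

def generate_action(observation, rng, num_steps=NUM_STEPS):
    z = encode(observation)                  # step 2: encode the observation
    a = rng.standard_normal(ACTION_DIM)      # step 3: a_{i,0} ~ p_0, assumed N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):               # step 4: forward-Euler integration
        t = k * dt                           # t stays below 1, so no division by zero
        a = a + dt * velocity(t, a, z)
    return a                                 # step 5: a_{i,1}, ready to execute

rng = np.random.default_rng(0)
observation = np.array([0.5, -0.25, 1.0])   # step 1: a pretend sensor reading
action = generate_action(observation, rng)
print(action)
```

With this particular stand-in field the integration ends at the target encoded in `z`. With a learned field, `num_steps` trades inference speed against integration accuracy.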

If we think of our expert actions $a_{i,1}$ as coming from an unknown target distribution $p_1(a)$, and of the initial action samples $a_{i,0}$ from which we start the evolution as drawn from a known prior distribution $p_0(a)$, then we can define a time-dependent probability density path $p_t(a)$ that connects both distributions smoothly as follows:

$$p : [0,1] \times \mathbb{R}^d \to \mathbb{R}_{>0}.$$

This simply means that at $t=0$ the distribution is our predefined prior $p_0(a)$, and at $t=1$ it is the target distribution $p_1(a)$ that we assume represents the expert actions. As $t$ increases from 0 to 1, samples from the distribution $p_t(a)$ gradually evolve from random noise toward our expert-like actions.

Since we know what the distribution is at $t=0$ (we predefine it), we need a mechanism that can continuously push it along a path toward the expert distribution as time progresses. This can be done by defining a velocity field $v_t(a)$, which specifies for every point $a$ in the action space and time $t$ an instantaneous direction so that the overall distribution gradually evolves toward $p_1(a)$. This velocity field is sufficient to construct a time-dependent flow $\phi : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, defined by the following ordinary differential equation (ODE):

$$\begin{aligned}
\frac{d}{dt}\phi_t(a) &= v_t(\phi_t(a)), \\
\phi_0(a) &= a.
\end{aligned}$$
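As a sanity check on what these two equations say, the sketch below pushes samples through a hand-picked velocity field with forward Euler. The field $v_t(a) = -a$ is a toy choice for illustration, not a learned model. Its flow is known in closed form, $\phi_t(a) = a\,e^{-t}$, so a standard Gaussian at $t=0$ should shrink to a standard deviation of about $e^{-1}$ at $t=1$.

```python
import numpy as np

def velocity(t, a):
    """Toy velocity field v_t(a) = -a (hypothetical, picked for its closed-form flow)."""
    return -a

def flow(a, t_end, num_steps=1000):
    """Approximate phi_{t_end}(a) by forward-Euler integration of d/dt phi_t = v_t(phi_t)."""
    a = np.array(a, dtype=float)   # copy the input, so the caller's samples stay untouched
    dt = t_end / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(k * dt, a)
    return a

rng = np.random.default_rng(0)
samples_t0 = rng.standard_normal(10_000)    # samples from p_0 = N(0, 1)
samples_t1 = flow(samples_t0, t_end=1.0)    # the same samples pushed to t = 1

# Ratio of standard deviations, to compare against the closed-form value exp(-1).
std_ratio = samples_t1.std() / samples_t0.std()
print(std_ratio, np.exp(-1.0))
```

The point is that a velocity field acting on individual samples is enough to move an entire distribution, which is exactly what we want the learned field to do between $p_0(a)$ and $p_1(a)$.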

By now, the motivation for learning the velocity field $v_t$ should be clear. We can use it to construct the flow $\phi_t$, which tells us how to evolve an action sampled from our known distribution over time. Ideally, this learned field should match the true underlying velocity field $u_t(a)$ that moves samples along the probability path $p_t(a)$. This leads us to the Flow Matching (FM) objective:

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, p_t(a)} \left\| v_\theta(t, a, z) - u_t(a) \right\|^2.$$

In practice, optimizing $\mathcal{L}_{\mathrm{FM}}(\theta)$ directly is not possible, since it requires access to the true velocity field $u_t(a)$, which could only be computed if we knew all the intermediate distributions $p_t(a)$. In reality, we only have access to expert action samples $a_{i,1}$ from the final distribution $p_1(a)$, making the direct FM objective intractable.

Instead of trying to model the entire probability path $p_t(a)$ and its vector field $u_t(a)$, however, we can simplify the problem by defining a conditional path $p_t(a \mid a_1)$ for each expert action $a_1$, which describes how a single sample should move smoothly from the prior $p_0(a)$ toward a small neighborhood around that specific expert action. Regressing onto the velocity of these conditional paths gives what is referred to as the Conditional Flow Matching (CFM) objective:

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, q(a_1),\, p_t(a \mid a_1)} \left\| v_\theta(t, a, z) - u_t(a \mid a_1) \right\|^2.$$

Here, $u_t(a \mid a_1)$ is a conditional target vector field that defines how samples from $p_0(a)$ should flow toward a data point $a_1$, and $q(a_1)$ denotes the distribution of expert actions in the dataset. While doing this on a "per sample" level might seem like a suboptimal relaxation of the original FM objective, the two losses differ only by a constant that does not depend on $\theta$. They therefore have identical gradients, and optimizing the CFM loss leads to the same learned velocity field! There exist infinitely many ways to construct these conditional paths between random samples and expert actions, but a simple choice is to let their mean and standard deviation change linearly through time, giving us this conditional flow:

$$\psi_t(a_0) = (1 - (1 - \sigma_{\min})\,t)\,a_0 + t\,a_1.$$
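Here is a quick numerical check of the "mean and standard deviation change linearly" claim. It assumes a standard Gaussian prior and $\sigma_{\min} = 10^{-3}$; both are assumptions for illustration, since the text does not fix either. Under those assumptions, samples of the conditional flow at time $t$ should have mean $t\,a_1$ and standard deviation $1 - (1 - \sigma_{\min})\,t$.

```python
import numpy as np

SIGMA_MIN = 1e-3  # assumed value, not specified in the text

def psi(t, a0, a1, sigma_min=SIGMA_MIN):
    """Conditional flow: moves a prior sample a0 toward the expert action a1."""
    return (1.0 - (1.0 - sigma_min) * t) * a0 + t * a1

rng = np.random.default_rng(0)
a1 = np.array([2.0, -1.0])                # one hypothetical expert action
a0 = rng.standard_normal((100_000, 2))    # assumed prior p_0 = N(0, I)

# Empirical mean and standard deviation of the conditional path at a few times.
stats = {}
for t in (0.0, 0.5, 1.0):
    a_t = psi(t, a0, a1)
    stats[t] = (a_t.mean(axis=0), a_t.std(axis=0))
    print(f"t={t:.1f}  mean={stats[t][0]}  std={stats[t][1]}")
```

At $t=1$ the samples do not collapse onto $a_1$ exactly; they keep a spread of $\sigma_{\min}$ around it, which is the "small neighborhood" mentioned earlier.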

With this choice of conditional paths, we get what is referred to as the Optimal-Transport variant of the Conditional Flow Matching (CFM) objective:

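$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, q(a_1),\, p_t(a \mid a_1)} \left\| v_\theta(t, a, z) - \left( a_1 - (1 - \sigma_{\min})\,a_0 \right) \right\|^2.$$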
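The regression target inside this loss is just the time derivative of the linear conditional flow. The sketch below checks that numerically with a central finite difference. It also checks that the same velocity can be written as a function of the current point $a$ instead of the noise sample $a_0$. The value of $\sigma_{\min}$ and all function names are assumptions for illustration.

```python
import numpy as np

SIGMA_MIN = 1e-3  # assumed value, not specified in the text

def psi(t, a0, a1):
    """Linear conditional flow from the noise sample a0 toward the expert action a1."""
    return (1.0 - (1.0 - SIGMA_MIN) * t) * a0 + t * a1

def ot_target(a0, a1):
    """Time derivative of psi: the constant velocity used as the regression target."""
    return a1 - (1.0 - SIGMA_MIN) * a0

def conditional_field(t, a, a1):
    """The same velocity expressed through the current point a = psi(t, a0, a1)."""
    return (a1 - (1.0 - SIGMA_MIN) * a) / (1.0 - (1.0 - SIGMA_MIN) * t)

rng = np.random.default_rng(0)
a0, a1 = rng.standard_normal(3), rng.standard_normal(3)
t, h = 0.3, 1e-4

# Central finite difference of psi in time, evaluated at t.
finite_diff = (psi(t + h, a0, a1) - psi(t - h, a0, a1)) / (2.0 * h)

print(np.allclose(finite_diff, ot_target(a0, a1), atol=1e-6))
print(np.allclose(conditional_field(t, psi(t, a0, a1), a1), ot_target(a0, a1)))
```

Because the path is a straight line in $t$, the target does not depend on $t$ at all, which is part of what makes this choice so convenient to train with.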

Now that we have everything defined, we are ready to see how the generative model can actually be trained using data tuples of $(o_i, a_{i,1})$.

  1. Draw a pair $(a_{i,1}, o_i)$ from our demonstrations dataset.

  2. Encode the observation $o_i$ using the encoder network, $z_i = E_\phi(o_i)$.

  3. Sample a random action $a_{i,0}$ from our prior distribution $p_0(a)$, and a random time $t \sim \mathcal{U}(0,1)$.

  4. Compute the intermediate action by linearly interpolating between the noise sample and the expert action:

     $$a_{i,t} = (1 - (1 - \sigma_{\min})\,t)\,a_{i,0} + t\,a_{i,1}$$

  5. Compute the target (ground-truth) velocity for this interpolated action using the conditional flow formulation:

     $$u_t(a_{i,t} \mid a_{i,1}) = \frac{a_{i,1} - (1 - \sigma_{\min})\,a_{i,t}}{1 - (1 - \sigma_{\min})\,t} = a_{i,1} - (1 - \sigma_{\min})\,a_{i,0}$$

  6. Pass the sampled time $t$, the interpolated sample $a_{i,t}$, and the encoded observation $z_i$ through the velocity model:

     $$\hat{u}_t = v_\theta(t, a_{i,t}, z_i).$$

  7. Compute the training loss using the Conditional Flow Matching objective:

     $$\mathcal{L}_{\mathrm{CFM}}(\theta) = \left\| \hat{u}_t - u_t(a_{i,t} \mid a_{i,1}) \right\|^2.$$

  8. Backpropagate the loss and update the model parameters $\theta$ with gradient descent.
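The training steps above can be sketched end to end in NumPy. This is a toy under loud assumptions, not a reference implementation: the demonstrations are synthetic, `encode` is an identity stand-in for the encoder, the prior is a standard Gaussian, $\sigma_{\min} = 10^{-3}$, and the velocity model is a linear map with a hand-written gradient so that the example needs no autodiff library. A real policy would use a neural network and an optimizer from a deep learning framework.

```python
import numpy as np

SIGMA_MIN = 1e-3                 # assumed value
ACTION_DIM, OBS_DIM = 2, 2
rng = np.random.default_rng(0)

# Synthetic demonstrations (hypothetical): the expert action is a linear map of the observation.
observations = rng.uniform(-1.0, 1.0, size=(512, OBS_DIM))
expert_actions = observations @ np.array([[1.0, 0.5], [-0.5, 1.0]])

def encode(o):
    """Hypothetical stand-in for E_phi; the identity here."""
    return o

def features(t, a, z):
    """Inputs of the toy linear velocity model: [a, z, t, 1]."""
    return np.concatenate([a, z, t, np.ones((len(a), 1))], axis=1)

# Parameters theta of the toy model v_theta(t, a, z) = features(t, a, z) @ W.
W = np.zeros((ACTION_DIM + OBS_DIM + 2, ACTION_DIM))

def train_step(W, lr=0.05, batch_size=64):
    idx = rng.integers(0, len(observations), size=batch_size)
    o, a1 = observations[idx], expert_actions[idx]         # 1. draw (a_1, o) pairs
    z = encode(o)                                          # 2. encode observations
    a0 = rng.standard_normal(a1.shape)                     # 3. sample noise actions ...
    t = rng.uniform(0.0, 1.0, size=(batch_size, 1))        #    ... and times t ~ U(0, 1)
    a_t = (1.0 - (1.0 - SIGMA_MIN) * t) * a0 + t * a1      # 4. interpolate
    u_t = a1 - (1.0 - SIGMA_MIN) * a0                      # 5. target velocity
    x = features(t, a_t, z)
    u_hat = x @ W                                          # 6. predicted velocity
    err = u_hat - u_t
    loss = np.mean(np.sum(err ** 2, axis=1))               # 7. CFM loss (batch mean)
    grad = 2.0 * x.T @ err / batch_size                    # 8. analytic gradient w.r.t. W
    return W - lr * grad, loss                             #    one gradient-descent update

losses = []
for _ in range(500):
    W, loss = train_step(W)
    losses.append(loss)

print(np.mean(losses[:10]), np.mean(losses[-50:]))
```

The linear model is far too simple to fit the target velocity well, so the loss plateaus rather than approaching zero. The point here is only the order of operations in one training step.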