Flow Matching (from an applied robot learning perspective)
November 16, 2025
I wrote this because I find the concept of flow matching so satisfying, both theoretically and empirically. I will walk through its basic theoretical derivation and recent variants before finally running them on a toy example.
First, it helps to understand what a generative model is in the context of robotics. It is easy to see what that means in other areas: in vision, a model “generates” images, and in language, a model “generates” text. In robotics, we want to “generate” actions that, if sent to a robot, would allow it to interact with the world in a purposeful way.
Flow matching is widely seen today as a state-of-the-art method for action generation. Compared to ACT, it can capture multi-modal behaviors rather than collapsing to a single mode. Unlike VQ-BeT, it operates directly in continuous action space. And relative to its closest predecessor, ADP, it offers faster and more stable inference.
Let's take a look at what typical flow-matching-based closed-loop inference looks like at a high level, and build backwards from there:
1. The robot captures an observation $o$ from its environment.
2. The observation is encoded into a compact latent vector $z = E_\phi(o)$ by an encoder $E_\phi$.
3. A random action $a_0 \sim p_0$ is sampled from a predetermined stochastic distribution $p_0$.
4. The sampled action is evolved using a velocity field $v_\theta(a_t, t, z)$, integrating it from $t = 0$ to $t = 1$.
5. The final evolved action $a_1$ is sent to the robot for execution (open-loop if it is a chunk of actions).
6. Go to step 1.
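The loop above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not a real controller: `encode`, `velocity`, `sample_action`, and `control_loop` are hypothetical names, the "encoder" just averages the observation, the "trained" velocity model is a hand-written field that pulls every action toward the latent, the prior is assumed to be a standard Gaussian, and the integration is plain forward Euler.

```python
import random

def encode(observation):
    """Hypothetical stand-in for the encoder E_phi: returns a 1-D latent z."""
    return [sum(observation) / len(observation)]

def velocity(action, t, z):
    """Hypothetical stand-in for a trained velocity model v_theta(a, t, z).

    Hand-written for illustration: it moves each action dimension along a
    straight line toward z[0]. Only valid for t < 1.
    """
    target = z[0]
    return [(target - a) / (1.0 - t) for a in action]

def sample_action(z, action_dim=2, num_steps=10, rng=random):
    """Steps 3-4: draw a_0 from the prior, then Euler-integrate from t=0 to t=1."""
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # a_0 ~ N(0, 1)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes the values 0, dt, ..., 1 - dt, so 1 - t > 0
        v = velocity(action, t, z)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action

def control_loop(get_observation, send_action, num_iterations=3):
    """Steps 1-6, repeated a fixed number of times instead of forever."""
    for _ in range(num_iterations):
        obs = get_observation()    # step 1: capture an observation
        z = encode(obs)            # step 2: encode it
        action = sample_action(z)  # steps 3-4: sample noise and integrate
        send_action(action)        # step 5: execute on the robot

# Usage with a fake robot: a constant observation and a list as the "actuator".
executed = []
control_loop(lambda: [0.2, 0.4], executed.append)
```

With this particular hand-written field, every executed action should land (up to floating-point error) on the latent value, regardless of the noise it started from. A learned field would of course not be this tidy.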
If we think of our expert actions as coming from an unknown target distribution $p_{\text{data}}$, and the initial action samples that we begin our evolving process from as drawn from a known prior distribution $p_{\text{prior}}$, then we can define a time-dependent probability density path $p_t$ that connects both distributions smoothly as follows:
$$p_t(a), \quad t \in [0, 1], \qquad p_0 = p_{\text{prior}}, \qquad p_1 = p_{\text{data}}$$
This simply means that at $t = 0$, the distribution $p_0$ is our predefined prior, and at $t = 1$, the distribution $p_1$ is the target distribution that we assume represents the expert actions. As $t$ increases from $0$ to $1$, samples from the distribution $p_t$ gradually evolve from random noise toward our expert-like actions.
Since we know what the distribution is at $t = 0$ (we predefine it), we need a mechanism that can continuously push it along a path toward the expert distribution as time progresses. This can be done by defining a velocity field $u_t(a)$, which specifies for every point $a$ in the action space and every time $t$ an instantaneous direction of motion, so that the overall distribution gradually evolves toward the target distribution $p_1$. This velocity field is sufficient to construct a time-dependent flow $\psi_t$, defined by the following ordinary differential equation (ODE):
$$\frac{d}{dt}\psi_t(a) = u_t(\psi_t(a)), \qquad \psi_0(a) = a$$
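As a quick sanity check on how a velocity field determines a flow, here is a small worked example. Writing the velocity field as $u_t(a)$ and the flow as $\psi_t(a)$, assume a hypothetical one-dimensional field $u_t(a) = k\,a$ with a constant $k$ and a standard Gaussian prior. This field is chosen purely because its ODE can be solved by hand; it is not a field used elsewhere in this post.

```latex
\begin{aligned}
\frac{d}{dt}\psi_t(a) &= k\,\psi_t(a), \qquad \psi_0(a) = a \\
\Rightarrow\quad \psi_t(a) &= a\,e^{kt} \\
a_0 \sim \mathcal{N}(0, 1) \;&\Rightarrow\; \psi_t(a_0) \sim \mathcal{N}\!\left(0,\; e^{2kt}\right)
\end{aligned}
```

So this one field, applied to every sample, stretches (for $k > 0$) or shrinks (for $k < 0$) the whole distribution over time, which is exactly the sense in which a velocity field "pushes" a distribution along a path.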
By now, the motivation for learning the velocity field should be clear. We can use it to construct the flow $\psi_t$, which tells us how to evolve an action sampled from our known distribution $p_0$ over time. Ideally, this learned field $v_\theta$ should match the true underlying velocity field $u_t$ that moves samples along the probability path $p_t$. This leads us to the Flow Matching (FM) objective:
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a \sim p_t} \left\| v_\theta(a, t) - u_t(a) \right\|^2$$
In practice, optimizing $\mathcal{L}_{\text{FM}}$ is not directly possible, since it requires access to the true velocity field $u_t$, which could only be computed if we knew all the intermediate distributions $p_t$. In reality, we only have access to expert action samples from the final distribution $p_1$, making the direct FM objective intractable.
However, instead of trying to model the entire probability path $p_t$ and its vector field $u_t$, we can simplify the problem by defining a conditional path $p_t(a \mid a_1)$ for each expert action $a_1$ that describes how a single sample should move smoothly from the prior toward a small neighborhood around that specific expert action. This new objective is referred to as the Conditional Flow Matching (CFM) objective:
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_1 \sim p_1,\; a \sim p_t(\cdot \mid a_1)} \left\| v_\theta(a, t) - u_t(a \mid a_1) \right\|^2$$
Here, $u_t(a \mid a_1)$ is a conditional target vector field that defines how samples from the prior $p_0$ should flow toward a data point $a_1$. While doing this on a "per sample" level might seem like a suboptimal relaxation of the original FM objective, the two objectives have the same gradients with respect to $\theta$, so optimizing the CFM loss leads to the same learned velocity field! There exist an infinite number of ways to construct these conditional paths between random samples and expert actions, but a simple choice (assuming a standard Gaussian prior) is to let their mean and standard deviation change linearly through time, $\mu_t = t\,a_1$ and $\sigma_t = 1 - t$, giving us this conditional flow:
$$\psi_t(a_0 \mid a_1) = (1 - t)\,a_0 + t\,a_1$$
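It is worth spelling out the velocity that this choice implies. The short derivation below assumes the conditional flow is the straight-line interpolation $\psi_t(a_0 \mid a_1) = (1 - t)\,a_0 + t\,a_1$ between a noise sample $a_0$ and an expert action $a_1$, and writes the conditional velocity field as $u_t(a \mid a_1)$; treat the notation as a sketch rather than a canonical form.

```latex
\begin{aligned}
\psi_t(a_0 \mid a_1) &= (1 - t)\,a_0 + t\,a_1 \\
\frac{d}{dt}\psi_t(a_0 \mid a_1) &= a_1 - a_0 \\
a = \psi_t(a_0 \mid a_1) \;\Rightarrow\; a_0 = \frac{a - t\,a_1}{1 - t}
\;&\Rightarrow\; u_t(a \mid a_1) = a_1 - \frac{a - t\,a_1}{1 - t} = \frac{a_1 - a}{1 - t}
\end{aligned}
```

Along a given path the velocity is therefore the constant $a_1 - a_0$: each sample moves in a straight line at constant speed from its noise value to its expert action.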
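With this choice of conditional paths, we get what is referred to as the Optimal-Transport Conditional Flow Matching (OT-CFM) objective, where $a_0$ is a sample from the prior $p_0$ and $a_1$ is an expert action:
$$\mathcal{L}_{\text{OT-CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0 \sim p_0,\; a_1 \sim p_1} \left\| v_\theta\big((1 - t)\,a_0 + t\,a_1,\; t\big) - (a_1 - a_0) \right\|^2$$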
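To make the objective concrete, here is a minimal sketch of a Monte Carlo estimate of this loss for a one-dimensional action space. Everything in it is an assumption made for illustration: the function name `ot_cfm_loss`, the `velocity_fn(a, t)` signature, the standard Gaussian prior, the straight-line interpolation between noise and expert action, and the toy "dataset" with a single expert action. Time is sampled from $[0, 0.99]$ rather than $[0, 1]$ only because one of the candidate fields divides by $1 - t$.

```python
import random

def ot_cfm_loss(velocity_fn, expert_actions, num_samples=2000, seed=0):
    """Monte Carlo estimate of the OT-CFM loss for scalar actions.

    velocity_fn(a, t) is any candidate velocity field (hypothetical signature).
    The prior p_0 is assumed to be a standard Gaussian.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.0, 0.99)       # stay away from t = 1 (see lead-in)
        a0 = rng.gauss(0.0, 1.0)         # noise sample from the prior
        a1 = rng.choice(expert_actions)  # expert action from the dataset
        a_t = (1.0 - t) * a0 + t * a1    # point on the straight-line path
        target = a1 - a0                 # conditional target velocity
        total += (velocity_fn(a_t, t) - target) ** 2
    return total / num_samples

# Toy dataset: every demonstration is the same action, so the target
# distribution is a single point.
expert = [2.0]

def zero_field(a, t):
    return 0.0                           # a field that never moves anything

def straight_field(a, t):
    return (expert[0] - a) / (1.0 - t)   # heads straight for the expert action

loss_zero = ot_cfm_loss(zero_field, expert)
loss_straight = ot_cfm_loss(straight_field, expert)
```

In this degenerate single-action case, `straight_field` reproduces the target velocity exactly, so `loss_straight` should be zero up to floating-point error, while `loss_zero` should come out near 5, the value of $\mathbb{E}[(2 - a_0)^2]$ for a standard Gaussian $a_0$.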
Now that we have everything defined, we are ready to see how the generative model can actually be trained using data tuples of observations and expert actions $(o_i, a_i)$, where each expert action $a_i$ is treated as a sample from the target distribution.
1. Draw a pair $(o_i, a_i)$ from our demonstrations dataset.
2. Encode the observation using the encoder network: $z_i = E_\phi(o_i)$.
3. Sample a random action $a_0 \sim p_0$ from our prior distribution, and a random time $t \sim \mathcal{U}[0, 1]$.
4. Compute the intermediate action by linearly interpolating between the noise sample and the expert action: $a_t = (1 - t)\,a_0 + t\,a_i$.
5. Compute the target (ground-truth) velocity for this interpolated action using the conditional flow formulation: $u_t = a_i - a_0$.
6. Pass the interpolated sample $a_t$, the time $t$, and the encoded observation $z_i$ through the velocity model: $\hat{v} = v_\theta(a_t, t, z_i)$.
7. Compute the training loss using the Conditional Flow Matching objective: $\mathcal{L}(\theta) = \left\| \hat{v} - u_t \right\|^2$.
8. Backpropagate the loss and update the model parameters with gradient descent.
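The training steps above can be sketched end to end with nothing but the Python standard library. This is a deliberately tiny illustration, not a realistic setup: the demonstrations are made up (scalar observation $o$, expert action $a = 2o$), `encode` is an identity stand-in for the encoder, the velocity model is a hypothetical linear function of $(a_t, t, z_i)$ with a hand-derived gradient instead of a neural network with backpropagation, and the prior is assumed to be a standard Gaussian.

```python
import random

def encode(obs):
    """Hypothetical stand-in for the encoder E_phi (identity on a scalar)."""
    return obs

def features(a_t, t, z):
    """Inputs of a hypothetical linear velocity model v_theta(a_t, t, z)."""
    return [a_t, t, z, 1.0]

def train(num_steps=5000, lr=0.02, seed=0):
    rng = random.Random(seed)
    # Made-up demonstrations: observation o in [-1, 1], expert action a = 2 * o.
    observations = [rng.uniform(-1.0, 1.0) for _ in range(256)]
    dataset = [(o, 2.0 * o) for o in observations]
    theta = [0.0, 0.0, 0.0, 0.0]  # model weights, one per feature
    losses = []
    for _ in range(num_steps):
        o_i, a_i = rng.choice(dataset)       # 1. draw a demonstration pair
        z_i = encode(o_i)                    # 2. encode the observation
        a_0 = rng.gauss(0.0, 1.0)            # 3. sample noise from the prior
        t = rng.random()                     #    and a time in [0, 1)
        a_t = (1.0 - t) * a_0 + t * a_i      # 4. interpolate
        u_t = a_i - a_0                      # 5. target velocity
        x = features(a_t, t, z_i)
        v_hat = sum(w * xi for w, xi in zip(theta, x))  # 6. predict velocity
        err = v_hat - u_t
        losses.append(err ** 2)              # 7. squared-error loss
        # 8. SGD step, using d(err^2)/d(theta_j) = 2 * err * x_j
        theta = [w - lr * 2.0 * err * xi for w, xi in zip(theta, x)]
    return theta, losses

theta, losses = train()
early = sum(losses[:500]) / 500   # average loss over the first 500 steps
late = sum(losses[-500:]) / 500   # average loss over the last 500 steps
```

With these toy demonstrations, `late` should end up clearly below `early`. A linear model cannot represent the true velocity field exactly, so the loss does not go to zero; in a real system the linear model would be replaced by a neural network and the manual gradient by automatic differentiation.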