Flow Matching (from a robot learning perspective)

Flow Matching (from an applied robot learning perspective)

November 16, 2025

I wrote this given how satisfying I found the concept of flow matching to be, both theoretically and empirically. I will walk through its basic theoretical derivation and its recent variants, before finally running them on a toy example.

First, it helps to understand what a generative model is in the context of robotics. It is easy to see what that means in other areas: in vision, a model “generates” images, and in language, a model “generates” text. In robotics, we want to “generate” actions that, when sent to a robot, allow it to interact with the world in a purposeful way.

Flow matching is seen today as the state-of-the-art method for action generation. Compared to ACT, it can capture multi-modal behaviors rather than collapsing to a single mode. Unlike VQ-BeT, it operates directly in continuous action space. And relative to its closest predecessor, ADP, it offers faster and more stable inference.

Let's take a look at what a typical flow-matching-based closed-loop inference loop looks like at a high level, and build backwards from there:

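1. The robot receives an observation $o_i$ from its environment.
2. The observation $o_i$ is encoded into a latent vector $z_i = E_\phi(o_i)$.
3. An initial action $a_{i,0}$ is sampled from a noise distribution $p_0(a)$.
4. Starting from $a_{i,0}$, the learned velocity field $v_\theta(t, a_{i,t}, z_i)$ is integrated from $t=0\rightarrow1$.
5. The resulting action $a_{i,1}$ is sent to the robot for execution (sent open-loop if it's a chunk of actions).
6. Go to step 1.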
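The loop above can be sketched in a few lines of plain Python. This is only a toy sketch under loud assumptions: `encode` and `velocity` are hypothetical stand-ins for the encoder and the trained velocity network, $p_0$ is assumed to be a standard Gaussian, and the ODE is integrated with plain Euler steps.

```python
import random

def encode(observation):
    # Hypothetical stand-in for the encoder E_phi: the identity map.
    return list(observation)

def velocity(t, a, z):
    # Hypothetical stand-in for the learned field v_theta(t, a, z).
    # It points from the current action toward z, so the "expert action"
    # in this toy setup is simply the encoded observation itself.
    return [(z_j - a_j) / (1.0 - t) for a_j, z_j in zip(a, z)]

def sample_action(observation, num_steps=10, rng=random):
    z = encode(observation)                   # step 2: encode the observation
    a = [rng.gauss(0.0, 1.0) for _ in z]      # step 3: assumes p_0 = N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):                # step 4: Euler integration, t = 0 -> 1
        t = k * dt                            # t stays below 1, so 1 - t > 0
        v = velocity(t, a, z)
        a = [a_j + dt * v_j for a_j, v_j in zip(a, v)]
    return a                                  # this is a_{i,1}

def control_loop(get_observation, send_action, num_iterations):
    for _ in range(num_iterations):
        observation = get_observation()       # step 1: read an observation
        action = sample_action(observation)
        send_action(action)                   # step 5, then back to step 1
```

On a real robot, `get_observation` and `send_action` would wrap the sensor and actuator interfaces, and the number of integration steps becomes a trade-off between inference speed and accuracy.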

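We want the generated action $a_{i,1}$ to look like a sample from the distribution of expert actions $p_1(a)$, while the initial action $a_{i,0}$ is drawn from a simple noise distribution $p_0(a)$. To get from one to the other, we define a probability density path $p_t(a)$ that connects both distributions smoothly as follows: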

$$ p : [0, 1] \times \mathbb{R}^{d} \to \mathbb{R}_{> 0}, \tag{1} $$

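where $p_t(a)$ is the density of actions at time $t$. At $t = 0$ the path starts at the noise distribution $p_{0}(a)$, and at $t = 1$ it ends at the expert action distribution $p_1(a)$. As $t$ goes from $0$ to $1$, samples from $p_t(a)$ gradually evolve from random noise toward our expert-like actions.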

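To move a sample drawn at $t = 0$ along this path, we use a time-dependent velocity field $v_t(a)$ that tells us in which direction an action $a$ should move at time $t$ so that it ends up distributed according to $p_1(a)$. This velocity field induces a flow $\phi : [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$, defined by the following ordinary differential equation (ODE):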

$$ \frac{d}{dt} \phi_{t}(a) = v_{t}(\phi_{t}(a)), \tag{2} $$

$$ \phi_{0}(a) = a. \tag{3} $$
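A small worked example may help make this definition concrete. The velocity field below is an assumption chosen purely for illustration: it pushes every point toward a fixed target $a_1$. Its flow can be written in closed form and checked against the ODE and its initial condition:

$$
\begin{aligned}
v_t(a) &= \frac{a_1 - a}{1 - t}, \qquad t \in [0, 1), \\
\phi_t(a) &= (1 - t)\, a + t\, a_1, \\
\frac{d}{dt} \phi_t(a) &= a_1 - a, \\
v_t(\phi_t(a)) &= \frac{a_1 - (1 - t)\, a - t\, a_1}{1 - t} = \frac{(1 - t)(a_1 - a)}{1 - t} = a_1 - a, \\
\phi_0(a) &= a.
\end{aligned}
$$

Both sides of the ODE agree, so this $\phi_t$ is indeed the flow of $v_t$: every starting point travels in a straight line and approaches $a_1$ as $t \to 1$.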

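We parameterize the velocity field $v_t$ with a neural network $v_\theta$ and train it so that its flow $\phi_t$ reproduces the desired path. Concretely, we regress $v_\theta$ onto a target velocity field $u_t(a)$ that generates the probability path $p_t(a)$. This leads us to the Flow Matching (FM) objective: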

$$ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_{t}(a)} \left\| v_{\theta}(t, a, z) - u_{t}(a) \right\|^{2}. \tag{4} $$

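Unfortunately, evaluating $\mathcal{L}_{FM}(\theta)$ requires the target velocity field $u_t(a)$ and the probability path $p_t(a)$ in closed form, and we have neither. All we have are expert actions $a_{i,1}$ sampled from $p_1(a)$, making the direct FM objective intractable.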

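Instead of working with the marginal path $p_t(a)$ and its velocity field $u_t(a)$ directly, we can define a conditional probability path $p_t(a \mid a_1)$ for each expert action $a_1$, which moves samples from $p_0(a)$ toward a small neighborhood around that specific expert action. This new objective is referred to as the Conditional Flow Matching (CFM) objective: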

$$ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(a_{1}), p_{t}(a \mid a_{1})} \left\| v_{\theta}(t, a, z) - u_{t}(a \mid a_{1}) \right\|^{2}. \tag{5} $$

Here $q(a_1)$ is the distribution of expert actions in our dataset, and $u_t(a \mid a_{1})$ is the conditional velocity field that transports samples from $p_0(a)$ toward $a_{1}$. While it might seem like a suboptimal relaxation of the original FM objective to do this on a "per sample" level, the CFM and FM objectives turn out to have identical gradients with respect to $\theta$, so optimizing the CFM loss leads to the same learned velocity field! There are infinitely many ways to construct these conditional paths between random samples and expert actions, but a simple choice is to let their mean and standard deviation change linearly through time, giving us this conditional flow:

$$ \psi_{t}(a_{0}) = (1 - (1 - \sigma_{\min}) t)\, a_{0} + t\, a_{1}. \tag{6} $$
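To get a feel for this conditional flow, here is a minimal sketch that evaluates it at a few times. The value of `SIGMA_MIN` and the two example vectors are assumptions made up for illustration; the text does not specify them.

```python
SIGMA_MIN = 1e-4  # assumed value, not specified in the text

def conditional_flow(t, a0, a1, sigma_min=SIGMA_MIN):
    """psi_t(a0) = (1 - (1 - sigma_min) * t) * a0 + t * a1, element-wise."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * x0 + t * x1 for x0, x1 in zip(a0, a1)]

a0 = [0.8, -1.5]   # a made-up noise sample
a1 = [0.2, 0.4]    # a made-up expert action

start = conditional_flow(0.0, a0, a1)    # equals a0: pure noise
middle = conditional_flow(0.5, a0, a1)   # roughly halfway between a0 and a1
end = conditional_flow(1.0, a0, a1)      # a1 plus a residual of sigma_min * a0
```

At $t = 1$ the noise sample is not removed entirely: it is scaled down to $\sigma_{\min} a_0$, which is what keeps the endpoint in a small neighborhood around the expert action rather than exactly on it.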

With this choice of conditional paths, we get what is referred to as the Optimal-Transport Conditional Flow Matching (OT-CFM) objective:

$$ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(a_{1}), p_{0}(a_{0})} \left\| v_{\theta}(t, \psi_{t}(a_{0}), z) - \big(a_{1} - (1 - \sigma_{\min})\, a_{0}\big) \right\|^{2}. \tag{7} $$
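To see where the regression target in this objective comes from, one can differentiate the linear conditional flow with respect to time. A short derivation, assuming the flow $\psi_t(a_0) = (1 - (1 - \sigma_{\min}) t)\, a_0 + t\, a_1$:

$$
\begin{aligned}
\frac{d}{dt} \psi_t(a_0) &= \frac{d}{dt} \Big[ (1 - (1 - \sigma_{\min}) t)\, a_0 + t\, a_1 \Big] \\
&= a_1 - (1 - \sigma_{\min})\, a_0 .
\end{aligned}
$$

The target does not depend on $t$: each noise sample travels toward its expert action in a straight line at constant speed. In the limit $\sigma_{\min} \to 0$ it reduces to $a_1 - a_0$.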

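Putting everything together, training only requires a dataset of expert demonstrations, stored as observation-action pairs $(o_i, a_{i,1})$. A single training step then looks like this: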

1. Sample a pair $(a_{i,1}, o_{i})$ from our demonstrations dataset.
2. Encode the observation $o_{i}$ using the encoder network $z_i = E_\phi(o_i)$.
3. Sample a noise action $a_{i,0}$ from $p_0(a)$ and a time $t \sim \mathcal{U}(0, 1)$.
4. Compute the intermediate action by linearly interpolating between the noise sample and the expert action:

   $$ a_{i, t} = (1 - (1 - \sigma_{\min}) t)\, a_{i, 0} + t\, a_{i, 1} \tag{8} $$

5. Compute the target (ground-truth) velocity for this interpolated action using the conditional flow formulation:

   $$ u_{t}(a_{i, t} \mid a_{i, 1}) = \frac{a_{i, 1} - (1 - \sigma_{\min})\, a_{i, t}}{1 - (1 - \sigma_{\min})\, t} = a_{i, 1} - (1 - \sigma_{\min})\, a_{i, 0} \tag{9} $$

6. Pass the interpolated action $a_{i,t}$, the time $t$, and the latent $z_i$ through the velocity model:

   $$ \hat{u}_{t} = v_{\theta}(t, a_{i, t}, z_{i}). \tag{10} $$

7. Compute the training loss using the Conditional Flow Matching objective:

   $$ \mathcal{L}_{CFM}(\theta) = \left\| \hat{u}_{t} - u_{t}(a_{i, t} \mid a_{i, 1}) \right\|^{2}. \tag{11} $$

8. Update $\theta$ with gradient descent.
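The training steps above can be sketched end to end in plain Python. Everything specific here is an assumption for illustration: actions and observations are scalars, the encoder is the identity, the velocity model is a hypothetical four-parameter linear function with a hand-written gradient, the value of `SIGMA_MIN` is made up, and the toy "dataset" simply sets the expert action equal to the observation. A real setup would use a neural network and an autodiff framework instead.

```python
import random

SIGMA_MIN = 1e-4  # assumed value, not specified in the text

def encode(o):
    # Hypothetical identity encoder standing in for E_phi.
    return o

def v_theta(theta, t, a, z):
    # Hypothetical linear velocity model: w_t * t + w_a * a + w_z * z + b.
    w_t, w_a, w_z, b = theta
    return w_t * t + w_a * a + w_z * z + b

def cfm_training_step(theta, o, a1, rng, lr=0.01):
    z = encode(o)                                        # step 2
    a0 = rng.gauss(0.0, 1.0)                             # step 3: assumes p_0 = N(0, 1)
    t = rng.random()                                     # step 3: t ~ U(0, 1)
    a_t = (1.0 - (1.0 - SIGMA_MIN) * t) * a0 + t * a1    # step 4: interpolate
    target = a1 - (1.0 - SIGMA_MIN) * a0                 # step 5: target velocity
    err = v_theta(theta, t, a_t, z) - target             # step 6: prediction error
    loss = err ** 2                                      # step 7: CFM loss
    grad = [2 * err * t, 2 * err * a_t, 2 * err * z, 2 * err]
    new_theta = [p - lr * g for p, g in zip(theta, grad)]  # step 8: gradient step
    return new_theta, loss

def train(num_steps=2000, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0, 0.0]
    losses = []
    for _ in range(num_steps):
        o = rng.uniform(-1.0, 1.0)   # step 1: toy demonstration ...
        a1 = o                       # ... whose expert action equals the observation
        theta, loss = cfm_training_step(theta, o, a1, rng)
        losses.append(loss)
    return theta, losses

theta, losses = train()
```

Because a single $(t, a_0)$ pair is drawn per step, the per-step loss is noisy, so averaging it over a window of steps gives a better picture of training progress than individual values.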