FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

1 Lab for Autonomous Robotics Research, Seoul National University 2 Georgia Institute of Technology
* Equal contribution

Overview

The main bottleneck in supervised learning of expressive policies in online RL is not expressivity itself, but where the supervision comes from. Prior methods obtain supervised targets by reweighting samples from a global proposal (global importance sampling, or global IS), but in high-dimensional action spaces these samples rarely fall near target-relevant actions. The resulting proposal–target mismatch causes weight degeneracy and provides only sparse supervision for policy improvement.
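The weight-degeneracy claim can be illustrated with a toy sketch that is not taken from the paper. It assumes a uniform global proposal on [-1, 1]^dim, a hypothetical quadratic Q-function peaked at `best_action`, and a target proportional to exp(alpha * Q). The names `global_is_ess`, `alpha`, and `best_action` are illustrative only. The Kish effective sample size (ESS) measures how many samples effectively carry weight.

```python
import numpy as np


def effective_sample_size(log_weights):
    """Kish effective sample size of self-normalized importance weights."""
    w = np.exp(log_weights - np.max(log_weights))  # subtract max for stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)


def global_is_ess(dim, n_samples=64, alpha=10.0, seed=0):
    """ESS when actions come from a uniform global proposal on [-1, 1]^dim."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    best_action = np.full(dim, 0.5)                     # hypothetical optimum
    q = -np.sum((actions - best_action) ** 2, axis=1)   # toy Q-function
    # Target is proportional to exp(alpha * Q); the uniform proposal density is
    # constant, so the log-weights equal alpha * Q up to an additive constant.
    return effective_sample_size(alpha * q)


if __name__ == "__main__":
    for dim in (2, 8, 32, 64):
        print(dim, round(global_is_ess(dim), 2))
```

In this toy setting the ESS typically shrinks toward 1 as `dim` grows, meaning a single sample dominates the self-normalized weights. The exact numbers depend on the assumed Q-function and temperature.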

FLAG addresses this bottleneck by localizing the target-matching problem (local IS). Unlike global IS, FLAG conditions both the proposal and the target distribution on the same flow latent variable $z$, so importance sampling is performed inside a shared, latent-conditioned local region. This gives the proposal meaningful overlap with the target where policy improvement actually occurs.
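The contrast can be written schematically as follows. The symbols $p^{*}$ (target), $q$ (proposal), and $w_i$ (self-normalized weight) are notation assumed here for illustration; the paper's exact definitions may differ.

```latex
% Global IS: a single proposal per state s
\[
  a_i \sim q(\cdot \mid s), \qquad
  w_i \propto \frac{p^{*}(a_i \mid s)}{q(a_i \mid s)}
\]
% Local IS: proposal and target are conditioned on the same latent z
\[
  a_i \sim q(\cdot \mid s, z), \qquad
  w_i \propto \frac{p^{*}(a_i \mid s, z)}{q(a_i \mid s, z)}
\]
```

In the second form both densities live in the same latent-indexed region, so the ratio stays well behaved even when the global proposal and target barely overlap.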

This simple shift from global IS to local IS makes supervised MaxEnt-RL practical for flow policies: (i) no backpropagation through time (BPTT), (ii) few importance samples, and (iii) state-of-the-art performance.

Global proposal samples rarely land on high-Q-value actions (weight collapse); FLAG’s latent-local proposals concentrate informative samples where policy improvement actually happens.

Multi-goal Environment (Didactic Experiment)

Figure: Comparison of global and local importance sampling in a multi-goal environment.
The multi-goal task isolates the small-sample regime: global-IS baselines fail when $N \le 8$, while FLAG recovers the optimal multi-modal behavior with only $N = 2$ samples.

Main Result: Scaling to High-dimensional Control

Figure 3: Performance and GPU hours for global and local proposal sampling across MuJoCo, DMC Dog, and MyoSuite.

FLAG sustains the highest return at low GPU cost as action dimensionality scales (MuJoCo → DMC Dog → MyoSuite). Detailed comparisons across budgets and critic settings are in Experimental Results · Q1.

Method

Q1 How do we construct local IS, and is it consistent with the original MDP?
Challenge

For local IS to be principled — not just a trick — two conditions must hold:

  • Localization. The proposal and target distributions must share a region indexed by a latent $z$.
  • Consistency. Optimizing inside that local region must be the same optimization problem as optimizing the RL objective in the original MDP.

Without Consistency, local IS optimizes the wrong objective and any gains are illusory.

Q2 How do we incorporate this into MaxEnt-RL and update the policy?
Challenge

Two obstacles block plugging the z-MDP into MaxEnt-RL:

  • Intractable Entropy. The composite policy’s log-probability $\log\pi(a \mid s)$ requires marginalizing over the latent $z$, and this marginal has no closed form.
  • RL via Supervised Learning. We want to cast MaxEnt-RL as an EM algorithm whose policy update reduces to a supervised learning problem — avoiding backprop through the flow ODE (BPTT).
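The first obstacle can be stated as a formula. The notation below is an assumption for illustration: $p(z)$ denotes the latent prior and $\pi(a \mid s, z)$ the latent-conditioned action distribution.

```latex
\[
  \pi(a \mid s) = \int \pi(a \mid s, z)\, p(z)\, dz,
  \qquad
  \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr)
    = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[\log \pi(a \mid s)\bigr]
\]
```

The entropy term needs $\log \pi(a \mid s)$, which is the log of an integral over $z$; for a general flow policy this integral cannot be evaluated in closed form.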
Q3 Does FLAG provably improve the policy, and how does it relate to SAC?
Challenge

FLAG updates the policy by supervised distillation, not by differentiating the objective through the flow. Two guarantees are therefore not obvious — and this section establishes both:

  • Monotonic improvement. Does FLAG’s update actually raise the objective, i.e., is $\mathcal{J}_{k+1} \geq \mathcal{J}_k$ guaranteed?
  • Relation to SAC. We optimize a MaxEnt-RL objective — so how does FLAG’s update relate to Soft Actor–Critic (SAC), and does it inherit SAC’s soft policy improvement?
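For reference, the standard MaxEnt-RL objective and SAC's soft policy improvement step are sketched below in the usual SAC notation (temperature $\alpha$, policy class $\Pi$, normalizer $Z$). The precise objective $\mathcal{J}_k$ analyzed in the paper may differ in detail.

```latex
\[
  \mathcal{J}(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \bigl[\, r(s_t, a_t) + \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \bigr]
\]
\[
  \pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi}
    D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s) \,\middle\|\,
      \frac{\exp\bigl(Q^{\pi_{\mathrm{old}}}(s, \cdot)/\alpha\bigr)}
           {Z^{\pi_{\mathrm{old}}}(s)} \right)
\]
```

SAC's soft policy improvement result states that this projection yields $Q^{\pi_{\mathrm{new}}}(s, a) \geq Q^{\pi_{\mathrm{old}}}(s, a)$ for all state-action pairs, which is the property the second question asks FLAG to inherit.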

Experimental Results

We organize the results around the three questions posed in Section 5 of the paper.

Term

best-of-P: when sampling actions from the policy (e.g., during rollout and policy update), P candidate actions are drawn and the one with the highest Q-value is selected.
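A minimal sketch of best-of-P selection as defined above. The callables `sample_action` and `q_value`, and the toy policy and critic used to exercise them, are hypothetical stand-ins rather than the paper's implementation.

```python
import numpy as np


def best_of_p(sample_action, q_value, state, p, rng):
    """Draw p candidate actions and return the one with the highest Q-value."""
    candidates = np.stack([sample_action(state, rng) for _ in range(p)])
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]


def toy_policy(state, rng):
    """Hypothetical policy: uniform 2-D actions, ignoring the state."""
    return rng.uniform(-1.0, 1.0, size=2)


def toy_q(state, action):
    """Hypothetical critic: prefers actions close to (0.5, 0.5)."""
    return -float(np.sum((action - 0.5) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(best_of_p(toy_policy, toy_q, np.zeros(3), p=4, rng=rng))
```

With P = 1 this reduces to ordinary sampling from the policy; larger P trades extra critic evaluations for greedier actions.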

Q1 Does FLAG scale to high-dimensional action spaces under limited sample budgets?
A1
FLAG is Scalable to High-dimensional Action Spaces and Robust to Sample Budget
Q1.1 Is FLAG scalable to high-dimensional action spaces?

FLAG stays in the high-return, low-runtime region as action dimensionality increases, while global-proposal baselines degrade or require larger best-of-P budgets.

Q1.2 Is FLAG robust to the sample budget?

Local proposal matching keeps importance samples informative even when the update uses only a small number of samples.

Table 1: Ablation study on the number of training samples in DMC Dog tasks.
Figure: Learning curves.
Q2 How does FLAG compare to action-gradient and BPTT-based actor-critic methods?
A2
FLAG Outperforms BPTT-based Actor-Critic and Action Gradient Methods
Q3 How do our key design choices, covariance scheduling and the guidance buffer, connect to the theoretical results?
A3
FLAG Key Design Choices Align with Theoretical Results
A3.1 Covariance Scheduling: controls σ²ₖ (Theorem 4.5, Method Q3)

Annealing the local covariance suppresses the covariance-dependent reward drift in Theorem 4.5 while preserving enough early exploration.

σinit (σfinal) | -1(-1) | -1(-2) | -1(-3) | -2(-2) | -2(-3) | -2(-4) | -2(-5) | -2(-6)
Return (1k) ↑  | 0.614  | 0.675  | 0.732  | 0.588  | 0.680  | 0.597  | 0.547  | 0.269
DMC Dog-run, 5 seeds. -2(-3) is the default used in Sections 5.1 and 5.2.
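One way to anneal a local standard deviation from an initial to a final value is a geometric (log-linear) schedule, sketched below. This is an illustrative assumption: the page does not specify the schedule's functional form or define the table's labels, and the defaults `1e-2` and `1e-3` are hypothetical placeholders that correspond to the -2(-3) setting only if those labels are read as base-10 exponents.

```python
def annealed_sigma(step, total_steps, sigma_init=1e-2, sigma_final=1e-3):
    """Geometric interpolation from sigma_init (step 0) to sigma_final (last step).

    Assumed schedule for illustration; not confirmed by the source.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return sigma_init * (sigma_final / sigma_init) ** frac


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, annealed_sigma(step, total_steps=1000))
```

A geometric schedule spends equal training time per order of magnitude of sigma, which keeps early exploration wide while still reaching a small final covariance.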
A3.2 Guidance Buffer: controls εₖ (Theorem 4.5, Method Q3)

A moderate guidance buffer reduces CFM projection error by reusing recent improved action labels without letting targets become stale.

Buffer Size   | 0     | 10.24k | 51.2k | 102.4k | 204.8k
Return (1k) ↑ | 0.601 | 0.680  | 0.670 | 0.618  | 0.589
DMC Dog-run, 5 seeds. 10.24k is the default used in Sections 5.1 and 5.2.
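The staleness trade-off can be pictured with a fixed-capacity FIFO buffer of (state, improved-action) pairs, sketched below. The class name `GuidanceBuffer` and its methods are hypothetical; the paper's actual buffer may store different fields or sample differently.

```python
import random
from collections import deque


class GuidanceBuffer:
    """FIFO store of (state, improved_action) labels with a fixed capacity."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry once capacity is reached,
        # which bounds how stale a stored label can become.
        self.items = deque(maxlen=capacity)

    def add(self, state, improved_action):
        self.items.append((state, improved_action))

    def sample(self, batch_size):
        """Uniformly sample up to batch_size stored labels without replacement."""
        k = min(batch_size, len(self.items))
        return random.sample(list(self.items), k)

    def __len__(self):
        return len(self.items)


if __name__ == "__main__":
    buf = GuidanceBuffer(capacity=3)
    for i in range(5):
        buf.add(state=i, improved_action=10 * i)
    print(len(buf), list(buf.items))
```

A larger capacity reuses more labels per update but keeps older, possibly outdated targets around longer, which matches the moderate-buffer trend in the table above.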
