We introduce the Embodied Reasoning Agent (ERA), a framework that transforms a compact Vision Language Model (VLM) into a performant and efficient embodied agent. In this work, we study two questions: (1) What prior knowledge does an embodied agent require before RL? (2) What makes RL in long-horizon embodied tasks stable and effective? We distill the answers into a unified post-training regime that can deliver both a high-level planning agent and a low-level control agent through different curation of the training data. This comprehensive approach not only solves level-specific tasks but also paves the way for future hierarchical policies and, thus, more general embodied intelligence.
We build ERA as a general two-stage approach: first, (1) infuse foundational capabilities into the model, organized into three kinds of embodied prior knowledge, and then (2) refine the agent with online reinforcement learning using rich process rewards and turn-level GAE.
The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets.
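To make the trajectory-augmentation step concrete, here is a minimal sketch of how one trajectory step might be enriched with model-generated reasoning and then flattened into a supervised fine-tuning example. The record fields and the `<think>`/`<action>` tags are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AugmentedStep:
    observation: str   # e.g., a scene description or image reference
    instruction: str   # the episode's task instruction
    reasoning: str     # structured reasoning generated by a stronger model
    action: str        # ground-truth action from the original trajectory

@dataclass
class AugmentedTrajectory:
    task_id: str
    steps: List[AugmentedStep] = field(default_factory=list)

def to_sft_example(step: AugmentedStep) -> dict:
    """Flatten one augmented step into a supervised fine-tuning pair,
    placing the reasoning before the action so the model learns to
    'think before acting'."""
    return {
        "prompt": f"Instruction: {step.instruction}\nObservation: {step.observation}",
        "target": f"<think>{step.reasoning}</think>\n<action>{step.action}</action>",
    }
```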
In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization.
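As an illustration of turn-level policy optimization, the sketch below computes Generalized Advantage Estimation (GAE) at the granularity of whole agent turns rather than individual tokens; in a turn-level scheme, the advantage for a turn would typically be broadcast to every token the policy generated in that turn. The shaped per-turn rewards and the hyperparameter values are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages once per agent turn.

    rewards: per-turn shaped rewards r_1..r_T (e.g., dense process
             reward plus final task outcome), length T
    values:  critic estimates V(s_1)..V(s_{T+1}), length T + 1;
             use V(s_{T+1}) = 0 at episode termination
    Returns per-turn advantages and value targets.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at turn t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns

# Example: a 3-turn episode with small process rewards and a success
# bonus at the final turn; the last value is 0 since the episode ends.
adv, ret = turn_level_gae(rewards=[0.1, 0.0, 1.0],
                          values=[0.5, 0.4, 0.6, 0.0])
```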
ERA empowers a 3B model to surpass all existing training-based models of equal or larger size, and to improve upon proprietary models such as GPT-4o by 8.4 and 19.4 points on high-level and low-level tasks, respectively, on EmbodiedBench, a comprehensive evaluation platform. Importantly, it demonstrates strong generalization to entirely unseen task categories (Spatial and Common Sense).
(1) Trajectory-Augmented Priors Achieve the Largest Individual Gains in Generalization.
(2) Environment-Anchored Priors Improve Seen and Unseen Tasks Equally, While External Knowledge Priors Favor Unseen Tasks.
(3) Combining Trajectory-Augmented and Environment-Anchored Priors Elicits the Best Performance.
The example below illustrates this: after training with the ERA framework, a 3B model that originally failed on all tasks can now perform step-by-step reasoning and actions to accomplish very challenging tasks: (a) on EB-ALFRED, it identifies and reflects on earlier errors to finally place a plate onto a specific spot on the table; (b) on EB-Manipulation, it accurately places the star into the correct slot of the shape sorter.
@article{chen2025era,
  title={ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning},
  author={Chen, Hanyang and Zhao, Mark and Yang, Rui and Ma, Qinwei and Yang, Ke and Yao, Jiarui and Wang, Kangrui and Bai, Hao and Wang, Zhenhailong and Pan, Rui and Zhang, Mengchao and Barreiros, Jose and Onol, Aykut and Zhai, ChengXiang and Ji, Heng and Li, Manling and Zhang, Huan and Zhang, Tong},
  year={2025}
}