ET-VLA:
Embodiment Transfer Learning for Vision-Language-Action Models

Abstract

Vision-language-action (VLA) models have significantly advanced robotic learning, enabling training on large-scale, cross-embodiment data and fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs struggle with multi-robot collaboration. We introduce embodiment transfer learning, denoted ET-VLA, a novel framework for efficiently and effectively transferring pre-trained VLAs to multi-robot systems. The core of ET-VLA is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and to emit the precise number of action tokens for the new embodiment. Following SCP, the model is fine-tuned on target-embodiment data. To further enhance performance in multi-embodiment settings, we present Embodied Graph-of-Thought, a novel technique that formulates each sub-task as a node, allowing the VLA model to distinguish the functionality and role of each embodiment during task execution. Our work considers bimanual robots, a simple instance of multi-robot systems, to verify our approach. We validate the effectiveness of our method on both simulation benchmarks and real robots covering three different bimanual embodiments. In particular, our proposed ET-VLA outperforms OpenVLA on six real-world tasks by over 53.2%. We will open-source all code to support the community in advancing VLA models for robot learning.
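To make these two components concrete, below is a minimal Python sketch of how they could be realized. Every name here (SubTaskNode, build_egot, make_scp_sample, ACTION_TOKENS_PER_ARM) and the token counts are hypothetical illustrations for a bimanual setup, not the released ET-VLA API.

# A minimal sketch of the two ET-VLA components described above, assuming a
# bimanual setup. All names and token-count constants are hypothetical
# illustrations, not the released API.
from dataclasses import dataclass, field
from typing import List
import random

@dataclass
class SubTaskNode:
    """Embodied Graph-of-Thought: each sub-task is a node tagged with the
    embodiment (here: left or right arm) responsible for executing it."""
    description: str            # natural-language sub-task, e.g. "open the bag"
    embodiment: str             # which arm/robot carries out this sub-task
    depends_on: List["SubTaskNode"] = field(default_factory=list)

def build_egot() -> List[SubTaskNode]:
    """Decompose 'take the tennis ball out of the bag' into ordered,
    embodiment-tagged nodes."""
    open_bag = SubTaskNode("open the bag", embodiment="left_arm")
    grab_ball = SubTaskNode("grab the tennis ball", embodiment="right_arm",
                            depends_on=[open_bag])
    return [open_bag, grab_ball]

# Synthetic Continued Pretraining (SCP): generate synthetic warm-up samples so
# the model learns to emit the correct number of action tokens for the new
# embodiment (e.g. 14 joint tokens for a bimanual robot vs. 7 for one arm).
ACTION_TOKENS_PER_ARM = 7   # assumed per-arm action dimensionality

def make_scp_sample(num_arms: int = 2) -> dict:
    """One synthetic (instruction, action-token) pair; no human demos needed."""
    tokens = [random.randint(0, 255)
              for _ in range(num_arms * ACTION_TOKENS_PER_ARM)]
    return {"instruction": "move both arms to a sampled pose",
            "action_tokens": tokens}

if __name__ == "__main__":
    for node in build_egot():
        deps = [d.description for d in node.depends_on]
        print(f"{node.embodiment}: {node.description} (after {deps})")
    print(make_scp_sample()["action_tokens"])

In this sketch, the dependency edges encode the execution order that a plain autoregressive VLA can violate, as in the bag-opening example under Recovery Behaviour below.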

Workspace Setup & Task Settings

Real-World Experiments

ET-VLA:
Pick up the bread and place it into the blue plate
OpenVLA:
Pick up the bread and place it into the blue plate
ET-VLA:
Pick up the banana and mango and place them into the plate
OpenVLA:
Pick up the banana and mango and place them into the plate
ET-VLA:
Pick up the red plate and wipe it with the sponge
OpenVLA:
Pick up the red plate and wipe it with the sponge
ET-VLA:
Pick up the string, straighten it, and then put it down
OpenVLA:
Pick up the string, straighten it, and then put it down
ET-VLA:
Pick up the red plate and insert it into the drying rack
OpenVLA:
Pick up the red plate and insert it into the drying rack
ET-VLA:
Building blocks
OpenVLA:
Building blocks

Recovery Behaviour

ET-VLA:
    ET-VLA handles complex task sequences and demonstrates a certain level of robustness. When we intentionally remove the column from the left arm and place it back on the table, we observe the robot hesitating before attempting to pick up the pink column again.
OpenVLA:
    However, OpenVLA struggles to complete dual-arm tasks, failing to follow the correct sequence of steps. As illustrated in the video, the fine-tuned OpenVLA skips the initial step of opening the bag and directly grabs the tennis ball.

Experiments Results

For the real-world experimental results, we report the performance of the final checkpoint to avoid selective reporting.
