Embodiment Transfer Learning for Vision-Language-Action Models

Abstract

Vision-language-action (VLA) models have significantly advanced robotic learning, enabling training on large-scale, cross-embodiment data and fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs struggle with multi-robot collaboration. We introduce embodiment transfer learning, denoted ET-VLA, a novel framework for efficiently and effectively transferring pre-trained VLAs to multi-robot settings. The core of ET-VLA is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and to emit the precise number of action tokens. Following SCP, the model is fine-tuned on target-embodiment data. To further improve performance on multi-embodiment tasks, we present the Embodied Graph-of-Thought technique, a novel approach that formulates each sub-task as a node and allows the VLA model to distinguish the functionality and role of each embodiment during task execution. Our work considers bimanual robots, a simple instance of the multi-robot setting, to verify our approach. We validate the effectiveness of our method on both simulation benchmarks and real robots covering three different bimanual embodiments. In particular, ET-VLA outperforms OpenVLA by over 53.2% on six real-world tasks. We will open-source all code to support the community in advancing VLA models for robot learning.
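To make the idea of treating each sub-task as a node more concrete, here is a minimal Python sketch of a sub-task graph for a bimanual robot. It is an illustration under stated assumptions, not the ET-VLA implementation: the names SubTask, build_plan, and describe are hypothetical, and the assignment of sub-tasks to the left or right arm is invented for the example. It only shows how nodes tagged with an embodiment and linked by dependencies can be ordered into a valid execution sequence.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class SubTask:
    """One node of the (hypothetical) sub-task graph."""
    name: str
    arm: str          # which embodiment acts: "left" or "right" (assumed labels)
    instruction: str  # natural-language description of the sub-task


def build_plan(subtasks, depends_on):
    """Order sub-tasks so that every node comes after its prerequisites.

    depends_on maps a sub-task name to the names it must wait for.
    """
    by_name = {s.name: s for s in subtasks}
    graph = {name: set(depends_on.get(name, ())) for name in by_name}
    order = TopologicalSorter(graph).static_order()
    return [by_name[name] for name in order]


def describe(plan):
    """Render the ordered plan as text, one line per sub-task."""
    return [f"{i}. [{s.arm}] {s.instruction}" for i, s in enumerate(plan, start=1)]


# Hypothetical decomposition of "Pick up the red plate and wipe it with sponge".
subtasks = [
    SubTask("grasp_plate", "left", "pick up the red plate"),
    SubTask("grasp_sponge", "right", "pick up the sponge"),
    SubTask("wipe_plate", "right", "wipe the plate with the sponge"),
]
depends_on = {"wipe_plate": ["grasp_plate", "grasp_sponge"]}

plan = build_plan(subtasks, depends_on)
for line in describe(plan):
    print(line)
```

The two grasping nodes have no dependency on each other, so either may come first; the wiping node always comes last because it depends on both. Tagging every node with an arm is what lets a downstream policy tell the roles of the two embodiments apart.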

Real-World Experiments

ET-VLA:
Pick up the bread and place it into blue plate

✅

OpenVLA:
Pick up the bread and place it into blue plate

❌

ET-VLA:
Pick up banana and mango and place them into plate

✅

OpenVLA:
Pick up banana and mango and place them into plate

❌

ET-VLA:
Pick up the red plate and wipe it with sponge

✅

OpenVLA:
Pick up the red plate and wipe it with sponge

❌

ET-VLA:
Pick up the string straighten it and then put it down

✅

OpenVLA:
Pick up the string straighten it and then put it down

❌

ET-VLA:
Pick up the red plate and insert it into drying rack

✅

OpenVLA:
Pick up the red plate and insert it into drying rack

❌

ET-VLA:
Building blocks

✅

OpenVLA:
Building blocks

❌

Recovery Behaviour

ET-VLA:
ET-VLA is capable of handling complex task sequences and demonstrates a certain level of robustness.
When we intentionally remove the column from the left arm and place it back on the table, the robot hesitates briefly and then attempts to pick up the pink column again.

OpenVLA:
In contrast, OpenVLA struggles to complete dual-arm tasks and fails to follow the correct sequence of steps. As illustrated in the video, the fine-tuned OpenVLA skips the initial step of opening the bag and directly grabs the tennis ball.


