We introduce PAD, a novel visual policy learning framework that merges image Prediction and robot Action within a unified Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on robotic demonstrations and large-scale video datasets, and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 38.9% relative improvement on the Metaworld benchmark with a single policy in the pixel-input, data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world manipulation, with a 28.0% increase in success rate compared to baselines.
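To make the unified denoising idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: noised future-image latents and noised action tokens are embedded into one shared token sequence, processed by Transformer blocks conditioned on the diffusion timestep, and denoised jointly with per-modality output heads. All class, dimension, and parameter names (`JointDenoiser`, `img_dim`, `act_dim`, etc.) are illustrative assumptions.

```python
# Minimal sketch of joint image/action denoising with a shared Transformer.
# This is an assumption-laden illustration, not PAD's actual architecture.
import torch
import torch.nn as nn


class JointDenoiser(nn.Module):
    def __init__(self, img_dim=256, act_dim=7, hidden=512, n_layers=6, n_heads=8):
        super().__init__()
        # Project each modality into a shared token space.
        self.img_in = nn.Linear(img_dim, hidden)
        self.act_in = nn.Linear(act_dim, hidden)
        # Simple MLP timestep embedding (a stand-in for the usual sinusoidal one).
        self.t_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # Separate heads predict the noise added to each modality.
        self.img_out = nn.Linear(hidden, img_dim)
        self.act_out = nn.Linear(hidden, act_dim)

    def forward(self, noisy_img_tokens, noisy_act_tokens, t):
        # noisy_img_tokens: (B, N_img, img_dim) noised latents of the future image
        # noisy_act_tokens: (B, N_act, act_dim) noised robot actions
        # t: (B,) diffusion timesteps
        tok = torch.cat(
            [self.img_in(noisy_img_tokens), self.act_in(noisy_act_tokens)], dim=1
        )
        # Condition every token on the shared diffusion timestep.
        tok = tok + self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        h = self.blocks(tok)
        n_img = noisy_img_tokens.shape[1]
        return self.img_out(h[:, :n_img]), self.act_out(h[:, n_img:])


# Usage: the two noise predictions can each be trained with a standard
# diffusion objective (e.g., MSE against the true injected noise).
model = JointDenoiser()
img = torch.randn(2, 16, 256)      # 16 image latent tokens per sample
act = torch.randn(2, 4, 7)         # a 4-step chunk of 7-DoF actions
t = torch.randint(0, 1000, (2,))
eps_img, eps_act = model(img, act, t)
```

Because both modalities share the same attention blocks, image prediction and action denoising can inform each other, which is one plausible reading of how a single joint process supports co-training on action-free video data.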