We introduce PAD, a novel visual policy learning framework that merges image Prediction and robot Action within a unified Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on robotic demonstrations and large-scale video datasets, and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 38.9% relative improvement on the Metaworld benchmark with a single policy in the pixel-input, data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world manipulation, with a 28.0% increase in success rate compared to baselines.
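To make the unified denoising idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: noised future-image latents and noised action tokens are embedded into one shared token sequence, processed by Transformer blocks conditioned on the diffusion timestep, and denoised jointly with per-modality output heads. All class, dimension, and parameter names (`JointDenoiser`, `img_dim`, `act_dim`, etc.) are illustrative assumptions.

```python
# Minimal sketch of joint image/action denoising with a shared Transformer.
# This is an assumption-laden illustration, not PAD's actual architecture.
import torch
import torch.nn as nn


class JointDenoiser(nn.Module):
    def __init__(self, img_dim=256, act_dim=7, hidden=512, n_layers=6, n_heads=8):
        super().__init__()
        # Project each modality into a shared token space.
        self.img_in = nn.Linear(img_dim, hidden)
        self.act_in = nn.Linear(act_dim, hidden)
        # Simple MLP timestep embedding (a stand-in for the usual sinusoidal one).
        self.t_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # Separate heads predict the noise added to each modality.
        self.img_out = nn.Linear(hidden, img_dim)
        self.act_out = nn.Linear(hidden, act_dim)

    def forward(self, noisy_img_tokens, noisy_act_tokens, t):
        # noisy_img_tokens: (B, N_img, img_dim) noised latents of the future image
        # noisy_act_tokens: (B, N_act, act_dim) noised robot actions
        # t: (B,) diffusion timesteps
        tok = torch.cat(
            [self.img_in(noisy_img_tokens), self.act_in(noisy_act_tokens)], dim=1
        )
        # Condition every token on the shared diffusion timestep.
        tok = tok + self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        h = self.blocks(tok)
        n_img = noisy_img_tokens.shape[1]
        return self.img_out(h[:, :n_img]), self.act_out(h[:, n_img:])


# Usage: the two noise predictions can each be trained with a standard
# diffusion objective (e.g., MSE against the true injected noise).
model = JointDenoiser()
img = torch.randn(2, 16, 256)      # 16 image latent tokens per sample
act = torch.randn(2, 4, 7)         # a 4-step chunk of 7-DoF actions
t = torch.randint(0, 1000, (2,))
eps_img, eps_act = model(img, act, t)
```

Because both modalities share the same attention blocks, image prediction and action denoising can inform each other, which is one plausible reading of how a single joint process supports co-training on action-free video data.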