We introduce Mini-Diffuser, a method for training multi-task robot policies that perform a variety of tasks from vision and language input, while training significantly faster and using far less memory than previous approaches. The key insight comes from comparing how diffusion models are used in different domains. In image generation, diffusion models refine high-dimensional pixel data. In contrast, robot actions are much simpler, typically involving only 3D positions, rotations, and gripper states. However, the conditions—such as images and language instructions—remain high-dimensional. Mini-Diffuser takes advantage of this asymmetry. Instead of generating one action per input, it generates multiple action samples for the same vision-language input. Because the costly condition encoding is computed once and shared across all samples, the model trains over 20× more efficiently at minimal extra cost. To support this strategy, we introduce lightweight architectural changes that prevent interference between samples during training. Mini-Diffuser offers a simple, fast, and effective recipe for training generalist robot policies at scale.
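The sketch below illustrates this two-level batching idea in PyTorch-style pseudocode. The module names (`cond_encoder`, `denoiser`) and the `scheduler.add_noise` interface (in the style of HuggingFace diffusers) are our assumptions for illustration, not the exact API of the method.

```python
import torch
import torch.nn.functional as F

def two_level_diffusion_loss(cond_encoder, denoiser, scheduler,
                             obs, lang, actions, M=16):
    """Level-1 batch: B (observation, language, action) tuples.
    Level-2 batch: M independently noised copies of each action,
    all reusing the same expensive vision-language encoding."""
    B = actions.shape[0]
    cond = cond_encoder(obs, lang)            # heavy encoding, done once: (B, C)
    cond = cond.repeat_interleave(M, dim=0)   # shared across M samples: (B*M, C)
    a0 = actions.repeat_interleave(M, dim=0)  # clean actions: (B*M, A)

    # Independent timestep and noise per level-2 sample.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (B * M,), device=a0.device)
    noise = torch.randn_like(a0)
    a_t = scheduler.add_noise(a0, noise, t)

    pred = denoiser(a_t, t, cond)             # lightweight action denoiser
    return F.mse_loss(pred, noise)            # standard epsilon-prediction loss
```

Since the denoiser operates on low-dimensional actions, evaluating it M times per state adds little compute compared with re-encoding the observation M times.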
LEFT: During the training phase, B sampled states form a Level-1 batch; for each state, M noisy actions are sampled independently under the same state conditions, forming a Level-2 batch. Tokens are flattened and fed into a multi-layer model containing masked-attention modules, local-query modules, and FiLM layers. RIGHT: During the inference phase, denoising is applied only to the end-effector position. Rotation and gripper state are predicted separately via classification heads conditioned on the final denoised position.
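To illustrate how interference between Level-2 samples can be prevented, here is a hedged sketch of a block-diagonal attention mask: action tokens from different noise samples attend only within their own sample, while every token attends to the shared condition tokens. The token layout and function name are our assumptions.

```python
import torch

def level2_attention_mask(num_cond: int, M: int, act_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend) over the flattened sequence
    [cond tokens | sample-0 action tokens | ... | sample-(M-1) action tokens]."""
    L = num_cond + M * act_len
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:, :num_cond] = True                      # every token sees the shared conditions
    for m in range(M):
        s = num_cond + m * act_len
        mask[s:s + act_len, s:s + act_len] = True  # within-sample attention only
    return mask

# Usage: pass as attn_mask to PyTorch scaled-dot-product attention,
# where True marks positions that are allowed to attend:
# out = F.scaled_dot_product_attention(q, k, v,
#           attn_mask=level2_attention_mask(num_cond=64, M=8, act_len=4))
```

With this mask, the M noise samples stay statistically independent while sharing a single set of condition keys and values.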
We evaluate on a multi-task setup of 18 manipulation tasks in RLBench. All models use 4 camera views and 100 expert demonstrations per task. Mini-Diffuser takes by far the least time and memory to train, while maintaining 95% of the performance of the current state-of-the-art diffusion-based model.
We train a multi-task Mini-Diffuser on 10 real-world manipulation tasks to control a Franka Emika arm. All models use a single camera view and 10 demonstrations per task. A single Mini-Diffuser policy is able to (1) solve multimodal tasks in the real world and (2) solve different tasks given different language instructions.