Mini Diffuser: Fast Multi-task Diffusion Policy Training Using Two-level Mini-batches

Yutong Hu1, Pinhao Song1, Kehan Wen2, Renaud Detry1
1KU Leuven 2ETH Zurich

Abstract

We introduce Mini-Diffuser, a method for training multi-task robot policies that can perform a variety of tasks using vision and language as input—while training significantly faster and using far less memory than previous approaches. The key insight comes from comparing how diffusion models are used in different domains. In image generation, diffusion models refine high-dimensional pixel data. In contrast, robot actions are much simpler, typically involving only 3D positions, rotations, and gripper states. However, the conditions—such as images and language instructions—remain high-dimensional. Mini-Diffuser takes advantage of this asymmetry. Instead of generating one action per input, it generates multiple action samples for the same vision-language input. This allows the model to train over 20× more efficiently with minimal extra cost. To support this strategy, we introduce lightweight architectural changes that prevent interference between samples during training. Mini-Diffuser offers a simple, fast, and effective recipe for training generalist robot policies at scale.
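The asymmetry described above can be expressed in a few lines of PyTorch. Below is a minimal sketch of one training step with two-level mini-batches, not the paper's implementation: the interfaces of encoder and denoiser, the tensor shapes, and the value M=16 are illustrative assumptions. The point is that the costly vision-language encoding runs once per state, while many independently noised action copies reuse it.

    import torch
    import torch.nn.functional as F

    def cosine_alpha_bar(t, T=1000, s=0.008):
        # Standard cosine noise schedule; any DDPM-style schedule works here.
        f = torch.cos(((t / T) + s) / (1 + s) * torch.pi / 2) ** 2
        f0 = torch.cos(torch.tensor(s / (1 + s) * torch.pi / 2)) ** 2
        return f / f0

    def two_level_training_step(encoder, denoiser, states, actions, M=16, T=1000):
        # Level-1 batch: run the costly vision-language encoder once per state.
        cond = encoder(states)                          # (B, C)
        B = actions.shape[0]

        # Level-2 batch: M independently noised copies of each action,
        # all sharing the same state condition.
        a0 = actions.unsqueeze(1).expand(B, M, -1)      # (B, M, D) clean actions
        t = torch.randint(0, T, (B, M))                 # independent timesteps
        eps = torch.randn_like(a0)                      # independent noise
        ab = cosine_alpha_bar(t, T).unsqueeze(-1)       # (B, M, 1)
        a_t = ab.sqrt() * a0 + (1 - ab).sqrt() * eps    # noised actions

        # Broadcast the condition instead of recomputing it: B encoder passes
        # now supervise B * M denoising targets.
        pred = denoiser(a_t, t, cond.unsqueeze(1).expand(B, M, -1))
        return F.mse_loss(pred, eps)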


Highlights


Model Pipeline


LEFT: During training, B sampled states form a level-1 batch; for each state, M noisy actions are sampled independently under the same vision-language condition, forming a level-2 batch. Tokens are flattened and fed into a multi-layer model containing masked attention modules, local query modules, and FiLM layers. RIGHT: During inference, denoising is applied only to the end-effector position; rotation and gripper state are predicted separately by classification heads conditioned on the final denoised position.
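The masked attention mentioned in the caption must keep the M action samples of a level-2 batch from attending to one another, since they are alternative hypotheses for the same state rather than one joint sequence. Here is a minimal sketch of such a block mask; the token ordering (shared condition tokens first, then per-sample action tokens) is an assumption for illustration.

    import torch

    def level2_attention_mask(n_cond: int, n_act: int, M: int) -> torch.Tensor:
        # Boolean mask (True = may attend) over the flattened token sequence
        # [condition tokens | sample-1 action tokens | ... | sample-M action tokens].
        L = n_cond + M * n_act
        mask = torch.zeros(L, L, dtype=torch.bool)
        # Condition tokens attend among themselves; they are shared by all samples.
        mask[:n_cond, :n_cond] = True
        for m in range(M):
            lo = n_cond + m * n_act
            hi = lo + n_act
            # Each sample's action tokens see the condition and their own sample,
            # but never the other M - 1 samples, so samples cannot interfere.
            mask[lo:hi, :n_cond] = True
            mask[lo:hi, lo:hi] = True
        return mask

    # Usage with PyTorch attention, e.g.:
    #   out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)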

Simulation on RLBench

We test on a multi-task setup of 18 manipulation tasks in RLBench. All models use 4 camera views and 100 expert demonstrations per task. Mini-Diffuser takes by far the least training time and memory, while maintaining 95% of the performance of the current state-of-the-art diffusion-based model.

[Videos of the 18 RLBench tasks: Close Jar, Drag Block, Hang Cups, Insert Peg, Open Drawer, Place Wine, Put Block in Drawer, Put Cash in Safe, Put in Cupboard, Screw Lightbulb, Slide Block, Sort Shape, Stack Blocks, Stack Cups, Sweep Dust, Take Meat off Grill, Touch Button, Turn Tap]

Mini-Diffuser Actor in the real world

We train a multi-task Mini-Diffuser Actor on 10 manipulation tasks in the real world to control a Franka Emika arm. All models use a single camera view and 10 demonstrations per task. A single Mini-Diffuser Actor is able to (1) solve multimodal tasks in the real world, and (2) solve different tasks given different language instructions.

[Videos of the real-world tasks: Stack Cups, Close Box, Sort Objects, Open Drawer, Put Fruit in Box, Put Grape in Drawer, Insert Plug, Press Stapler]