We introduce Mini-Diffuser, a method for training multi-task robot policies that perform a variety of tasks from vision and language input, while training significantly faster and using far less memory than previous approaches. The key insight comes from comparing how diffusion models are used in different domains. In image generation, diffusion models refine high-dimensional pixel data. In contrast, robot actions are much simpler, typically involving only 3D positions, rotations, and gripper states. However, the conditions—such as images and language instructions—remain high-dimensional. Mini-Diffuser takes advantage of this asymmetry. Instead of generating one action per input, it generates multiple action samples for the same vision-language input. Because the costly condition encoding is computed once and shared across all samples, the model trains over 20× more efficiently at minimal extra cost. To support this strategy, we introduce lightweight architectural changes that prevent interference between samples during training. Mini-Diffuser offers a simple, fast, and effective recipe for training generalist robot policies at scale.
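The sketch below illustrates this two-level batching idea in PyTorch-style pseudocode. The module names (`cond_encoder`, `denoiser`) and the `scheduler.add_noise` interface (in the style of HuggingFace diffusers) are our assumptions for illustration, not the exact API of the method.

```python
import torch
import torch.nn.functional as F

def two_level_diffusion_loss(cond_encoder, denoiser, scheduler,
                             obs, lang, actions, M=16):
    """Level-1 batch: B (observation, language, action) tuples.
    Level-2 batch: M independently noised copies of each action,
    all reusing the same expensive vision-language encoding."""
    B = actions.shape[0]
    cond = cond_encoder(obs, lang)            # heavy encoding, done once: (B, C)
    cond = cond.repeat_interleave(M, dim=0)   # shared across M samples: (B*M, C)
    a0 = actions.repeat_interleave(M, dim=0)  # clean actions: (B*M, A)

    # Independent timestep and noise per level-2 sample.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (B * M,), device=a0.device)
    noise = torch.randn_like(a0)
    a_t = scheduler.add_noise(a0, noise, t)

    pred = denoiser(a_t, t, cond)             # lightweight action denoiser
    return F.mse_loss(pred, noise)            # standard epsilon-prediction loss
```

Since the denoiser operates on low-dimensional actions, evaluating it M times per state adds little compute compared with re-encoding the observation M times.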
LEFT: During the training phase, B sampled states form a Level-1 batch; for each state, M noisy actions are sampled independently under the same state conditions, forming a Level-2 batch. Tokens are flattened and fed into a multi-layer model containing masked-attention modules, local-query modules, and FiLM layers. RIGHT: During the inference phase, denoising is applied only to the end-effector position. Rotation and gripper state are predicted separately via classification heads conditioned on the final denoised position.
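To illustrate how interference between Level-2 samples can be prevented, here is a hedged sketch of a block-diagonal attention mask: action tokens from different noise samples attend only within their own sample, while every token attends to the shared condition tokens. The token layout and function name are our assumptions.

```python
import torch

def level2_attention_mask(num_cond: int, M: int, act_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend) over the flattened sequence
    [cond tokens | sample-0 action tokens | ... | sample-(M-1) action tokens]."""
    L = num_cond + M * act_len
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:, :num_cond] = True                      # every token sees the shared conditions
    for m in range(M):
        s = num_cond + m * act_len
        mask[s:s + act_len, s:s + act_len] = True  # within-sample attention only
    return mask

# Usage: pass as attn_mask to PyTorch scaled-dot-product attention,
# where True marks positions that are allowed to attend:
# out = F.scaled_dot_product_attention(q, k, v,
#           attn_mask=level2_attention_mask(num_cond=64, M=8, act_len=4))
```

With this mask, the M noise samples stay statistically independent while sharing a single set of condition keys and values.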
We evaluate on a multi-task setup of 18 manipulation tasks in RLBench. All models use 4 camera views and 100 expert demonstrations per task. Mini-Diffuser takes by far the least time and memory to train, while maintaining 95% of the performance of the current state-of-the-art diffusion-based model.
We train a multi-task Mini-Diffuser on 10 real-world manipulation tasks to control a Franka Emika arm. All models use a single camera view and 10 demonstrations per task. A single Mini-Diffuser policy is able to (1) solve multimodal tasks in the real world and (2) solve different tasks given different language instructions.