RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

RoboMIND Team


Kun Wu1,∗, Chengkai Hou2,3,∗, Jiaming Liu2,3,∗, Zhengping Che1,∗,†, Xiaozhu Ju1,∗,†, Zhuqin Yang1, Meng Li1, Yinuo Zhao1, Zhiyuan Xu1, Guang Yang1, Zhen Zhao1, Guangyu Li1, Zhao Jin1, Lecheng Wang1, Jilei Mao1, Xinhua Wang1, Shichao Fan1, Ning Liu1, Pei Ren1, Qiang Zhang1, Yaoxu Lv2, Mengzhen Liu2,3, Jingyang He2,3, Yulin Luo2,3, Zeyu Gao3, Chenxuan Li2, Chenyang Gu2,3, Yankai Fu2,
Di Wu2, Xingyu Wang2, Sixiang Chen2,3, Zhenyu Wang2, Pengju An2,3, Siyuan Qian2,3,
Shanghang Zhang2,3, Jian Tang1
*Co-first Authors, Corresponding Authors, Project Leaders
1Beijing Innovation Center of Humanoid Robotics,

2State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University,

3Beijing Academy of Artificial Intelligence


We introduce RoboMIND, a benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation, comprising 55k real-world demonstration trajectories across 4 embodiments, 279 diverse tasks and 61 distinct object classes.

Abstract

In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot manipulation), featuring 55k real-world demonstration trajectories across 279 diverse tasks involving 61 different object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robot-related information, including multi-view RGB-D images, proprioceptive robot state information, end effector details, and linguistic task descriptions. We provide a thorough quantitative and qualitative analysis of RoboMIND across multiple dimensions, offering detailed insights into the diversity of the dataset. In our experiments, we conduct extensive real-world testing with four state-of-the-art imitation learning methods, demonstrating that training with RoboMIND data results in a high manipulation success rate and strong generalization.
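To make the listed modalities concrete, the following is a minimal sketch of how one demonstration could be held in memory. The class names (`Frame`, `Trajectory`), the field names, the 7-value state vector, and the instruction string are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Frame:
    """One time step of a demonstration (hypothetical layout)."""
    rgb: Dict[str, Any]          # camera name -> RGB image (e.g. an H x W x 3 array)
    depth: Dict[str, Any]        # camera name -> depth map (e.g. an H x W array)
    robot_state: List[float]     # proprioceptive robot state
    end_effector: List[float]    # end-effector details, e.g. gripper opening


@dataclass
class Trajectory:
    """One teleoperated demonstration with its language description."""
    task: str                    # task name, e.g. "FR-PlacePearBowl"
    instruction: str             # linguistic task description
    frames: List[Frame] = field(default_factory=list)

    def cameras(self) -> List[str]:
        """Sorted camera names that appear in any frame."""
        names = set()
        for frame in self.frames:
            names.update(frame.rgb)
        return sorted(names)


# Placeholder trajectory: images are None and all numbers are dummies.
frame = Frame(
    rgb={"top": None, "left": None, "right": None},
    depth={"top": None, "left": None, "right": None},
    robot_state=[0.0] * 7,       # assumed 7 values, for illustration only
    end_effector=[1.0],
)
demo = Trajectory(
    task="FR-PlacePearBowl",
    instruction="place the pear in the bowl",  # made-up wording
    frames=[frame, frame],
)
print(len(demo.frames), demo.cameras())
```

Grouping the RGB and depth images by camera name keeps the same record type usable for embodiments with different numbers of viewpoints.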





Hardware Setup

For the Franka Emika Panda robots, we use cameras positioned at the top, left, and right viewpoints to record the visual information of the task trajectories. For the AgileX/Tien Kung robots, we use their built-in cameras to record visual information. For UR robots, we use an external top camera. All demonstrations are collected using high-quality human teleoperation and stored on a unified intelligence platform.
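The camera arrangement above can be written down as a small lookup table. This is only a sketch: the dictionary keys, the view labels, and the `views_for` helper are hypothetical, and the single `"built_in"` label stands in for the built-in cameras, whose number is not stated here.

```python
from typing import Dict, List

# Camera views per embodiment, following the hardware description.
# Keys and view labels are hypothetical identifiers.
CAMERA_VIEWS: Dict[str, List[str]] = {
    "franka": ["top", "left", "right"],  # external cameras at three viewpoints
    "agilex": ["built_in"],              # built-in cameras
    "tienkung": ["built_in"],            # built-in cameras
    "ur": ["top"],                       # one external top camera
}


def views_for(embodiment: str) -> List[str]:
    """Return the camera views recorded for an embodiment name."""
    key = embodiment.lower().replace(" ", "")
    if key not in CAMERA_VIEWS:
        raise KeyError(f"unknown embodiment: {embodiment!r}")
    return list(CAMERA_VIEWS[key])  # copy, so callers cannot mutate the table


print(views_for("Franka"))
print(views_for("Tien Kung"))
```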



RoboMIND Data Analysis

Dataset Overview. (a) total trajectories categorized by embodiments, (b) trajectory lengths by embodiments, (c) total trajectories grouped by task categories, and (d) total trajectories based on object usage scenarios.



Distribution of objects in RoboMIND, covering most daily life settings: domestic, industrial, kitchen, office, and retail.



Left: A histogram of skill counts across tasks for four embodiments. AgileX tasks typically involve two or three combined skills, extending the task horizon. Meanwhile, Tien Kung tasks vary in length, with some comprising up to five skills per task. Right: We visualize the AX-PutCarrot task with the AgileX robot, which involves three different skills.
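A histogram like the one described can be computed from per-task skill counts. The sketch below assumes the embodiment can be read from the task-name prefix used on this page (FR, TK, AX, UR). The function names are hypothetical, and apart from AX-PutCarrot (three skills, as stated above) the example tasks and counts are made-up placeholders.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Task-name prefixes as they appear in the task names on this page.
PREFIX_TO_EMBODIMENT = {
    "FR": "Franka",
    "TK": "Tien Kung",
    "AX": "AgileX",
    "UR": "UR",
}


def embodiment_of(task: str) -> str:
    """Read the embodiment from a task name such as 'AX-PutCarrot'."""
    prefix = task.split("-", 1)[0]
    return PREFIX_TO_EMBODIMENT[prefix]


def skill_histogram(tasks: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Map each embodiment to a Counter {skills per task: number of tasks}."""
    hist: Dict[str, Counter] = defaultdict(Counter)
    for task, n_skills in tasks:
        hist[embodiment_of(task)][n_skills] += 1
    return dict(hist)


# Only the AX-PutCarrot count comes from the text; the rest are placeholders.
example = [
    ("AX-PutCarrot", 3),
    ("AX-ExampleTaskA", 2),
    ("TK-ExampleTaskB", 5),
    ("TK-ExampleTaskC", 1),
    ("FR-ExampleTaskD", 1),
]
hist = skill_histogram(example)
for name, counts in sorted(hist.items()):
    print(name, dict(sorted(counts.items())))
```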



Language Description Annotation. We provide refined linguistic annotations for 10,000 successful robot motion trajectories.



Visualization of failed data collection cases. We present two examples of failure from Franka and AgileX. In the FR-PlacePlateInPlateRack task (second row), the Franka arm fails to align with the slot, causing the plate to slip due to operator interference. In the AX-PutCarrot task (fourth row), the AgileX gripper unexpectedly opens, dropping the carrot. These failure cases were filtered out during quality inspection to maintain dataset quality.

Experiments

We conduct comprehensive experiments with four popular imitation learning methods (ACT, BAKU, RDT-1B, and OpenVLA) on selected RoboMIND tasks to assess their performance and limitations.
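Real-world evaluations of this kind are usually summarized as a success rate per method and task. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `success_rates` function, the trial tuple format, and the `method_a`/`task_x` entries are hypothetical, and the outcomes are placeholders rather than reported results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Trial = Tuple[str, str, bool]  # (method, task, succeeded)


def success_rates(trials: Iterable[Trial]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of successful trials per (method, task) pair."""
    counts = defaultdict(lambda: [0, 0])  # (method, task) -> [successes, total]
    for method, task, succeeded in trials:
        counts[(method, task)][0] += int(succeeded)
        counts[(method, task)][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


# Placeholder trials; these are not results from the paper.
trials = [
    ("method_a", "task_x", True),
    ("method_a", "task_x", False),
    ("method_b", "task_x", True),
    ("method_b", "task_x", True),
]
rates = success_rates(trials)
for (method, task), rate in sorted(rates.items()):
    print(f"{method} on {task}: {rate:.0%}")
```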



Success Examples of ACT on Single Tasks

FR-PlacePearBowl

FR-SideCloseDrawer

FR-PlaceBluePink

TK-OpenTrashBin

TK-CloseTrashBin

TK-OpenDrawerLowerCabinet

AX-AppleYellowPlate

AX-CarrotGreenPlate

AX-UnpackBowl

UR-CloseTopWhiteDrawer

UR-PickRoundBread

AX-PackPlate



Success Examples of RDT-1B Finetuned on Multi-task Setting

AX-AppleBluePlate

AX-PackBowl

AX-TakePotato


Success Examples of OpenVLA Finetuned on Multi-task Setting

FR-OpenCapLid

FR-PickStrawberryInBowl

FR-SlideCloseDrawer


For more details on the data analysis and experimental results, please refer to our paper. For technical questions or any other inquiries, please open an issue on the GitHub repo.

BibTeX

@article{wu2024robomindbenchmarkmultiembodimentintelligence,
        title={RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation},
        author={Kun Wu and Chengkai Hou and Jiaming Liu and Zhengping Che and Xiaozhu Ju and Zhuqin Yang and Meng Li and Yinuo Zhao and Zhiyuan Xu and Guang Yang and Zhen Zhao and Guangyu Li and Zhao Jin and Lecheng Wang and Jilei Mao and Xinhua Wang and Shichao Fan and Ning Liu and Pei Ren and Qiang Zhang and Yaoxu Lyu and Mengzhen Liu and Jingyang He and Yulin Luo and Zeyu Gao and Chenxuan Li and Chenyang Gu and Yankai Fu and Di Wu and Xingyu Wang and Sixiang Chen and Zhenyu Wang and Pengju An and Siyuan Qian and Shanghang Zhang and Jian Tang},
        journal={arXiv preprint arXiv:2412.13877},
        year={2024}
}