RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

Figure: RoboInter overview, spanning the data, VQA, and VLA pipeline.

Abstract

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, traces) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 8 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. Overall, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.

Overview

The RoboInter suite contains four components: RoboInter-Data, RoboInter-VQA, RoboInter-VLA, and RoboInter-Tool.

RoboInter-Data

Large-Scale Annotated Manipulation Dataset

A large-scale dataset built on DROID and RH20T, providing dense per-frame annotations across 10+ categories of intermediate representations. It substantially exceeds prior work in both scale and annotation quality.

Download Dataset
230K+ episodes · 571 diverse scenes · 10+ annotation types

Segmentation

Semantic segmentation of target objects and scene elements with bounding box localization

Gripper Detection

Robot gripper state and position tracking throughout manipulation

Contact Points

Fine-grained contact locations between gripper and objects

Trajectory Prediction

Future motion trajectory of the end-effector

Affordance Regions

Object regions suitable for grasping and manipulation

Placement Proposals

Suggested target locations for object placement

Grasp Pose

Predicted grasp position and orientation for object manipulation

Dense Language

Dense per-frame language descriptions of manipulation actions and states
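
To make these annotation categories concrete, below is a minimal sketch of what a single per-frame record could look like in Python. The field names and values are illustrative assumptions, not the released RoboInter-Data schema.

    # Hypothetical per-frame annotation record (field names are illustrative,
    # not the released RoboInter-Data schema).
    frame_annotation = {
        "episode_id": "droid_000123",
        "frame_index": 42,
        "segmentation": {"object_mask": "mask_0042.png",
                         "bbox": [110, 64, 212, 180]},       # x1, y1, x2, y2
        "gripper": {"bbox": [300, 120, 360, 190], "state": "open"},
        "contact_points": [[215, 178]],                       # (u, v) gripper-object contacts
        "trajectory": [[310, 150], [280, 160], [240, 172]],   # future end-effector path in image space
        "affordance_region": [180, 140, 260, 200],            # graspable region bbox
        "placement_proposal": [400, 220, 470, 290],           # suggested placement bbox
        "grasp_pose": {"translation": [0.42, -0.08, 0.15],    # meters, robot base frame
                       "rotation_quat": [0.0, 0.0, 0.0, 1.0]},
        "dense_caption": "The gripper approaches the red mug from the left.",
    }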


RoboInter-VQA

Embodied Visual Question Answering Benchmark

A systematic benchmark introducing 8 spatial and 20 temporal embodied VQA categories to evaluate and enhance the embodied reasoning capabilities of vision-language models in robotic manipulation scenarios. It covers understanding, generation, and planning tasks.

Download Benchmark
8 spatial QA types · 20 temporal QA types · 2.3M training samples

Understanding

Grounding choice, trajectory choice, trajectory direction, trajectory-language matching, contact decision, grasp pose choice

Generation

Object box, gripper box, contact box, final box, contact point, trajectory prediction

Task Planning

Task planning with choice, decision, and free-form planning formats
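
As a rough illustration of how these three task families could be instantiated, the sketch below shows one hypothetical sample per family. The question wording and field names are assumptions, not the benchmark's actual format.

    # Hypothetical RoboInter-VQA samples, one per task family
    # (question wording and field names are illustrative assumptions).
    understanding_sample = {
        "type": "trajectory_choice",
        "image": "frame_0042.jpg",
        "question": "Which candidate trajectory moves the mug onto the plate?",
        "choices": ["A", "B", "C", "D"],
        "answer": "B",
    }
    generation_sample = {
        "type": "contact_point",
        "image": "frame_0042.jpg",
        "question": "Predict the pixel where the gripper should contact the mug.",
        "answer": [215, 178],  # (u, v) in image coordinates
    }
    planning_sample = {
        "type": "free_form_planning",
        "image": "frame_0000.jpg",
        "question": "Plan the subtasks needed to stack the cups.",
        "answer": ["pick up the small cup", "place it inside the large cup"],
    }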

RoboInter-VLA

Plan-then-Execute VLA Framework

An integrated vision-language-action framework that bridges high-level planning with low-level execution via intermediate supervision. Supports both modular and end-to-end VLA variants with a planner, intermediate representation, and executor pipeline.

View Code
6 task types · in-distribution (ID) and out-of-distribution (OOD) generalization · real-robot evaluation
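
A minimal sketch of the plan-then-execute loop described above, assuming a planner that emits an intermediate representation (e.g., a subtask and a trace) and an executor that conditions on it. All class and method names here are hypothetical, not the released RoboInter-VLA API.

    # Hypothetical plan-then-execute loop; class and method names are
    # illustrative, not the released RoboInter-VLA API.
    class PlanThenExecuteAgent:
        def __init__(self, planner, executor):
            self.planner = planner    # high-level VLM: (image, instruction) -> plan
            self.executor = executor  # low-level policy: (image, plan) -> action

        def step(self, image, instruction):
            # Intermediate representation, e.g. a subtask string, a 2D trace,
            # contact points, or a grasp pose.
            plan = self.planner.plan(image, instruction)
            # Action conditioned on the intermediate representation.
            return self.executor.act(image, plan)

    def rollout(agent, env, instruction, max_steps=200):
        # Closed-loop execution: re-plan and act at every control step.
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.step(obs["image"], instruction)
            obs, done = env.step(action)
            if done:
                break

Under this framing, a modular variant would train the planner and executor separately, while an end-to-end variant would learn both jointly under intermediate supervision.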

Object Collecting

Collecting scattered objects into target containers with ID and OOD generalization

Cup Stacking

Stacking cups in correct order with OOD object and position testing

Towel Folding

Folding towels with generalization to novel objects and backgrounds

Clutter Cleaning

Cleaning cluttered scenes while a human disturbs the scene during execution

Tool Inserting

Precise tool insertion requiring fine-grained manipulation control

Object Sorting

Sorting objects by color with novel object and sequence generalization

RoboInter-Tool

Semi-Automatic Annotation Tool

A lightweight GUI tool that enables semi-automatic annotation of diverse intermediate representations. It features asynchronous checking and both online and offline data infrastructure, and supports object/gripper annotation, subtask decomposition, and keyframe labeling.

View Code
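
As a rough sketch of how such a semi-automatic, asynchronous workflow typically operates, the loop below has a model propose annotations that humans verify on their own schedule. All names are hypothetical; this is not the RoboInter-Tool implementation.

    # Hypothetical semi-automatic annotation loop (all names are illustrative;
    # this is not the RoboInter-Tool implementation).
    import queue

    review_queue = queue.Queue()  # proposals wait here for asynchronous human checking

    def human_review(frame, proposal):
        # Stand-in for the GUI's accept/edit/reject step: a real annotator
        # could return an edited proposal, or None to reject it.
        return proposal

    def annotate_episode(frames, proposer, review_queue):
        # The model proposes annotations for every frame without blocking on a human.
        for frame in frames:
            proposal = proposer.predict(frame)  # e.g., object/gripper boxes
            review_queue.put((frame, proposal))

    def apply_reviews(review_queue, dataset):
        # Asynchronous checking: humans drain the queue on their own schedule.
        while not review_queue.empty():
            frame, proposal = review_queue.get()
            verified = human_review(frame, proposal)
            if verified is not None:
                dataset.add(frame, verified)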

Citation

@article{robointer2025,
  title={RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation},
  author={Anonymous},
  journal={arXiv preprint},
  year={2025}
}