Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, in which high-level plans (e.g., subtasks, traces) are first generated and subsequently translated into low-level actions; however, they rely critically on additional intermediate supervision that is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource of data, benchmarks, and models built around intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset of over 230k episodes across 571 diverse scenes with dense per-frame annotations spanning more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building on this foundation, RoboInter-VQA introduces 8 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs, while RoboInter-VLA offers an integrated plan-then-execute framework supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. Altogether, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning through fine-grained and diverse intermediate representations.
The RoboInter suite contains four components: RoboInter-Data, RoboInter-VQA, RoboInter-VLA, and RoboInter-Tool.
A large-scale dataset built on DROID and RH20T, providing dense per-frame annotations across 10+ categories of intermediate representations. It substantially exceeds prior work in both scale and annotation quality.
Semantic segmentation of target objects and scene elements with bounding box localization
Robot gripper state and position tracking throughout manipulation
Fine-grained contact locations between gripper and objects
Future motion trajectory of the end-effector
Object regions suitable for grasping and manipulation
Suggested target locations for object placement
Predicted grasp pose and orientation for object manipulation
Dense per-frame language descriptions of manipulation actions and states
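To make the annotation categories above concrete, the sketch below shows what a single per-frame record might look like. The field names, coordinate conventions, and values are illustrative assumptions, not the actual RoboInter-Data schema.

```python
# Illustrative only: the field names and values below are assumptions, not the
# actual RoboInter-Data schema.
frame_annotation = {
    "episode_id": "droid_000123",
    "frame_index": 42,
    "instruction": "put the red mug into the bin",
    "objects": [                                   # semantic labels with 2D boxes [x1, y1, x2, y2]
        {"label": "red mug", "box": [310, 220, 380, 300]},
        {"label": "bin", "box": [500, 180, 640, 360]},
    ],
    "gripper": {"box": [295, 150, 340, 210], "state": "open"},   # gripper state/position tracking
    "contact_points": [[332, 241], [348, 243]],                  # gripper-object contact locations
    "future_trajectory": [[330, 200], [402, 195], [505, 210]],   # end-effector motion (image plane)
    "affordance_region": [315, 225, 372, 260],                   # object region suitable for grasping
    "placement_target": [540, 250],                              # suggested placement location
    "grasp_pose": {"center": [340, 242], "angle_deg": 15.0, "width_px": 38},
    "caption": "the gripper approaches the red mug from above",  # per-frame language description
}

# Such a record could supervise both planning targets (subtask, trajectory) and
# execution targets (grasp pose, placement).
print(frame_annotation["objects"][0]["label"], frame_annotation["grasp_pose"]["angle_deg"])
```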
A systematic benchmark introducing 8 spatial and 20 temporal embodied VQA categories to evaluate and enhance the reasoning capabilities of vision-language models in robotic manipulation scenarios. Covers understanding, generation, and planning tasks.
Understanding: grounding choice, trajectory choice, trajectory direction, trajectory-language matching, contact decision, grasp pose choice
Generation: object box, gripper box, contact box, final box, contact point, trajectory prediction
Planning: task planning with choice, decision, and free-form formats
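As a rough illustration of how such categories could be posed to a VLM, here are two hypothetical item layouts: one multiple-choice (spatial/understanding) and one prediction-style (temporal/generation). The JSON-like structure, file paths, and coordinates are assumptions, not the released benchmark format.

```python
# Hypothetical RoboInter-VQA items; the layout is an assumption for illustration.
grounding_choice_item = {
    "category": "grounding_choice",               # spatial / understanding task
    "image": "episodes/rh20t_0456/frame_0010.jpg",
    "question": "Which bounding box contains the object the robot should grasp?",
    "options": {"A": [120, 80, 200, 160], "B": [310, 220, 380, 300],
                "C": [500, 180, 640, 360], "D": [20, 40, 90, 110]},
    "answer": "B",
}

trajectory_prediction_item = {
    "category": "trajectory_prediction",          # temporal / generation task
    "image": "episodes/rh20t_0456/frame_0010.jpg",
    "question": "Predict the end-effector trajectory for 'put the red mug into the bin'.",
    "answer": [[330, 200], [402, 195], [505, 210], [540, 250]],  # 2D waypoints in the image plane
}

print(grounding_choice_item["answer"], len(trajectory_prediction_item["answer"]))
```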
An integrated vision-language-action framework that bridges high-level planning with low-level execution via intermediate supervision. Supports both modular and end-to-end VLA variants built on a pipeline of planner, intermediate representation, and executor.
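A minimal sketch of the modular plan-then-execute idea is shown below, assuming a planner that emits a language subtask plus a 2D trajectory as the intermediate representation, and an executor that consumes it. The class names, intermediate-representation fields, and action format are hypothetical, not the actual RoboInter-VLA interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Minimal sketch of a modular plan-then-execute pipeline; names and formats are assumptions.

@dataclass
class IntermediatePlan:
    subtask: str                            # high-level subtask in language
    trajectory: List[Tuple[float, float]]   # predicted 2D end-effector waypoints

class Planner:
    """High-level planner, e.g. a VLM fine-tuned with intermediate supervision."""
    def plan(self, image, instruction: str) -> IntermediatePlan:
        # A real planner would run VLM inference; a fixed plan stands in here.
        return IntermediatePlan(subtask="reach the red mug",
                                trajectory=[(0.32, 0.20), (0.40, 0.19), (0.50, 0.21)])

class Executor:
    """Low-level policy conditioned on the intermediate representation."""
    def act(self, image, plan: IntermediatePlan) -> List[float]:
        # A real executor would predict joint or end-effector deltas; this
        # placeholder action simply heads toward the first waypoint.
        x, y = plan.trajectory[0]
        return [x, y, 0.0, 0.0, 0.0, 0.0, 1.0]   # e.g. 6-DoF delta plus gripper command

planner, executor = Planner(), Executor()
plan = planner.plan(image=None, instruction="put the red mug into the bin")
action = executor.act(image=None, plan=plan)
print(plan.subtask, action)
```

The end-to-end variant would collapse the same structure into a single model trained with the intermediate representations as auxiliary supervision rather than as an explicit hand-off.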
Collecting scattered objects into target containers, with in-distribution (ID) and out-of-distribution (OOD) generalization
Stacking cups in the correct order, with OOD object and position testing
Folding towels with generalization to novel objects and backgrounds
Cleaning cluttered scenes with human-robot interaction disturbance
Precise tool insertion requiring fine-grained manipulation control
Sorting objects by color with novel object and sequence generalization
A lightweight GUI tool that enables semi-automatic annotation of diverse intermediate representations. It features asynchronous checking and both online and offline data infrastructure, and supports object/gripper annotation, subtask decomposition, and keyframe labeling.
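Below is a minimal sketch of the semi-automatic workflow such a tool might follow: an automatic propagator labels most frames and queues low-confidence frames for asynchronous human checking. The propagation heuristic, confidence decay, and threshold are assumptions for illustration, not the tool's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Sketch of semi-automatic annotation with asynchronous checking; heuristics are assumed.

@dataclass
class Annotation:
    frame_index: int
    box: List[int]            # [x1, y1, x2, y2]
    confidence: float
    needs_review: bool = False

def propagate(seed: Annotation, num_frames: int, threshold: float = 0.6) -> List[Annotation]:
    """Propagate a human-seeded box forward and queue uncertain frames for review."""
    annotations, review_queue = [seed], []
    for i in range(seed.frame_index + 1, seed.frame_index + num_frames):
        prev = annotations[-1]
        # Placeholder propagation: reuse the previous box and decay confidence.
        # A real tool would run a tracker or segmentation model here.
        ann = Annotation(frame_index=i, box=list(prev.box),
                         confidence=prev.confidence * 0.95)
        if ann.confidence < threshold:
            ann.needs_review = True
            review_queue.append(ann)      # handed off to a human checker asynchronously
        annotations.append(ann)
    print(f"{len(review_queue)} of {len(annotations)} frames flagged for review")
    return annotations

anns = propagate(Annotation(frame_index=0, box=[310, 220, 380, 300], confidence=1.0), num_frames=20)
```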
@article{robointer2025,
  title={RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation},
  author={Anonymous},
  journal={arXiv preprint},
  year={2025}
}