Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully exploit the motion information offered by aggregated multi-frame historical observations, since the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, which adapts the prediction of the vision-language backbone from discrete action tokens to motion features during post-training and aggregates motion features from historical frames into a feature chunk; (3) cross-frame decoding, which maps the feature chunk to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate and a 12.7% improvement over OpenVLA on LIBERO. Real-world experiments on a Franka robot further demonstrate its strong performance and robustness.
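The multi-frame encoding and cross-frame decoding described above can be pictured with a short PyTorch sketch. This is a minimal illustration under our own assumptions, not the released implementation: the names `MotionFeatureCache` and `CrossFrameDecoder`, the feature dimension, the history length, and the single learnable action query are all hypothetical.

```python
from collections import deque

import torch
import torch.nn as nn


class MotionFeatureCache:
    """FIFO cache so each frame's motion feature passes through the VLM backbone only once."""

    def __init__(self, history: int = 8):
        self.buffer = deque(maxlen=history)

    def push(self, motion_feature: torch.Tensor) -> torch.Tensor:
        # motion_feature: (B, D), produced by the backbone for the current frame only.
        self.buffer.append(motion_feature)
        # Feature chunk over the cached frames: (B, T <= history, D).
        return torch.stack(list(self.buffer), dim=1)


class CrossFrameDecoder(nn.Module):
    """Shared decoder that maps a feature chunk to an action via cross-attention."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7, n_heads: int = 8):
        super().__init__()
        # Learnable query that attends over the chunk of historical motion features.
        self.action_query = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.action_head = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, action_dim))

    def forward(self, feature_chunk: torch.Tensor) -> torch.Tensor:
        # feature_chunk: (B, T, feat_dim) -> continuous action (B, action_dim).
        q = self.action_query.expand(feature_chunk.size(0), -1, -1)
        fused, _ = self.cross_attn(q, feature_chunk, feature_chunk)
        return self.action_head(fused.squeeze(1))
```

In this sketch, only the current frame is encoded by the backbone at each control step; its motion feature is pushed into the cache and the decoder attends over the resulting chunk, which is what removes the need to re-process historical observations.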
We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. Three task suites are designed: (1) simple pick-and-place tasks; (2) long-horizon tasks; and (3) generalization and robustness tasks.
Simulation experiments include: (1) performance comparisons on the Google Robot and WidowX robot setups of SimplerEnv, conducted across 12 tasks under both the visual matching (VM) and variant aggregation (VA) settings; (2) main results on LIBERO, reporting the average success rate across 3 seeds with 500 trials per task.
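The LIBERO aggregation above (averaging over 3 seeds and 500 trials per task) amounts to a simple mean; the helper below is a hypothetical sketch assuming per-task results are stored as boolean success arrays.

```python
import numpy as np

def average_success_rate(results: dict) -> float:
    """results: task name -> boolean success array of shape (n_seeds, n_trials), e.g. (3, 500)."""
    per_task = [np.asarray(trials).mean() for trials in results.values()]
    return float(np.mean(per_task))  # mean over tasks of per-task success rates
```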
@misc{li2025cronusvla,
  title         = {CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
  author        = {Hao Li and Shuai Yang and Yilun Chen and Yang Tian and Xiaoda Yang and Xinyi Chen and Hanqing Wang and Tai Wang and Feng Zhao and Dahua Lin and Jiangmiao Pang},
  year          = {2025},
  eprint        = {2506.19816},
  archivePrefix = {arXiv}
}