CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Hao Li1,2*, Shuai Yang2,3*, Yilun Chen2, Yang Tian2, Xiaoda Yang3, Xinyi Chen2, Hanqing Wang2, Tai Wang2, Feng Zhao1, Dahua Lin4, Jiangmiao Pang2
1University of Science and Technology of China, 2Shanghai Artificial Intelligence Laboratory,
3Zhejiang University, 4The Chinese University of Hong Kong
Under review

*Indicates Equal Contribution
[Figure: Introduction of CronusVLA]

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, which adapts the vision-language backbone's prediction from discrete action tokens to motion features during post-training and aggregates the motion features of historical frames into a feature chunk; (3) cross-frame decoding, which maps the feature chunk to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate and a 12.7% improvement over OpenVLA on LIBERO. Real-world experiments on a Franka arm further demonstrate its strong performance and robustness.
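To make this inference pattern concrete, below is a minimal PyTorch sketch of multi-frame encoding with cached motion features and a cross-frame decoder. All class names, dimensions, and the single learned action query are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class MotionFeatureCache:
    """Keeps motion features of the last few frames so that, at each control
    step, only the newest observation passes through the heavy VLM backbone."""
    def __init__(self, max_frames=4):
        self.max_frames = max_frames
        self.feats = []

    def push(self, feat):          # feat: (B, D) motion feature of the newest frame
        self.feats = (self.feats + [feat])[-self.max_frames:]

    def chunk(self):               # -> (B, T, D) feature chunk, T <= max_frames
        return torch.stack(self.feats, dim=1)

class CrossFrameDecoder(nn.Module):
    """A learned action query cross-attends over the feature chunk and is
    projected to a continuous action."""
    def __init__(self, dim=512, n_heads=8, action_dim=7):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, chunk):      # chunk: (B, T, D)
        q = self.query.expand(chunk.size(0), -1, -1)
        fused, _ = self.attn(q, chunk, chunk)    # cross-attention over frames
        return self.head(fused.squeeze(1))       # (B, action_dim)

cache, decoder = MotionFeatureCache(), CrossFrameDecoder()
for _ in range(4):                 # one iteration per control step
    new_feat = torch.randn(1, 512) # stand-in for the backbone's motion feature
    cache.push(new_feat)           # past features are reused, not recomputed
    action = decoder(cache.chunk())

Because only the newest frame is encoded, the per-step cost stays close to single-frame inference while the decoder still attends over the full history window.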

[Figure: Method of CronusVLA]

Contributions

  • We propose CronusVLA, a general end-to-end framework that extends VLA models to the multi-frame paradigm. Based on single-frame pretraining, CronusVLA unifies multi-frame encoding and cross-frame decoding for action prediction, enabling scalable manipulation learning.
  • We design a multi-frame post-training strategy that aggregates motion information across frames and decodes it with a cross-frame decoder. This approach enables efficient action prediction while also supporting fast inference and long-horizon compatibility. An action adaptation mechanism is further introduced to provide an action prior, yielding considerable performance gains (see the retrieval sketch after this list).
  • We conduct extensive experiments across three embodiments and diverse manipulation tasks in both simulation and the real world. CronusVLA achieves state-of-the-art performance on the SimplerEnv benchmark with an average success rate of 70.9% and a 12.7% overall improvement over OpenVLA on the LIBERO benchmark. It also demonstrates strong performance and robustness on simple and long-horizon real-world tasks with a Franka arm.
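The action adaptation mechanism can be read as a feature-to-action lookup. The hypothetical Python sketch below assumes a nearest-neighbor formulation with cosine similarity and top-k averaging; the class name and every design choice here are illustrative, not the paper's exact procedure.

import torch
import torch.nn.functional as F

class FeatureActionRetriever:
    """Stores (motion feature, action) pairs from the finetuning set and
    retrieves an action prior for a query feature."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, feat, action):   # feat: (D,), action: (A,)
        self.keys.append(feat)
        self.values.append(action)

    def prior(self, query, k=3):   # query: (D,) -> (A,) action prior
        keys = torch.stack(self.keys)                                 # (N, D)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)  # (N,)
        idx = sims.topk(min(k, len(self.keys))).indices
        return torch.stack([self.values[i] for i in idx]).mean(dim=0)

Such a prior can then be blended with the decoder's prediction during finetuning, biasing it toward actions observed in similar motion contexts.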

Real World

We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. Three task suites are designed: (1) simple pick-and-place tasks; (2) long-horizon tasks; and (3) generalization and robustness tasks.


Simulation

Simulation experiments include: (1) performance comparisons on the Google Robot and WidowX robot of SimplerEnv, conducted across 12 tasks under both the visual matching (VM) and variant aggregation (VA) settings; (2) main results on LIBERO, reporting the average success rate across 3 seeds over 500 trials per task.
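For clarity on how these averages are computed, here is a small NumPy sketch under the stated protocol (binary success per trial, averaged over trials and then over seeds); the function name and array layout are assumptions for illustration.

import numpy as np

def average_success_rate(outcomes: np.ndarray) -> np.ndarray:
    """outcomes: (n_seeds, n_tasks, n_trials) array of 0/1 trial successes."""
    per_seed = outcomes.mean(axis=-1)   # (n_seeds, n_tasks) per-task rates
    return per_seed.mean(axis=0)        # (n_tasks,) averaged over seeds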

[Tables: Simulation results on SimplerEnv and LIBERO]

BibTeX


        @misc{2506.19816,
          author        = {Hao Li and Shuai Yang and Yilun Chen and Yang Tian and Xiaoda Yang and Xinyi Chen and Hanqing Wang and Tai Wang and Feng Zhao and Dahua Lin and Jiangmiao Pang},
          title         = {CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
          year          = {2025},
          eprint        = {2506.19816},
          archiveprefix = {arXiv}
        }