End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy
VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection
Abstract
Achieving human-like manipulation capabilities with dexterous hands for general-purpose robots remains a grand challenge. While recent Vision-Language-Action (VLA) models show promise in learning flexible skills from human-guided demonstrations, their scalability is constrained by the scarcity of high-quality training data. Existing approaches to real-robot data collection each have inherent limitations: fully manual teleoperation imposes excessive cognitive load on human operators, which limits session duration, while automated planning often produces unnatural motions and yields a data distribution that is suboptimal for learning targeted, skillful manipulation.
To address this, we propose a Shared Autonomy framework that partitions control between the macro (arm) and micro (hand) motion domains. A human operator guides the pose of the robot end-effector via intuitive VR teleoperation, while an autonomous DexGrasp-VLA policy, using real-time tactile and local visual feedback, acts as an assistant that handles fine-grained, force-adaptive hand control. This division of labour significantly reduces the operator's cognitive load and enables efficient collection of high-quality, coordinated arm-hand demonstrations.
Using these demonstrations, we train an end-to-end VLA policy enhanced with our proposed Arm-Hand Feature Enhancement module. This architecture explicitly captures the distinct latent features of macro (arm) and micro (hand) movements as well as their shared representations, resulting in more natural and robust arm-hand coordination.
Furthermore, our Corrective Teleoperation system enables continuous policy improvement through human-in-the-loop failure recovery and data augmentation. Experiments show that our framework generates high-quality data with very little human effort, and that policies fine-tuned on this data achieve a success rate of around 90% across a diverse set of over 50 objects, including unseen instances. The system's effectiveness is validated through comprehensive evaluations and ablation studies, highlighting its potential for developing dexterous manipulation capabilities.
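The control split described in the abstract can be illustrated with a minimal sketch of one data-collection step: the operator supplies the arm command, an autonomous policy supplies the hand command, and the two are concatenated into a single arm-hand action that is logged as a demonstration. Everything named here is hypothetical and not taken from the paper: `ARM_DOF`, `HAND_DOF`, `Observation`, `shared_autonomy_step`, and especially `toy_hand_policy`, which is a simple tactile threshold rule standing in for the learned DexGrasp-VLA policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumed dimensions (not stated in the source): a 7-D end-effector pose
# (position + quaternion) for the arm and 12 joint targets for the hand.
ARM_DOF = 7
HAND_DOF = 12


@dataclass
class Observation:
    """Hypothetical sensor bundle available to the hand policy."""
    tactile: Sequence[float]       # per-sensor contact readings in [0, 1]
    wrist_image: Sequence[float]   # placeholder for local visual features


HandPolicy = Callable[[Observation], List[float]]


def toy_hand_policy(obs: Observation) -> List[float]:
    """Stand-in for the learned hand policy: keep closing the fingers
    until the mean tactile reading reaches a contact threshold."""
    mean_contact = sum(obs.tactile) / len(obs.tactile) if obs.tactile else 0.0
    closure = 0.0 if mean_contact >= 0.5 else 1.0
    return [closure] * HAND_DOF


def shared_autonomy_step(
    operator_arm_cmd: Sequence[float],
    obs: Observation,
    hand_policy: HandPolicy,
) -> List[float]:
    """Combine the human arm command (macro) with the autonomous hand
    command (micro) into one arm-hand action vector."""
    hand_cmd = hand_policy(obs)
    if len(operator_arm_cmd) != ARM_DOF or len(hand_cmd) != HAND_DOF:
        raise ValueError("unexpected command dimension")
    return list(operator_arm_cmd) + list(hand_cmd)


# Log (observation, action) pairs as a demonstration for later training.
demonstration = []
arm_cmd = [0.4, 0.0, 0.3, 0.0, 0.0, 0.0, 1.0]  # pose from the VR controller
for tactile_level in (0.0, 0.2, 0.8):
    obs = Observation(tactile=[tactile_level] * 4, wrist_image=[])
    action = shared_autonomy_step(arm_cmd, obs, toy_hand_policy)
    demonstration.append((obs, action))
```

In this sketch the operator never issues finger commands; the logged actions nevertheless cover both arm and hand, which is what allows a single end-to-end arm-hand policy to be trained on the resulting data.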
Grasping Different Objects
Hand-Only Grasping Results
Arm-Hand Grasping Results
Video Presentation
BibTeX
@article{dexvla2025,
title={End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection},
author={Yu Cui and Yujian Zhang and Lina Tao and Yang Li and Xinyu Yi and Zhibin Li},
journal={arXiv preprint arXiv:2511.00139},
year={2025},
}