
End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy

VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection

Bytedance Seed
2025
Data collection and training pipeline for DexGrasp-VLA policy and arm-hand VLA policies. (a) Tactile-based DexGrasp-VLA policy for a five-finger dexterous hand, (b) Shared autonomy data collection, (c) End-to-end arm-hand policy learning with Arm-Hand Feature Enhancement, (d) Corrective human-in-the-loop teleoperation.

Abstract

Achieving human-like manipulation capabilities with dexterous hands for general-purpose robots remains a grand challenge. While recent Vision-Language-Action (VLA) models show promise in learning flexible skills from human-guided demonstrations, their scalability is constrained by the scarcity of high-quality training data. Existing real-robot data collection has certain inherent limitations: fully manual teleoperation imposes excessive cognitive load on human operators, limiting session duration, while automated planning often produces unnatural motions and yields a data distribution that is suboptimal for learning targeted skillful manipulation.

To address this, we propose a Shared Autonomy framework that partitions control along the macro-micro motion domains. A human operator guides the pose of the robot end-effector via intuitive VR teleoperation, while an autonomous DexGrasp-VLA policy, using real-time tactile and local visual feedback, acts as an assistant that handles fine-grained, force-adaptive hand control. This division of labour significantly reduces the operator's cognitive load and enables efficient collection of high-quality, coordinated arm-hand demonstrations.
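
The macro-micro split can be pictured as composing each recorded action from two sources. The Python sketch below is illustrative only: `Step`, `toy_hand_policy`, and `shared_autonomy_step` are hypothetical names, and the simple proportional force rule merely stands in for the learned DexGrasp-VLA policy, whose interface is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One recorded demonstration step (hypothetical structure)."""
    arm_action: List[float]   # macro motion: end-effector pose from the human operator
    hand_action: List[float]  # micro motion: finger commands from the autonomous policy


def toy_hand_policy(tactile: Sequence[float],
                    target_force: float = 1.0,
                    gain: float = 0.1) -> List[float]:
    """Stand-in for the hand policy (an assumption, not the paper's model).

    Each finger closes while its contact force is below target_force
    and opens when the force exceeds it.
    """
    return [gain * (target_force - force) for force in tactile]


def shared_autonomy_step(operator_pose: Sequence[float],
                         tactile: Sequence[float],
                         hand_policy: Callable[[Sequence[float]], List[float]]) -> Step:
    """Combine the human arm command with the autonomous hand command."""
    return Step(arm_action=list(operator_pose), hand_action=hand_policy(tactile))


# Example: a 7-value pose (position + quaternion, assumed) and five fingertip forces.
pose = [0.4, 0.0, 0.3, 1.0, 0.0, 0.0, 0.0]
forces = [0.0, 1.0, 2.0, 1.0, 0.0]
step = shared_autonomy_step(pose, forces, toy_hand_policy)
```

In this sketch the operator never issues finger commands; the logged step still contains a full arm-hand action, which is the kind of coordinated sample the end-to-end policy is later trained on.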

Leveraging these demonstration data, we train an end-to-end VLA policy enhanced with our proposed Arm-Hand Feature Enhancement module. This architecture explicitly captures the distinct latent features of macro (arm) and micro (hand) movements as well as their shared representations, resulting in more natural and robust arm-hand coordination.
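
One generic way to realise "distinct plus shared" features is a shared branch whose output is concatenated with an arm-specific and a hand-specific branch before each action head. The sketch below illustrates that idea in plain Python; it is not the paper's module. The class name `ArmHandHeads`, all dimensions, the random weights, and the concatenation scheme are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product without bias."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def init(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weight matrix of shape (rows, cols)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ArmHandHeads:
    """Shared branch plus separate arm and hand branches (hypothetical layout)."""

    def __init__(self, feat_dim: int = 8, latent_dim: int = 4,
                 arm_dim: int = 7, hand_dim: int = 6, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w_shared = init(latent_dim, feat_dim, rng)
        self.w_arm = init(latent_dim, feat_dim, rng)
        self.w_hand = init(latent_dim, feat_dim, rng)
        # Each head sees its own latent concatenated with the shared latent.
        self.w_arm_out = init(arm_dim, 2 * latent_dim, rng)
        self.w_hand_out = init(hand_dim, 2 * latent_dim, rng)

    def forward(self, features: Vector) -> Tuple[Vector, Vector]:
        shared = relu(linear(self.w_shared, features))
        arm_latent = relu(linear(self.w_arm, features))
        hand_latent = relu(linear(self.w_hand, features))
        arm_action = linear(self.w_arm_out, arm_latent + shared)
        hand_action = linear(self.w_hand_out, hand_latent + shared)
        return arm_action, hand_action


model = ArmHandHeads()
arm_action, hand_action = model.forward([0.1] * 8)
```

Because both heads read the same shared latent, gradients from arm and hand losses would both shape it in a trained version, while the branch-specific latents remain free to specialise.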

Furthermore, our Corrective Teleoperation system enables continuous policy improvement through human-in-the-loop failure recovery and data augmentation. Experiments show that our framework generates high-quality data with very low manpower requirements, and that policies fine-tuned on these data achieve a success rate of around 90% across a diverse set of over 50 objects, including unseen instances. Comprehensive evaluations and ablation studies validate the system's effectiveness and highlight its potential for developing dexterous manipulation capabilities.
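
The corrective loop can be sketched as a rollout in which a human takes over at flagged steps and only those corrected steps are added to the training set. This is a minimal illustration under stated assumptions, not the system's actual procedure: `rollout_with_corrections`, the `needs_help` failure check, and the scalar observations and actions are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float]  # (observation, action), scalars for brevity


def rollout_with_corrections(
    observations: Sequence[float],
    policy: Callable[[float], float],
    needs_help: Callable[[float], bool],
    human_action: Callable[[float], float],
) -> Tuple[List[Transition], List[Transition]]:
    """Run the policy; at flagged steps, record the human's action instead."""
    autonomous: List[Transition] = []
    corrective: List[Transition] = []
    for obs in observations:
        if needs_help(obs):
            corrective.append((obs, human_action(obs)))
        else:
            autonomous.append((obs, policy(obs)))
    return autonomous, corrective


# Toy example: the human intervenes whenever the observation exceeds 0.5.
dataset: List[Transition] = []
autonomous, corrective = rollout_with_corrections(
    observations=[0.1, 0.9, 0.2],
    policy=lambda obs: obs * 2.0,
    needs_help=lambda obs: obs > 0.5,
    human_action=lambda obs: -1.0,
)
dataset.extend(corrective)  # corrected steps become new fine-tuning data
```

Keeping the corrective transitions separate makes it easy to weight or augment them before the next fine-tuning round.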

Grasping Different Objects

Hand-Only Grasping Results

Arm-Hand Grasping Results

Video Presentation

BibTeX

@article{dexvla2025,
  title={End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection},
  author={Yu Cui and Yujian Zhang and Lina Tao and Yang Li and Xinyu Yi and Zhibin Li},
  journal={arXiv preprint arXiv:2511.00139},
  year={2025},
}