The code repository will be available shortly. Please check back later!
¹Paxini Tech  ²Shanghai Jiao Tong University
†Corresponding author
Recent vision-language-action (VLA) models build on vision-language foundations, achieve promising results, and show potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Specifically, our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by combining a pretrained vision transformer (ViT) with a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories; with 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline) on pick-and-place tasks. In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories than existing VLA models by exploiting tactile sensing.
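Below is a minimal PyTorch sketch of a dual-path tactile encoder of this kind. The class names, layer sizes, and concatenation-based fusion are illustrative assumptions for exposition, not the released OmniVTLA implementation.

```python
# Illustrative dual-path tactile encoder: one generically pretrained ViT branch and
# one semantically-aligned tactile ViT (SA-ViT) branch, fused by concatenation.
# All names and hyperparameters here are assumptions, not the authors' code.
import torch
import torch.nn as nn


class ViTBackbone(nn.Module):
    """Stand-in ViT: patchify, prepend a CLS token, run a transformer encoder."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=4, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # x: (B, 3, H, W) tactile image
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), tokens], dim=1) + self.pos
        return self.encoder(tokens)[:, 0]      # CLS feature, shape (B, dim)


class DualPathTactileEncoder(nn.Module):
    def __init__(self, dim=768, out_dim=768):
        super().__init__()
        self.generic_vit = ViTBackbone(dim=dim)  # would load generic pretrained weights
        self.sa_vit = ViTBackbone(dim=dim)       # would load SA-ViT weights trained on ObjTac
        self.proj = nn.Linear(2 * dim, out_dim)  # fuse the two branches

    def forward(self, tactile_img):
        feats = torch.cat([self.generic_vit(tactile_img), self.sa_vit(tactile_img)], dim=-1)
        return self.proj(feats)                  # (B, out_dim) tactile feature for the VLA


if __name__ == "__main__":
    enc = DualPathTactileEncoder()
    print(enc(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 768])
```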
We introduce ObjTac, a novel multimodal dataset featuring aligned textual descriptions, video recordings, and force-based tactile data. The collection comprises tactile-vision paired samples for 56 distinct objects, systematically organized into 10 material categories.
For each object, we conducted 2–5 interaction trials, each lasting 10–60 seconds and sampled at 60 Hz. This yielded a total of 270K force readings and, in total, 135K samples with paired tactile and vision data.
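The paired-sample count is roughly half the force-reading count, which is what one would expect if each 30 FPS video frame were matched to its nearest 60 Hz tactile reading. The snippet below sketches such nearest-timestamp pairing; this procedure is an assumption for illustration, not the documented ObjTac pipeline.

```python
# Illustrative nearest-timestamp pairing of 60 Hz tactile readings with 30 FPS video
# frames (an assumed procedure, not the authors' pipeline).
import numpy as np


def pair_tactile_with_video(tactile_t, video_t):
    """For each video frame timestamp, return the index of the closest tactile sample."""
    idx = np.searchsorted(tactile_t, video_t)
    idx = np.clip(idx, 1, len(tactile_t) - 1)
    left_closer = (video_t - tactile_t[idx - 1]) < (tactile_t[idx] - video_t)
    return np.where(left_closer, idx - 1, idx)


# Example: a 30 s trial sampled at 60 Hz (tactile) and 30 FPS (video).
tactile_t = np.arange(0, 30, 1 / 60)   # 1800 tactile readings
video_t = np.arange(0, 30, 1 / 30)     # 900 video frames
pairs = pair_tactile_with_video(tactile_t, video_t)
print(len(pairs))                      # 900 tactile-vision pairs, ~half the tactile count
```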
56 distinct objects with diverse material properties and geometries
Collected using the Paxini Gen2 tactile sensor
| Parameter | Specification |
|---|---|
| Total Objects | 56 |
| Categories | 10 |
| Total Samples | 135K |
| Modality | Text + Vision + Tactile |
| Tactile Sensor | Paxini Tech Gen 2 |
| Frequency | 60 Hz |
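For a concrete picture of how one record could be organized, here is a hypothetical Python container mirroring the statistics above. The field names, the "metal" category label, and the file paths are illustrative assumptions; the released dataset format may differ.

```python
# Hypothetical per-sample record reflecting the table above. Field names, the
# category label, and the paths are assumptions, not the released ObjTac schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjTacSample:
    object_id: str                            # one of 56 objects
    category: str                             # one of 10 material categories
    text: str                                 # textual description of the object
    video_path: str                           # 720P, 30 FPS first-person-view clip
    forces: List[Tuple[float, float, float]]  # (fx, fy, fz) readings at 60 Hz


sample = ObjTacSample(
    object_id="obj_007",
    category="metal",
    text="Metallic tool with articulated joints",
    video_path="videos/obj_007_trial_01.mp4",
    forces=[(0.10, -0.02, 1.30), (0.12, -0.01, 1.45)],
)
print(sample.category, len(sample.forces))
```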
We captured first-person-view video recordings at 720P resolution and 30 FPS, resulting in 252 video sequences with an average duration of 18 seconds.
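As a rough sanity check (our arithmetic, not a figure reported above), the frame count implied by these video statistics is on the order of the 135K paired samples:

```python
# Consistency check on the video statistics (illustrative arithmetic only).
clips, avg_seconds, fps = 252, 18, 30
print(clips * avg_seconds * fps)  # 136080 frames, close to the 135K paired samples
```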
Hard surface with minimal deformation
Lightweight with smooth surface texture
Cylindrical shape with textured grip
Metallic tool with articulated joints
We present temporal visualizations of tactile signals for four distinct object categories from our ObjTac dataset. These visualizations demonstrate the unique force patterns and dynamic responses captured by the Paxini Gen2 sensor during object interactions. In the global coordinate system, the z-axis is perpendicular to the sensor surface and points downward, while the x and y axes are parallel to the sensor surface.
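A minimal matplotlib sketch of such a tri-axial force-versus-time plot is given below, using the coordinate convention just described; the signal is synthetic and only stands in for a real ObjTac trace.

```python
# Plot a tri-axial force trace over time (z normal to the sensor surface, x/y in-plane).
# The signal below is synthetic and purely illustrative.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0, 10, 1 / 60)                        # 10 s of 60 Hz samples
fz = 2.0 * np.clip(np.sin(0.5 * np.pi * t), 0, 1)   # pressing (normal) force
fx = 0.2 * np.sin(4 * np.pi * t) * (fz > 0.5)       # small in-plane (shear) components
fy = 0.1 * np.cos(4 * np.pi * t) * (fz > 0.5)

for axis, label in [(fx, "Fx"), (fy, "Fy"), (fz, "Fz")]:
    plt.plot(t, axis, label=label)
plt.xlabel("time (s)")
plt.ylabel("force (N)")
plt.legend()
plt.title("Tactile force trace (synthetic example)")
plt.show()
```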
To understand the effectiveness of tactile sensing, we present qualitative results from our real-world experiments. OmniVTLA uses semantic tactile cues to stabilize grasps and execute smooth trajectories, as seen in the successful lifts of the short can with the gripper and of the bottle with the dexterous hand.
If you use our dataset in your research, please cite our paper: