Coming Soon

The code repository will be available shortly. Please check back later!

OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

Zhengxue Cheng 1,2,†
Yiqian Zhang 1
Wenkang Zhang 2
Haoyu Li 1
Keyu Wang 2
Li Song 2
Hengdi Zhang 1

1 Paxini Tech   2 Shanghai Jiao Tong University

† Corresponding author

Abstract

Recent vision-language-action (VLA) models build upon vision-language foundations and have achieved promising results, showing the potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Specifically, our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by combining a pretrained vision transformer (ViT) with a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines in pick-and-place tasks, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline). In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA models.
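
To make the dual-path design concrete, the sketch below shows one way a tactile input could be routed through a generically pretrained ViT branch and a semantically-aligned tactile ViT (SA-ViT) branch and then fused; the module interfaces, fusion scheme, and feature dimension are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class DualPathTactileEncoder(nn.Module):
    """Minimal sketch of a dual-path tactile encoder: a generically
    pretrained ViT branch plus a semantically-aligned tactile ViT
    (SA-ViT) branch. Interfaces and sizes are assumptions."""

    def __init__(self, pretrained_vit: nn.Module, sa_vit: nn.Module, dim: int = 768):
        super().__init__()
        self.pretrained_vit = pretrained_vit  # generic visual features of the tactile input
        self.sa_vit = sa_vit                  # tactile features aligned with language semantics
        self.fuse = nn.Linear(2 * dim, dim)   # simple concatenate-and-project fusion

    def forward(self, tactile_image: torch.Tensor) -> torch.Tensor:
        # tactile_image: (B, 3, H, W) image-like rendering of the tactile signal
        feat_generic = self.pretrained_vit(tactile_image)  # (B, dim)
        feat_semantic = self.sa_vit(tactile_image)         # (B, dim)
        return self.fuse(torch.cat([feat_generic, feat_semantic], dim=-1))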

Overview

We introduce ObjTac, a novel multimodal dataset featuring aligned textual descriptions, video recordings, and force-based tactile data. The collection encompasses tactile-vision paired samples for 56 distinct objects, systematically organized into 10 material categories.

Figure 1: Overview of ObjTac

Dataset Details

For each object, we conducted 2–5 interaction trials, each lasting 10–60 seconds and sampled at 60 Hz. This yielded a total of 270,000 force readings. In total, we collected 135K samples with paired tactile and vision data.
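
As a concrete illustration of how one interaction trial could be represented when working with the data, the sketch below defines a hypothetical trial record; the field names and array shapes are assumptions for illustration, not the released format.

from dataclasses import dataclass

import numpy as np

@dataclass
class ObjTacTrial:
    """Hypothetical container for one ObjTac interaction trial.
    Field names and array shapes are illustrative assumptions."""
    object_name: str          # one of the 56 objects
    category: str             # one of the 10 material categories
    description: str          # paired textual description
    video_frames: np.ndarray  # (T_video, H, W, 3), first-person view at 30 FPS
    forces: np.ndarray        # (T_tactile, N_taxels, 3), per-taxel x/y/z force at 60 Hz

    def duration_seconds(self) -> float:
        # Trials last 10-60 s; recover the duration from the 60 Hz force stream.
        return self.forces.shape[0] / 60.0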

Object Variety

56 distinct objects with diverse material properties and geometries

Sensor Information

Collected using the Paxini Gen2 tactile sensor

Data Specifications

Parameter        Specification
Total Objects    56
Categories       10
Total Samples    135K
Modality         Text + Vision + Tactile
Tactile Sensor   Paxini Gen2
Frequency        60 Hz

Data Examples

We capture first-person-view visual recordings at 720p resolution and 30 FPS, resulting in 252 video sequences with an average duration of 18 seconds.
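
These figures line up with the sample counts reported above; a quick back-of-the-envelope check (approximate, since it uses the reported averages and assumes roughly one paired sample per video frame):

# Approximate consistency check using the reported averages.
videos = 252          # recorded sequences
avg_seconds = 18      # average duration per sequence
video_fps = 30
tactile_hz = 60

frames = videos * avg_seconds * video_fps      # 136,080 -- close to the 135K paired samples
readings = videos * avg_seconds * tactile_hz   # 272,160 -- close to the 270,000 force readings
print(frames, readings)  # roughly two 60 Hz tactile readings per 30 FPS video frame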

Tactile Signal Visualization

We present temporal visualizations of tactile signals for representative object categories from our ObjTac dataset. These visualizations demonstrate the unique force patterns and dynamic responses captured by the Paxini Gen2 sensor during object interactions. In the global coordinate system, the z-axis is perpendicular to the sensor surface and points downward, while the x- and y-axes are parallel to the sensor surface.
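
For readers who want to reproduce this kind of plot from raw force data, a minimal sketch is given below; the array layout (per-taxel x/y/z forces sampled at 60 Hz) is an assumption for illustration.

import matplotlib.pyplot as plt
import numpy as np

def plot_total_force(forces: np.ndarray, sampling_rate_hz: int = 60) -> None:
    """Plot total force over time for each axis.

    forces: assumed shape (T, N_taxels, 3), per-taxel force in the global
    frame where z is perpendicular to the sensor surface (pointing down)
    and x/y are parallel to it.
    """
    total = forces.sum(axis=1)                        # (T, 3), summed over taxels
    t = np.arange(total.shape[0]) / sampling_rate_hz  # time axis in seconds
    for i, label in enumerate(["Fx", "Fy", "Fz"]):
        plt.plot(t, total[:, i], label=label)
    plt.xlabel("Time (s)")
    plt.ylabel("Total force")
    plt.legend()
    plt.show()

# Example with synthetic data: 10 s of random forces on a 4x4 taxel grid.
plot_total_force(np.random.randn(600, 16, 3))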

Rigid Object

Rigid object tactile visualization
Figure 2: The temporal variations of tactile array force (Rigid Object)
Rigid object tactile visualization
Figure 3: The temporal variations of tactile total force (Rigid Object)

Textured Object

Textured object tactile visualization
Figure 4: The temporal variations of tactile array force (Textured Object)
Textured object tactile visualization
Figure 5: The temporal variations of tactile total force (Textured Object)

Deformable Object

Deformable object tactile visualization
Figure 6: The temporal variations of tactile array force (Deformable Object)
Deformable object tactile visualization
Figure 7: The temporal variations of tactile total force (Deformable Object)

Real-World Experiments

To illustrate the effectiveness of tactile sensing, we present qualitative results from real-world experiments. OmniVTLA uses semantic tactile cues to stabilize grasps and execute smooth trajectories, as seen in successful lifts of a short can with the gripper and of a bottle with the dexterous hand.

Gripper Examples

Dexterous Hand Examples

BibTeX

If you use our dataset in your research, please cite our paper:

@article{cheng2025omnivtla,
  title={OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing},
  author={Cheng, Zhengxue and Zhang, Yiqian and Zhang, Wenkang and Li, Haoyu and Wang, Keyu and Song, Li and Zhang, Hengdi},
  journal={arXiv preprint arXiv:2508.08706},
  year={2025}
}