The code repository will be available shortly. Please check back later!
¹Paxini Tech  ²Shanghai Jiao Tong University
†Corresponding author
Recent vision-language-action (VLA) models build on vision-language foundations, achieve promising results, and show potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Specifically, our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by combining a pretrained vision transformer (ViT) with a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories; with 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline) on pick-and-place tasks. In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories than existing VLA models by exploiting tactile sensing.
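Below is a minimal PyTorch sketch of a dual-path tactile encoder of this kind. The class names, layer sizes, and concatenation-based fusion are illustrative assumptions for exposition, not the released OmniVTLA implementation.

```python
# Illustrative dual-path tactile encoder: one generically pretrained ViT branch and
# one semantically-aligned tactile ViT (SA-ViT) branch, fused by concatenation.
# All names and hyperparameters here are assumptions, not the authors' code.
import torch
import torch.nn as nn


class ViTBackbone(nn.Module):
    """Stand-in ViT: patchify, prepend a CLS token, run a transformer encoder."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=4, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # x: (B, 3, H, W) tactile image
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), tokens], dim=1) + self.pos
        return self.encoder(tokens)[:, 0]      # CLS feature, shape (B, dim)


class DualPathTactileEncoder(nn.Module):
    def __init__(self, dim=768, out_dim=768):
        super().__init__()
        self.generic_vit = ViTBackbone(dim=dim)  # would load generic pretrained weights
        self.sa_vit = ViTBackbone(dim=dim)       # would load SA-ViT weights trained on ObjTac
        self.proj = nn.Linear(2 * dim, out_dim)  # fuse the two branches

    def forward(self, tactile_img):
        feats = torch.cat([self.generic_vit(tactile_img), self.sa_vit(tactile_img)], dim=-1)
        return self.proj(feats)                  # (B, out_dim) tactile feature for the VLA


if __name__ == "__main__":
    enc = DualPathTactileEncoder()
    print(enc(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 768])
```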
We introduce ObjTac, a novel multimodal dataset featuring aligned textual descriptions, video recordings, and force-based tactile data. The collection comprises tactile-vision paired samples for 56 distinct objects, systematically organized into 10 material categories.
For each object, we conducted 2–5 interaction trials, each lasting 10–60 seconds and sampled at 60 Hz. This yielded a total of 270K force readings and, in total, 135K samples with paired tactile and vision data.
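The paired-sample count is roughly half the force-reading count, which is what one would expect if each 30 FPS video frame were matched to its nearest 60 Hz tactile reading. The snippet below sketches such nearest-timestamp pairing; this procedure is an assumption for illustration, not the documented ObjTac pipeline.

```python
# Illustrative nearest-timestamp pairing of 60 Hz tactile readings with 30 FPS video
# frames (an assumed procedure, not the authors' pipeline).
import numpy as np


def pair_tactile_with_video(tactile_t, video_t):
    """For each video frame timestamp, return the index of the closest tactile sample."""
    idx = np.searchsorted(tactile_t, video_t)
    idx = np.clip(idx, 1, len(tactile_t) - 1)
    left_closer = (video_t - tactile_t[idx - 1]) < (tactile_t[idx] - video_t)
    return np.where(left_closer, idx - 1, idx)


# Example: a 30 s trial sampled at 60 Hz (tactile) and 30 FPS (video).
tactile_t = np.arange(0, 30, 1 / 60)   # 1800 tactile readings
video_t = np.arange(0, 30, 1 / 30)     # 900 video frames
pairs = pair_tactile_with_video(tactile_t, video_t)
print(len(pairs))                      # 900 tactile-vision pairs, ~half the tactile count
```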
56 distinct objects with diverse material properties and geometries
Collected using the Paxini Gen2 tactile sensor
| Parameter | Specification |
|---|---|
| Total Objects | 56 |
| Categories | 10 |
| Total Samples | 135K |
| Modality | Text + Vision + Tactile |
| Tactile Sensor | Paxini Tech Gen 2 |
| Frequency | 60 Hz |
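For a concrete picture of how one record could be organized, here is a hypothetical Python container mirroring the statistics above. The field names, the "metal" category label, and the file paths are illustrative assumptions; the released dataset format may differ.

```python
# Hypothetical per-sample record reflecting the table above. Field names, the
# category label, and the paths are assumptions, not the released ObjTac schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjTacSample:
    object_id: str                            # one of 56 objects
    category: str                             # one of 10 material categories
    text: str                                 # textual description of the object
    video_path: str                           # 720P, 30 FPS first-person-view clip
    forces: List[Tuple[float, float, float]]  # (fx, fy, fz) readings at 60 Hz


sample = ObjTacSample(
    object_id="obj_007",
    category="metal",
    text="Metallic tool with articulated joints",
    video_path="videos/obj_007_trial_01.mp4",
    forces=[(0.10, -0.02, 1.30), (0.12, -0.01, 1.45)],
)
print(sample.category, len(sample.forces))
```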
We captured first-person-view video recordings at 720P resolution and 30 FPS, resulting in 252 video sequences with an average duration of 18 seconds.
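As a rough sanity check (our arithmetic, not a figure reported above), the frame count implied by these video statistics is on the order of the 135K paired samples:

```python
# Consistency check on the video statistics (illustrative arithmetic only).
clips, avg_seconds, fps = 252, 18, 30
print(clips * avg_seconds * fps)  # 136080 frames, close to the 135K paired samples
```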
Hard surface with minimal deformation
Lightweight with smooth surface texture
Cylindrical shape with textured grip
Metallic tool with articulated joints
We present temporal visualizations of tactile signals for four distinct object categories from our ObjTac dataset. These visualizations demonstrate the unique force patterns and dynamic responses captured by the Paxini Gen2 sensor during object interactions. In the global coordinate system, the z-axis is perpendicular to the sensor surface and points downward, while the x and y axes are parallel to the sensor surface.
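A minimal matplotlib sketch of such a tri-axial force-versus-time plot is given below, using the coordinate convention just described; the signal is synthetic and only stands in for a real ObjTac trace.

```python
# Plot a tri-axial force trace over time (z normal to the sensor surface, x/y in-plane).
# The signal below is synthetic and purely illustrative.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0, 10, 1 / 60)                        # 10 s of 60 Hz samples
fz = 2.0 * np.clip(np.sin(0.5 * np.pi * t), 0, 1)   # pressing (normal) force
fx = 0.2 * np.sin(4 * np.pi * t) * (fz > 0.5)       # small in-plane (shear) components
fy = 0.1 * np.cos(4 * np.pi * t) * (fz > 0.5)

for axis, label in [(fx, "Fx"), (fy, "Fy"), (fz, "Fz")]:
    plt.plot(t, axis, label=label)
plt.xlabel("time (s)")
plt.ylabel("force (N)")
plt.legend()
plt.title("Tactile force trace (synthetic example)")
plt.show()
```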
To understand the effectiveness of tactile sensing, we present qualitative results from our real-world experiments. OmniVTLA uses semantic tactile cues to stabilize grasps and execute smooth trajectories, as seen in the successful lifts of the short can with the gripper and of the bottle with the dexterous hand.
If you use our dataset in your research, please cite our paper: