VLGA is the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. Existing paradigms each miss one key capability: (a) VLAs with sparse 3D perception use structured supervision such as boxes, occupancy, and lane maps, but lack dense spatial grounding; (b) injection-based VLAs expose dense 3D features to the language model, but lack dedicated geometry capacity; (c) geometry-only driving policies provide dense grounding with dedicated capacity, but remove language reasoning. (d) VLGA preserves all three capabilities (dense spatial grounding, dedicated geometry capacity, and language reasoning) by introducing a parameter-isolated geometry expert supervised with dense geometry reconstruction.
Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA treats geometry as a fourth modality alongside vision, language, and action, handled by a dedicated expert that is supervised with a per-pixel pointmap regression loss against LiDAR. We evaluate VLGA open-loop on nuScenes and closed-loop on Bench2Drive, and it outperforms prior VLA methods on both. On open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 error (0.50 m average) and the lowest 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains a state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
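As a rough illustration of what a per-pixel pointmap regression loss against LiDAR could look like, the sketch below compares a predicted pointmap with LiDAR points projected into the image and averages the error over pixels that received a LiDAR return. The L1 distance, the validity mask, and the function name `masked_pointmap_loss` are assumptions made for illustration; the text above does not specify the exact loss form used by VLGA.

```python
import numpy as np


def masked_pointmap_loss(pred, target, valid):
    """Mean per-pixel L1 distance between predicted and LiDAR 3D points.

    pred, target: (H, W, 3) arrays holding one 3D point per pixel.
    valid: (H, W) boolean mask, True where a LiDAR return projects to the pixel.
    NOTE: the L1 form and the masking scheme are assumptions, not the
    confirmed VLGA objective.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)

    # LiDAR is sparse in image space, so some images may have no supervised pixel.
    if not valid.any():
        return 0.0

    # Sum absolute differences over x, y, z to get one error per pixel.
    per_pixel = np.abs(pred - target).sum(axis=-1)  # shape (H, W)

    # Average only over pixels that have a LiDAR target.
    return float(per_pixel[valid].mean())
```

Masking matters in this sketch because LiDAR returns cover only a fraction of the image pixels, so unsupervised pixels should not contribute to the average.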
VLGA architecture. A four-expert Mixture-of-Transformers coupled by masked joint attention: an understanding expert (language and scene semantics), a perception expert (sparse agent, map, and occupancy queries), our new geometry expert (dense spatial structure), and an action expert (motion planning). The action expert attends to the other three experts and conditions on ego status to emit the trajectory. During training, the geometry stream is supervised by a dense per-pixel pointmap regression loss against LiDAR, which gives the stream an explicit signal on its 3D content instead of relying on the action loss alone.
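One way to picture masked joint attention across expert streams is a block-structured boolean mask over the concatenated token sequence. The sketch below is a minimal illustration, not the VLGA implementation: only the action row (the action expert attends to the other three experts) comes from the caption above. The visibility of the other three experts, the token counts, and the names `VISIBLE` and `build_joint_mask` are hypothetical.

```python
import numpy as np

# Which experts' tokens each expert's queries may attend to.
# Only the "action" row is stated in the caption; the other rows
# (each expert sees just its own stream) are an assumption.
VISIBLE = {
    "understanding": {"understanding"},
    "perception": {"perception"},
    "geometry": {"geometry"},
    "action": {"understanding", "perception", "geometry", "action"},
}


def build_joint_mask(token_counts, visible=VISIBLE):
    """Return a (T, T) boolean mask; True means the query may attend to the key.

    token_counts: dict mapping expert name -> number of tokens, in sequence order.
    """
    names = list(token_counts)

    # Expert index that owns each token in the concatenated sequence.
    owner = np.concatenate(
        [np.full(token_counts[name], i, dtype=np.int64) for i, name in enumerate(names)]
    )

    # Expert-level permission table: allow[q, k] is True if q may read k.
    allow = np.zeros((len(names), len(names)), dtype=bool)
    for qi, q in enumerate(names):
        for ki, k in enumerate(names):
            allow[qi, ki] = k in visible[q]

    # Expand the expert-level table to token level by indexing with owners.
    return allow[owner[:, None], owner[None, :]]
```

For example, `build_joint_mask({"understanding": 2, "perception": 1, "geometry": 3, "action": 2})` gives an 8×8 mask whose last two rows (the action tokens) are entirely `True`. In an attention layer, the `False` entries would typically be set to negative infinity before the softmax.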
On the leakage-free without-ego-status protocol, VLGA-Large ranks first among VLA methods on 15 of the 16 L2 and collision metrics, its ST-P3 L2 is the lowest of all methods (0.50 m average), and its collision rate is the lowest among VLA methods at every horizon (0.18% at 3 s).
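For readers unfamiliar with the open-loop L2 metric, the sketch below computes the displacement error of a predicted trajectory at the 1 s, 2 s, and 3 s horizons in two ways: averaged over all waypoints up to the horizon, which is how the ST-P3 convention is commonly described, and at the horizon waypoint only. The 2 Hz waypoint rate, the function name `l2_by_horizon`, and this reading of the conventions are assumptions; this is not the benchmark's official evaluation code.

```python
import numpy as np


def l2_by_horizon(pred, gt, hz=2, horizons=(1, 2, 3)):
    """Per-horizon L2 displacement error for one trajectory.

    pred, gt: (T, 2) arrays of future ego waypoints (x, y) in metres,
              sampled at `hz` waypoints per second (2 Hz is an assumption).
    Returns {horizon_seconds: {"cumulative_mean": ..., "at_horizon": ...}}.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Euclidean distance between prediction and ground truth at each waypoint.
    err = np.linalg.norm(pred - gt, axis=-1)  # shape (T,)

    results = {}
    for h in horizons:
        n = h * hz  # number of waypoints up to and including this horizon
        results[h] = {
            # Mean error over all waypoints up to the horizon.
            "cumulative_mean": float(err[:n].mean()),
            # Error at the horizon waypoint only.
            "at_horizon": float(err[n - 1]),
        }
    return results
```

An "average" L2 figure such as the 0.50 m above would then correspond to averaging the per-horizon values over the evaluation set and over the three horizons, again assuming this reading of the protocol.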
VLGA achieves the highest Driving Score of all methods, 79.08, beating the prior state of the art, UniDriveVLA, by +0.71, with improved Success Rate and Comfortness at comparable Efficiency.
Qualitative planning comparison on the nuScenes validation set. The predicted 3-second trajectory (yellow) and ground truth (green) are projected onto the front camera. VLGA stays closer to the ground truth through turns and around nearby vehicles, while the baseline (UniDriveVLA) drifts laterally.
Pointmap reconstructions with predicted and ground-truth trajectories on nuScenes. Each row is one validation sample. Left: the six surround-camera inputs. Right: the dense 3D pointmap predicted by VLGA's geometry expert, with the green ground-truth and yellow predicted ego trajectories overlaid. The reconstructions capture road surface, lane structure, and surrounding obstacles, and the predicted trajectory closely tracks the ground truth.
@article{yao2026vlga,
title={VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving},
author={Jin Yao and Dhruva Dixith Kurra and Tom Lampo and Zezhou Cheng and Danhua Guo and Burhan Yaman},
year={2026},
eprint={2606.12396},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.12396},
}