Abstract
3D vision foundation models like the Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, limiting its application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, which achieves up to 10× speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity across adjacent network layers remains stable, allowing merge decisions to be reused. Guided by these insights, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve the information that matters most for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness.
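To make the merge-and-cache idea concrete, the following is a minimal PyTorch sketch of cached token merging, not the paper's implementation. The anchor score (feature norm stands in for the paper's geometric importance), the cosine-similarity assignment rule, and all names (`merge_tokens`, `unmerge_tokens`, `refresh_every`, `transformer_blocks`) are illustrative assumptions; the sketch only shows how merge indices, once computed, can be cached and reused across layers.

```python
# Illustrative sketch of cached token merging (assumed API; not the paper's code).
import torch
import torch.nn.functional as F

def merge_tokens(x, keep_ratio=0.5, cached_idx=None):
    """Merge each token into its nearest anchor; reuse cached indices if given."""
    B, N, C = x.shape
    n_keep = int(N * keep_ratio)
    if cached_idx is None:
        # Stand-in importance score; the paper scores geometric importance instead.
        score = x.norm(dim=-1)                                # (B, N)
        anchor_idx = score.topk(n_keep, dim=1).indices        # (B, n_keep)
        # Assign every token to its most similar anchor via cosine similarity.
        xn = F.normalize(x, dim=-1)
        anchors = torch.gather(xn, 1, anchor_idx.unsqueeze(-1).expand(-1, -1, C))
        assign = (xn @ anchors.transpose(1, 2)).argmax(dim=-1)  # (B, N)
        cached_idx = (anchor_idx, assign)
    _, assign = cached_idx
    # Scatter-mean: average all tokens assigned to the same anchor.
    idx = assign.unsqueeze(-1).expand(-1, -1, C)
    merged = torch.zeros(B, n_keep, C, dtype=x.dtype, device=x.device)
    merged.scatter_add_(1, idx, x)
    counts = torch.zeros(B, n_keep, 1, dtype=x.dtype, device=x.device)
    counts.scatter_add_(1, assign.unsqueeze(-1),
                        torch.ones_like(assign, dtype=x.dtype).unsqueeze(-1))
    return merged / counts.clamp(min=1), cached_idx

def unmerge_tokens(merged, assign):
    """Broadcast each anchor's feature back to all tokens assigned to it."""
    C = merged.shape[-1]
    return torch.gather(merged, 1, assign.unsqueeze(-1).expand(-1, -1, C))

# Toy usage with random features and identity blocks (all hypothetical):
x = torch.randn(1, 1024, 64)
transformer_blocks = [torch.nn.Identity() for _ in range(8)]
cached, refresh_every = None, 4
for i, blk in enumerate(transformer_blocks):
    if i % refresh_every == 0:
        cached = None                        # periodically refresh the merge decision
    x_small, cached = merge_tokens(x, cached_idx=cached)
    x_small = blk(x_small)                   # attention would run on fewer tokens
    x = unmerge_tokens(x_small, cached[1])   # restore the full token sequence
```

Because adjacent layers see similar token-similarity structure (insight 2), recomputing `cached` only every few layers trades a negligible change in merge quality for the cost of the similarity and top-k computation at the skipped layers.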