Visual-Inertial Gaussian Splatting

Monocular SLAM in Large Scenes

Abstract

VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.

Given a sequence of RGB images and IMU readings, we first utilize the Visual-Inertial Front End (Sec. IV) to select keyframes and compute their initial depth and pose estimates through dense bundle adjustment. Additionally, we compute a depth-map uncertainty based on the covariance from the depth estimation process, filtering out geometrically inaccurate regions and sky areas. The 2D Gaussian Map module (Sec. V) incrementally adds and maintains Gaussian ellipsoids using the outputs of the visual front end. We design a management mechanism based on importance scores and error scores to effectively prune Gaussians. Furthermore, we propose a novel method to optimize multi-frame poses using single-frame rendering loss. To ensure scalability to large urban-scale scenes, we implement a CPU-GPU memory transfer mechanism. In the NVS Loop Closure module (Sec. VI), we leverage the novel view synthesis capability of GS to design an innovative loop closure detection method and correct the Gaussian map through Gaussian-pose pair matching. Additionally, we integrate a Dynamic Object Eraser module (Sec. VII) that masks out transient objects such as vehicles and pedestrians, ensuring consistent and accurate mapping under the static-scene assumption.
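To illustrate the score-based map management described above, the following is a minimal NumPy sketch of pruning driven by two per-Gaussian statistics: an importance score (e.g., accumulated blending weight across rendered keyframes) and an error score (e.g., mean photometric residual of the pixels a Gaussian contributed to). The function name, thresholds, and the way the scores are accumulated are hypothetical; the paper's actual Score Manager may differ in its exact criteria.

```python
import numpy as np

def prune_mask(importance: np.ndarray, error: np.ndarray,
               imp_thresh: float = 0.05, err_thresh: float = 0.9) -> np.ndarray:
    """Hypothetical score-based pruning rule.

    Keep a Gaussian only if it (a) contributes enough to rendered views
    (importance above a threshold) and (b) is not a persistent source of
    rendering error, i.e., a likely floater or mis-placed primitive.
    Returns a boolean mask over the Gaussians to keep.
    """
    return (importance > imp_thresh) & (error < err_thresh)

# Toy example: random per-Gaussian statistics for 1000 Gaussians.
rng = np.random.default_rng(0)
n = 1000
importance = rng.random(n)  # placeholder for accumulated blending weights
error = rng.random(n)       # placeholder for accumulated photometric residuals

keep = prune_mask(importance, error)
print(f"kept {keep.sum()} of {n} Gaussians")
```

In a full system, the surviving Gaussians would be re-indexed and the statistics reset periodically, so that importance reflects recent visibility rather than the entire trajectory.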

[Figure: Pose graph correction and Gaussian map correction. Panels compare captured RGB frames with the predicted RGB, depth, and normal renderings; panoramas and top-down views are shown together with the number of Gaussians per scene.]