VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM
framework designed for large scenes. The framework comprises four
main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure,
and Dynamic Eraser. In the VIO Front End, RGB frames are processed
through dense bundle adjustment and uncertainty estimation to
extract scene geometry and poses. Based on this output, the
mapping module incrementally constructs and maintains a 2D
Gaussian map. Key components of the 2D Gaussian Map include a
Sample-based Rasterizer, Score Manager, and Pose Refinement, which
collectively improve mapping speed and localization accuracy. This
enables the SLAM system to handle large-scale urban environments
with up to 50 million Gaussian ellipsoids. To ensure global
consistency in large-scale scenes, we design a Loop Closure
module, which innovatively leverages the Novel View Synthesis
(NVS) capabilities of Gaussian Splatting for loop closure
detection and correction of the Gaussian map. Additionally, we
propose a Dynamic Eraser to address the inevitable presence of
dynamic objects in real-world outdoor scenes. Extensive
evaluations in indoor and outdoor environments demonstrate that
our approach achieves localization performance on par with
Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM
methods. It also significantly outperforms all existing methods in
terms of mapping and rendering quality. Furthermore, we developed
a mobile app and verified that our framework can generate
high-quality Gaussian maps in real time using only a smartphone
camera and a low-frequency IMU sensor. To the best of our
knowledge, VINGS-Mono is the first monocular Gaussian SLAM method
capable of operating in outdoor environments and supporting
kilometer-scale large scenes.
Given a sequence of RGB images and IMU readings, we first
utilize the Visual Inertial Front End (Sec. IV) to select
keyframes and calculate the initial depth and pose information
of the keyframes through dense bundle adjustment.
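As an illustration, keyframe selection in dense-BA front ends is often gated on inter-frame pixel motion; the sketch below shows such a heuristic, with the threshold value and function name chosen for illustration rather than taken from the paper.

```python
import numpy as np

def is_keyframe(flow, thresh=2.5):
    """Decide whether the current frame becomes a keyframe.

    `flow` is an (H, W, 2) array of per-pixel motion relative to the
    last keyframe. The frame is promoted when the mean motion exceeds
    `thresh` pixels (a hypothetical value, not from the paper).
    """
    mean_motion = np.linalg.norm(flow, axis=-1).mean()
    return mean_motion > thresh
```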
Additionally, we compute the depth map uncertainty based on
the covariance from the depth estimation process, filtering
out geometrically inaccurate regions and sky areas. The 2D
Gaussian Map module (Sec. V) incrementally adds and maintains
Gaussian ellipsoids using the outputs of the visual front end.
We design a management mechanism based on importance scores
and error scores to effectively prune Gaussians. Furthermore,
we propose a novel method to optimize multi-frame poses using
single-frame rendering loss. To ensure scalability to large
urban-scale scenes, we implement a CPU-GPU memory transfer
mechanism. In the NVS Loop Closure Module (Sec. VI), we
leverage the novel view synthesis capability of GS to design
an innovative loop closure detection method and correct the
Gaussian map through Gaussian-pose pair matching.
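The detection step can be pictured as rendering the Gaussian map from historical candidate poses and comparing each rendering against the live frame. The sketch below uses normalized cross-correlation as the similarity measure; `render_fn`, the threshold, and the scoring rule are illustrative stand-ins, not the paper's exact criterion.

```python
import numpy as np

def detect_loop(render_fn, current_image, candidate_poses, sim_thresh=0.85):
    """Score loop candidates by rendering the Gaussian map from each
    historical pose and comparing the result with the live frame.

    `render_fn(pose) -> image` stands in for the GS rasterizer.
    Returns (best_pose, similarity) if a candidate clears the
    threshold, else (None, best_similarity).
    """
    def ncc(a, b):
        # Zero-mean, unit-variance normalized cross-correlation.
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())

    best_pose, best_sim = None, -1.0
    for pose in candidate_poses:
        sim = ncc(render_fn(pose), current_image)
        if sim > best_sim:
            best_pose, best_sim = pose, sim
    return (best_pose, best_sim) if best_sim > sim_thresh else (None, best_sim)
```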
Additionally, we integrate a Dynamic Object Eraser module
(Sec. VII) that masks out transient objects like vehicles and
pedestrians, ensuring consistent and accurate mapping under
static scene assumptions.
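The per-frame flow through the four modules above can be sketched as the following skeleton; all class and method names are hypothetical placeholders for the components described in Secs. IV-VII, not the actual implementation's API.

```python
class VingsMonoPipeline:
    """Hypothetical skeleton wiring the four modules together."""

    def __init__(self, front_end, gaussian_map, loop_closure, eraser):
        self.front_end = front_end        # VIO Front End (Sec. IV)
        self.gaussian_map = gaussian_map  # 2D Gaussian Map (Sec. V)
        self.loop_closure = loop_closure  # NVS Loop Closure (Sec. VI)
        self.eraser = eraser              # Dynamic Eraser (Sec. VII)

    def process(self, image, imu):
        # Dense BA yields a keyframe with depth, pose, and uncertainty,
        # or None when the frame is not promoted to a keyframe.
        kf = self.front_end.track(image, imu)
        if kf is None:
            return
        # Mask transient objects (vehicles, pedestrians) before mapping.
        kf.mask = self.eraser.mask_dynamic(kf)
        # Incrementally add/prune Gaussians and refine poses.
        self.gaussian_map.update(kf)
        # NVS-based loop detection; on a match, correct the map via
        # Gaussian-pose pair matching.
        match = self.loop_closure.detect(kf)
        if match is not None:
            self.gaussian_map.correct(match)
```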