Accurate LiDAR-camera calibration is the foundation of reliable multimodal fusion for environmental perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes that occur during vehicle/robot motion. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCalib. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully exploit the geometric information in the BEV features, we introduce a novel feature selector that chooses the most important features for the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on various datasets demonstrate that BEVCalib establishes a new state of the art. Under various noise conditions, BEVCalib outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation). In the open-source domain, it improves the best reproducible baseline by one order of magnitude.
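As a minimal illustration of the shared-BEV fusion idea described above, the sketch below concatenates a camera BEV grid and a LiDAR BEV grid and projects them into one shared feature space. The channel counts, the 1x1-convolution fusion, and the assumption of identical grid sizes are illustrative choices, not the exact architecture of BEVCalib.

```python
# Minimal sketch of fusing camera and LiDAR BEV features into a shared space.
# NOT the authors' implementation: backbones are omitted, and channel counts,
# grid sizes, and the 1x1-conv fusion are assumptions for illustration.
import torch
import torch.nn as nn


class SharedBEVFusion(nn.Module):
    def __init__(self, cam_channels: int = 80, lidar_channels: int = 256, out_channels: int = 256):
        super().__init__()
        # Project the concatenated modalities into one shared BEV feature space.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev:   (B, cam_channels,   H, W) BEV features lifted from images
        # lidar_bev: (B, lidar_channels, H, W) BEV features from the point cloud
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))


# Example usage with dummy tensors:
# fused = SharedBEVFusion()(torch.rand(1, 80, 128, 128), torch.rand(1, 256, 128, 128))
```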
The overall pipeline of our model consists of BEV feature extraction, an FPN BEV encoder, and a geometry-guided BEV decoder (GGBD). For BEV feature extraction, the camera and LiDAR inputs are encoded into BEV features by separate backbones and then fused into a shared BEV feature space. The FPN BEV encoder enriches the multi-scale geometric information of the BEV representations. The geometry-guided BEV decoder utilizes a novel feature selector that efficiently decodes calibration parameters from BEV features. \(\mathcal{L}_R\), \(\mathcal{L}_T\), and \(\mathcal{L}_{PC}\) denote the rotation loss, translation loss, and LiDAR reprojection loss, respectively. We use the total loss \(\mathcal{L} = \lambda_R \mathcal{L}_R + \lambda_T \mathcal{L}_T + \lambda_{PC}\mathcal{L}_{PC}\) for end-to-end optimization.
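The sketch below shows one way the weighted objective \(\mathcal{L} = \lambda_R \mathcal{L}_R + \lambda_T \mathcal{L}_T + \lambda_{PC}\mathcal{L}_{PC}\) could be assembled. The concrete forms of the individual terms (a geodesic rotation error, a smooth-L1 translation error, and a precomputed reprojection residual) are assumptions, since only the weighted sum is specified above.

```python
# Hedged sketch of the total objective L = lambda_R*L_R + lambda_T*L_T + lambda_PC*L_PC.
# The loss forms below are illustrative assumptions, not the paper's definitions.
import torch
import torch.nn.functional as F


def rotation_geodesic_loss(R_pred: torch.Tensor, R_gt: torch.Tensor) -> torch.Tensor:
    # Geodesic angle between rotation matrices: arccos((trace(R_pred^T R_gt) - 1) / 2).
    R_rel = torch.matmul(R_pred.transpose(-1, -2), R_gt)
    cos = (R_rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6)).mean()


def calibration_loss(R_pred, t_pred, R_gt, t_gt, pc_residual,
                     lambda_r: float = 1.0, lambda_t: float = 1.0, lambda_pc: float = 1.0):
    loss_r = rotation_geodesic_loss(R_pred, R_gt)   # L_R: rotation loss
    loss_t = F.smooth_l1_loss(t_pred, t_gt)         # L_T: translation loss
    loss_pc = pc_residual.mean()                    # L_PC: reprojection residual (assumed precomputed)
    return lambda_r * loss_r + lambda_t * loss_t + lambda_pc * loss_pc
```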
The GGBD component contains a feature selector (left) and a refinement module (right). The feature selector computes the positions of the BEV features using \(P_\mathcal{B} = \text{Set}(\{\text{Proj}(p) \mid p \in P_C^W\})\). The corresponding positional embeddings (PE) are added to preserve the geometric information of the selected features. After the decoder, the refinement module applies an average-pooling operation to aggregate high-level information, followed by two separate heads that predict translation and rotation.
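The sketch below illustrates the selection-and-refinement idea: gather only the BEV cells hit by projected LiDAR points, add positional embeddings, then average-pool and decode translation and rotation with two heads. The transformer decoder between selection and refinement, the Set() de-duplication of repeated cells, the grid parameters, the learnable PE table, and the quaternion rotation head are all assumptions made for this illustration.

```python
# Illustrative sketch of the feature selector + refinement heads.
# Omitted for brevity: the decoder between selection and refinement, and the
# Set() de-duplication of repeated BEV cells. Grid size, PE table, and head
# parameterizations are assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryGuidedDecoderSketch(nn.Module):
    def __init__(self, channels: int = 256, bev_size: int = 128):
        super().__init__()
        self.pos_embed = nn.Embedding(bev_size * bev_size, channels)  # PE per BEV cell
        self.trans_head = nn.Linear(channels, 3)   # predicts (tx, ty, tz)
        self.rot_head = nn.Linear(channels, 4)     # predicts a unit quaternion

    def forward(self, bev_feat: torch.Tensor, points_xy: torch.Tensor):
        # bev_feat:  (B, C, H, W) fused BEV features
        # points_xy: (B, N, 2) LiDAR points already projected to integer BEV grid indices
        B, C, H, W = bev_feat.shape
        flat_idx = points_xy[..., 1].long() * W + points_xy[..., 0].long()      # (B, N) cell ids
        flat_feat = bev_feat.flatten(2).transpose(1, 2)                         # (B, H*W, C)
        selected = torch.gather(flat_feat, 1, flat_idx.unsqueeze(-1).expand(-1, -1, C))
        selected = selected + self.pos_embed(flat_idx)                          # keep geometry via PE
        pooled = selected.mean(dim=1)                                           # average pooling
        quat = F.normalize(self.rot_head(pooled), dim=-1)
        return self.trans_head(pooled), quat
```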
In this paper, we introduce BEVCalib, the first LiDAR-camera extrinsic calibration model using BEV features. Its geometry-guided BEV decoder effectively and efficiently captures scene geometry, enhancing calibration accuracy. Results on KITTI, NuScenes, and our own indoor dataset with dynamic extrinsics demonstrate that our approach establishes a new state of the art among learning-based calibration methods. Under various noise conditions, BEVCalib outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation) respectively. BEVCalib also improves the best reproducible baseline by one order of magnitude, making an important contribution to the scarce open-source space in LiDAR-camera calibration.
@misc{yuan2025bevcaliblidarcameracalibrationgeometryguided,
      title={BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations},
      author={Weiduo Yuan and Jerry Li and Justin Yue and Divyank Shah and Konstantinos Karydis and Hang Qiu},
      year={2025},
      eprint={2506.02587},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.02587},
}