Accurate LiDAR-camera calibration is the foundation of reliable multimodal fusion for environmental perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes that occur during vehicle/robot motion. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCalib. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully exploit the geometric information in the BEV features, we introduce a novel feature selector that chooses the most important features for the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on various datasets demonstrate that BEVCalib establishes a new state of the art. Under various noise conditions, BEVCalib outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation). In the open-source domain, it improves the best reproducible baseline by one order of magnitude.
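As a minimal illustration of the shared-BEV fusion idea described above, the sketch below concatenates a camera BEV grid and a LiDAR BEV grid and projects them into one shared feature space. The channel counts, the 1x1-convolution fusion, and the assumption of identical grid sizes are illustrative choices, not the exact architecture of BEVCalib.

```python
# Minimal sketch of fusing camera and LiDAR BEV features into a shared space.
# NOT the authors' implementation: backbones are omitted, and channel counts,
# grid sizes, and the 1x1-conv fusion are assumptions for illustration.
import torch
import torch.nn as nn


class SharedBEVFusion(nn.Module):
    def __init__(self, cam_channels: int = 80, lidar_channels: int = 256, out_channels: int = 256):
        super().__init__()
        # Project the concatenated modalities into one shared BEV feature space.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev:   (B, cam_channels,   H, W) BEV features lifted from images
        # lidar_bev: (B, lidar_channels, H, W) BEV features from the point cloud
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))


# Example usage with dummy tensors:
# fused = SharedBEVFusion()(torch.rand(1, 80, 128, 128), torch.rand(1, 256, 128, 128))
```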
The overall pipeline of our model consists of BEV feature extraction, an FPN BEV encoder, and a geometry-guided BEV decoder (GGBD). For BEV feature extraction, the camera and LiDAR inputs are encoded into BEV features by separate backbones and then fused into a shared BEV feature space. The FPN BEV encoder enriches the multi-scale geometric information of the BEV representations. The geometry-guided BEV decoder utilizes a novel feature selector that efficiently decodes calibration parameters from BEV features. \(\mathcal{L}_R\), \(\mathcal{L}_T\), and \(\mathcal{L}_{PC}\) denote the rotation loss, translation loss, and LiDAR reprojection loss, respectively. We use the total loss \(\mathcal{L} = \lambda_R \mathcal{L}_R + \lambda_T \mathcal{L}_T + \lambda_{PC}\mathcal{L}_{PC}\) for end-to-end optimization.
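The sketch below shows one way the weighted objective \(\mathcal{L} = \lambda_R \mathcal{L}_R + \lambda_T \mathcal{L}_T + \lambda_{PC}\mathcal{L}_{PC}\) could be assembled. The concrete forms of the individual terms (a geodesic rotation error, a smooth-L1 translation error, and a precomputed reprojection residual) are assumptions, since only the weighted sum is specified above.

```python
# Hedged sketch of the total objective L = lambda_R*L_R + lambda_T*L_T + lambda_PC*L_PC.
# The loss forms below are illustrative assumptions, not the paper's definitions.
import torch
import torch.nn.functional as F


def rotation_geodesic_loss(R_pred: torch.Tensor, R_gt: torch.Tensor) -> torch.Tensor:
    # Geodesic angle between rotation matrices: arccos((trace(R_pred^T R_gt) - 1) / 2).
    R_rel = torch.matmul(R_pred.transpose(-1, -2), R_gt)
    cos = (R_rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6)).mean()


def calibration_loss(R_pred, t_pred, R_gt, t_gt, pc_residual,
                     lambda_r: float = 1.0, lambda_t: float = 1.0, lambda_pc: float = 1.0):
    loss_r = rotation_geodesic_loss(R_pred, R_gt)   # L_R: rotation loss
    loss_t = F.smooth_l1_loss(t_pred, t_gt)         # L_T: translation loss
    loss_pc = pc_residual.mean()                    # L_PC: reprojection residual (assumed precomputed)
    return lambda_r * loss_r + lambda_t * loss_t + lambda_pc * loss_pc
```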
The GGBD component contains a feature selector (left) and a refinement module (right). The feature selector computes the positions of the BEV features using \(P_\mathcal{B} = \text{Set}(\{\text{Proj}(p) \mid p \in P_C^W\})\). The corresponding positional embeddings (PE) are added to preserve the geometric information of the selected features. After the decoder, the refinement module applies an average-pooling operation to aggregate high-level information, followed by two separate heads that predict translation and rotation.
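The sketch below illustrates the selection-and-refinement idea: gather only the BEV cells hit by projected LiDAR points, add positional embeddings, then average-pool and decode translation and rotation with two heads. The transformer decoder between selection and refinement, the Set() de-duplication of repeated cells, the grid parameters, the learnable PE table, and the quaternion rotation head are all assumptions made for this illustration.

```python
# Illustrative sketch of the feature selector + refinement heads.
# Omitted for brevity: the decoder between selection and refinement, and the
# Set() de-duplication of repeated BEV cells. Grid size, PE table, and head
# parameterizations are assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryGuidedDecoderSketch(nn.Module):
    def __init__(self, channels: int = 256, bev_size: int = 128):
        super().__init__()
        self.pos_embed = nn.Embedding(bev_size * bev_size, channels)  # PE per BEV cell
        self.trans_head = nn.Linear(channels, 3)   # predicts (tx, ty, tz)
        self.rot_head = nn.Linear(channels, 4)     # predicts a unit quaternion

    def forward(self, bev_feat: torch.Tensor, points_xy: torch.Tensor):
        # bev_feat:  (B, C, H, W) fused BEV features
        # points_xy: (B, N, 2) LiDAR points already projected to integer BEV grid indices
        B, C, H, W = bev_feat.shape
        flat_idx = points_xy[..., 1].long() * W + points_xy[..., 0].long()      # (B, N) cell ids
        flat_feat = bev_feat.flatten(2).transpose(1, 2)                         # (B, H*W, C)
        selected = torch.gather(flat_feat, 1, flat_idx.unsqueeze(-1).expand(-1, -1, C))
        selected = selected + self.pos_embed(flat_idx)                          # keep geometry via PE
        pooled = selected.mean(dim=1)                                           # average pooling
        quat = F.normalize(self.rot_head(pooled), dim=-1)
        return self.trans_head(pooled), quat
```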
In this paper, we introduce BEVCalib, the first LiDAR-camera extrinsic calibration model using BEV features. Its geometry-guided BEV decoder effectively and efficiently captures scene geometry, enhancing calibration accuracy. Results on KITTI, NuScenes, and our own indoor dataset with dynamic extrinsics demonstrate that our approach establishes a new state of the art among learning-based calibration methods. Under various noise conditions, BEVCalib outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation) respectively. BEVCalib also improves the best reproducible baseline by one order of magnitude, making an important contribution to the scarce open-source space in LiDAR-camera calibration.
@misc{yuan2025bevcaliblidarcameracalibrationgeometryguided,
      title={BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations},
      author={Weiduo Yuan and Jerry Li and Justin Yue and Divyank Shah and Konstantinos Karydis and Hang Qiu},
      year={2025},
      eprint={2506.02587},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.02587},
}