CooperScene is the first real-world, multi-agent, multi-modal cooperative autonomy dataset with C-V2X communication characterization. It is organized into diverse real-world scenes (urban intersection, freeway ramp, local street, and parking lot) that introduce different traffic and occlusion patterns. In the urban intersection scenario, our data involves three connected autonomous vehicles and one instrumented infrastructure roadside unit, all equipped with multi-modal sensors and commercial C-V2X communication radios.
The dataset provides tightly aligned LiDAR, camera, GNSS/IMU, and C-V2X network data, enabling the community to jointly evaluate perception performance and the practical constraints imposed by real-world vehicular networks. CooperScene supports 3D object detection and motion prediction benchmarks with synchronized pair-wise C-V2X throughput measurements.
High-quality interactive scenes from three vehicles and one infrastructure agent across diverse real-world urban intersection scenarios.
Real-world C-V2X network traces, including latency, throughput, packet loss rate, and jitter, bridging wireless networking and autonomous driving research.
LiDAR, camera, and GNSS/IMU data streams per agent, well calibrated and temporally synchronized with sub-millisecond accuracy.
GNSS-RTK positioning with spatial-temporal ICP alignment, achieving an average RMSE of 0.2 m across all agents.
PTP-based time synchronization and hardware triggering ensure globally consistent timestamps across all platforms and sensors.
Comprehensive benchmarks for detection and motion prediction evaluating both perception accuracy and communication efficiency.
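The alignment figure quoted above (an average RMSE of 0.2 m) is a root-mean-square error over residual distances between matched points after registration. The sketch below shows that calculation only; the function name `alignment_rmse` and the sample points are hypothetical, and it does not reproduce the ICP pipeline used to build CooperScene.

```python
import math


def alignment_rmse(source_points, target_points):
    """RMSE of Euclidean distances between matched 3D points (in meters)."""
    if not source_points or len(source_points) != len(target_points):
        raise ValueError("need two non-empty, equally sized lists of matched points")
    squared_distances = [
        sum((s - t) ** 2 for s, t in zip(p, q))
        for p, q in zip(source_points, target_points)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Hypothetical matched points from two agents after alignment,
# each pair offset by 0.2 m along a single axis.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tgt = [(0.2, 0.0, 0.0), (1.0, 0.2, 0.0), (0.0, 1.0, 0.2)]

rmse = alignment_rmse(src, tgt)
print(f"alignment RMSE: {rmse:.3f} m")
```

In practice the matched pairs would come from nearest-neighbor correspondences between point clouds of different agents, which is an assumption here rather than something the page specifies.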
We benchmark state-of-the-art cooperative perception methods under five cooperative settings with varying numbers of agents. All models are retrained on CooperScene.
Benchmark of cooperative perception under C-V2X and under an unlimited (infeasible) network, with an increasing number of agents, in terms of accuracy (mAP), total data sharing size per frame, and the latency that would result if all data were transmitted over C-V2X. Bold marks the best result per column within each method.
| Model | Agents | C-V2X mAP@0.3 | C-V2X mAP@0.5 | C-V2X mAP@0.7 | Unlimited mAP@0.3 | Unlimited mAP@0.5 | Unlimited mAP@0.7 | Sharing Size (MB) | Latency under C-V2X (ms) |
|---|---|---|---|---|---|---|---|---|---|
| V2VNet | V+I | 0.36 | 0.22 | 0.10 | 0.45 | 0.30 | 0.14 | 10.0 | 64,349 |
| V2VNet | V+V | 0.27 | 0.16 | 0.06 | 0.38 | 0.28 | 0.13 | 10.0 | 64,038 |
| V2VNet | V+V+I | 0.33 | 0.20 | 0.10 | 0.48 | 0.39 | 0.19 | 20.0 | 76,879 |
| V2VNet | V+2V | 0.27 | 0.16 | 0.06 | 0.50 | 0.50 | 0.26 | 20.0 | 72,773 |
| V2VNet | V+2V+I | 0.33 | 0.19 | 0.10 | 0.63 | 0.56 | 0.29 | 30.0 | 107,432 |
| V2X-ViT | V+I | 0.36 | 0.24 | 0.11 | 0.38 | 0.24 | 0.11 | 0.338 | 2,211 |
| V2X-ViT | V+V | 0.28 | 0.23 | 0.10 | 0.34 | 0.28 | 0.12 | 0.338 | 1,305 |
| V2X-ViT | V+V+I | 0.33 | 0.28 | 0.11 | 0.40 | 0.33 | 0.14 | 0.667 | 9,215 |
| V2X-ViT | V+2V | 0.39 | 0.31 | 0.13 | 0.52 | 0.41 | 0.17 | 0.667 | 7,022 |
| V2X-ViT | V+2V+I | 0.37 | 0.30 | 0.12 | 0.53 | 0.42 | 0.16 | 1.014 | 5,976 |
| V2VAM | V+I | 0.37 | 0.24 | 0.13 | 0.40 | 0.27 | 0.14 | 0.423 | 2,264 |
| V2VAM | V+V | 0.36 | 0.28 | 0.13 | 0.44 | 0.37 | 0.18 | 0.423 | 2,253 |
| V2VAM | V+V+I | 0.36 | 0.28 | 0.13 | 0.46 | 0.38 | 0.19 | 0.846 | 10,301 |
| V2VAM | V+2V | 0.43 | 0.35 | 0.18 | 0.63 | 0.57 | 0.33 | 0.846 | 8,117 |
| V2VAM | V+2V+I | 0.42 | 0.32 | 0.17 | 0.63 | 0.56 | 0.34 | 1.269 | 11,302 |
| CoBEVT | V+I | 0.33 | 0.23 | 0.12 | 0.37 | 0.26 | 0.14 | 0.500 | 2,707 |
| CoBEVT | V+V | 0.32 | 0.24 | 0.13 | 0.40 | 0.33 | 0.18 | 0.500 | 2,393 |
| CoBEVT | V+V+I | 0.34 | 0.27 | 0.14 | 0.44 | 0.38 | 0.21 | 1.000 | 11,059 |
| CoBEVT | V+2V | 0.37 | 0.31 | 0.17 | 0.56 | 0.51 | 0.32 | 1.000 | 8,987 |
| CoBEVT | V+2V+I | 0.36 | 0.30 | 0.17 | 0.57 | 0.53 | 0.33 | 1.500 | 12,381 |
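The sharing size and latency columns above can be related by a rough back-of-envelope calculation: dividing the shared bits by the reported latency gives an implied effective throughput. The sketch below does this for the V+I rows. It assumes 1 MB = 10^6 bytes and that the latency is dominated by transmission time; neither assumption is stated on this page, so the results are illustrative rather than measurements from the dataset. The helper name `implied_throughput_mbps` is hypothetical.

```python
# (model, agents, sharing size in MB, latency in ms), copied from the V+I rows above.
ROWS = [
    ("V2VNet", "V+I", 10.0, 64349),
    ("V2X-ViT", "V+I", 0.338, 2211),
    ("V2VAM", "V+I", 0.423, 2264),
    ("CoBEVT", "V+I", 0.500, 2707),
]


def implied_throughput_mbps(size_mb, latency_ms):
    """Effective throughput in Mbit/s, assuming 1 MB = 10**6 bytes."""
    bits = size_mb * 1e6 * 8
    seconds = latency_ms / 1000
    return bits / seconds / 1e6


for model, agents, size_mb, latency_ms in ROWS:
    mbps = implied_throughput_mbps(size_mb, latency_ms)
    print(f"{model:8s} {agents:4s} {mbps:.2f} Mbit/s")
```

Under these assumptions every V+I row works out to roughly 1.2 to 1.5 Mbit/s, which is why the 10 MB feature maps shared by V2VNet take about a minute to transmit while the sub-megabyte payloads of the other methods take a few seconds.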
Impact of sensing modalities under different cooperative agent configurations, built on the BEVFusion framework with LiDAR-only and LiDAR+Camera configurations.
| Agent | Modality | BEV mAP@0.3 | BEV mAP@0.5 | BEV mAP@0.7 | 3D mAP@0.3 | 3D mAP@0.5 | 3D mAP@0.7 |
|---|---|---|---|---|---|---|---|
| V | LiDAR | 0.63 | 0.46 | 0.29 | 0.58 | 0.36 | 0.17 |
| V | LiDAR+Camera | 0.64 | 0.46 | 0.30 | 0.58 | 0.36 | 0.18 |
| V+V | LiDAR | 0.68 | 0.57 | 0.41 | 0.62 | 0.50 | 0.19 |
| V+V | LiDAR+Camera | 0.72 | 0.58 | 0.43 | 0.64 | 0.53 | 0.22 |
| V+2V | LiDAR | 0.78 | 0.69 | 0.51 | 0.75 | 0.61 | 0.20 |
| V+2V | LiDAR+Camera | 0.79 | 0.70 | 0.54 | 0.77 | 0.64 | 0.32 |
Multi-modal feature fusion (LiDAR+Camera) matches or improves on LiDAR-only performance in every setting, with larger gains as more agents cooperate.
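To make the comparison between modalities concrete, the sketch below subtracts the LiDAR-only mAP from the LiDAR+Camera mAP for each agent configuration, using the values from the table above. The dictionary layout and the function name `fusion_gains` are illustrative choices, not part of any CooperScene tooling.

```python
# mAP values from the modality table, ordered as
# (BEV@0.3, BEV@0.5, BEV@0.7, 3D@0.3, 3D@0.5, 3D@0.7).
RESULTS = {
    "V": {
        "LiDAR": (0.63, 0.46, 0.29, 0.58, 0.36, 0.17),
        "LiDAR+Camera": (0.64, 0.46, 0.30, 0.58, 0.36, 0.18),
    },
    "V+V": {
        "LiDAR": (0.68, 0.57, 0.41, 0.62, 0.50, 0.19),
        "LiDAR+Camera": (0.72, 0.58, 0.43, 0.64, 0.53, 0.22),
    },
    "V+2V": {
        "LiDAR": (0.78, 0.69, 0.51, 0.75, 0.61, 0.20),
        "LiDAR+Camera": (0.79, 0.70, 0.54, 0.77, 0.64, 0.32),
    },
}


def fusion_gains(results):
    """Per-metric mAP gain of LiDAR+Camera over LiDAR-only, rounded to 2 decimals."""
    gains = {}
    for agents, by_modality in results.items():
        lidar = by_modality["LiDAR"]
        fused = by_modality["LiDAR+Camera"]
        gains[agents] = tuple(round(f - l, 2) for f, l in zip(fused, lidar))
    return gains


gains = fusion_gains(RESULTS)
for agents, deltas in gains.items():
    mean_gain = sum(deltas) / len(deltas)
    print(f"{agents:5s} gains={deltas} mean={mean_gain:.3f}")
```

No gain is negative, and the mean gain grows from the single-vehicle setting to V+2V, with the largest single improvement at 3D mAP@0.7 for V+2V.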
Motion prediction performance using the perception-prediction (P&P) setting with the CMP framework.
| Model | Agent | minADE@1s ↓ | minADE@3s ↓ | minADE@5s ↓ | minFDE@1s ↓ | minFDE@3s ↓ | minFDE@5s ↓ |
|---|---|---|---|---|---|---|---|
| V2VNet | V | 0.4895 | 2.1648 | 4.4538 | 0.9307 | 5.7344 | 10.6390 |
| V2VNet | V+V | 0.5424 | 1.5611 | 3.5591 | 0.7528 | 4.4820 | 9.1667 |
| V2VNet | V+2V | 0.7610 | 1.3796 | 2.9096 | 1.0079 | 3.4762 | 7.3660 |
| CMP | V | 0.3851 | 0.8421 | 1.4022 | 0.5786 | 1.6125 | 3.0712 |
| CMP | V+V | 0.4076 | 0.9330 | 1.6416 | 0.6143 | 1.8492 | 3.7389 |
| CMP | V+2V | 0.3214 | 0.7252 | 1.2893 | 0.4723 | 1.4551 | 2.9716 |
Short-term predictions (1s) achieve strong accuracy, while long-horizon reasoning (5s) remains an open research challenge.
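For readers unfamiliar with the metrics in the table above, the sketch below computes minADE and minFDE under their usual definitions: for a set of candidate trajectories (modes), minADE is the lowest average displacement error and minFDE is the lowest final-step displacement error against the ground truth. The number of modes, the sampling rate, and the sample trajectories are assumptions for illustration; this page does not specify them, and a horizon such as @3s would correspond to truncating the trajectories to the matching number of steps.

```python
import math


def displacement_errors(pred, gt):
    """Per-step Euclidean distance between a predicted and a ground-truth 2D path."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def min_ade(modes, gt):
    """Lowest average displacement error over all predicted modes."""
    return min(
        sum(errors) / len(errors)
        for errors in (displacement_errors(mode, gt) for mode in modes)
    )


def min_fde(modes, gt):
    """Lowest final-step displacement error over all predicted modes."""
    return min(displacement_errors(mode, gt)[-1] for mode in modes)


# Hypothetical ground truth and two predicted modes (x, y in meters).
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
mode_a = [(1.0, 0.0), (2.0, 0.0), (3.0, 3.0)]  # accurate early, drifts at the end
mode_b = [(1.0, 2.0), (2.0, 2.0), (3.0, 1.0)]  # offset early, closer at the end

ade = min_ade([mode_a, mode_b], gt)
fde = min_fde([mode_a, mode_b], gt)
print(f"minADE={ade:.2f} m, minFDE={fde:.2f} m")
```

In this example the two metrics are achieved by different modes: `mode_a` gives the best average error while `mode_b` gives the best final error, which illustrates why the table reports both.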
CooperScene is hosted on data.ucr.edu. We provide a mini set for quick exploration and the full dataset in manageable chunks.
A curated mini set of scenes for quick exploration, with multi-agent LiDAR, camera, and 3D annotations.
Download Mini Set (~2.9 GB)
Complete CooperScene dataset split into downloadable chunks. Includes train/val/test splits with all modalities.
Download Full Dataset (~94.3 GB, in chunks)
Pre-packaged MCAP files for Foxglove Studio. Browse data interactively without processing raw files.
Download MCAP Files (~90.9 GB)