CooperScene

Multi-Modal Cooperative Autonomy Benchmark
with C-V2X Communication Characterization

University of California, Riverside
ECCV 2026

Dataset Preview

Overview

CooperScene is the first real-world, multi-agent, multi-modal cooperative autonomy dataset with C-V2X communication characterization. It is organized into diverse real-world scenes (urban intersection, freeway ramp, local street, and parking lot) that introduce different traffic and occlusion patterns. In the urban intersection scenario, the data involves three connected autonomous vehicles and one instrumented infrastructure roadside unit, all equipped with multi-modal sensors and commercial C-V2X communication radios.

The dataset provides tightly aligned LiDAR, camera, GNSS/IMU, and C-V2X network data, enabling the community to jointly evaluate perception performance and the practical constraints imposed by real-world vehicular networks. CooperScene supports 3D object detection and motion prediction benchmarks with synchronized pair-wise C-V2X throughput measurements.

59K Synchronized LiDAR Frames
53K Image Frames
344K 3D Bounding Box Labels
4 Cooperative Agents

Key Features

Multi-Agent Cooperative Scenes

High-quality interactive scenes from three vehicles and one infrastructure agent across diverse real-world urban intersection scenarios.

C-V2X Communication Traces

Real-world C-V2X network traces including latency, throughput, packet loss rate and jitter — bridging wireless networking and autonomous driving research.

Multi-Modal Sensing

LiDAR, camera, and GNSS/IMU data streams per agent, well calibrated and temporally synchronized with sub-millisecond accuracy.

Centimeter-Level Localization

GNSS-RTK positioning with spatial-temporal ICP alignment, achieving an average RMSE of 0.2 m across all agents.

Precise Synchronization

PTP-based time synchronization and hardware triggering ensure globally consistent timestamps across all platforms and sensors.

Unified Benchmark

Comprehensive benchmarks for detection and motion prediction evaluating both perception accuracy and communication efficiency.
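The four trace metrics named under C-V2X Communication Traces (latency, throughput, packet loss rate, jitter) can be derived from per-packet send and receive timestamps. The sketch below is illustrative only: the record layout `(seq, send_time_s, recv_time_s, payload_bytes)` and the function name `summarize_trace` are hypothetical and do not describe CooperScene's actual trace format, and jitter is assumed here to mean the average absolute change in latency between consecutive received packets, which is one of several definitions in use.

```python
from statistics import mean

def summarize_trace(packets):
    """Summarize a packet trace.

    packets: list of (seq, send_time_s, recv_time_s or None, payload_bytes).
    A recv_time_s of None marks a lost packet. This layout is a hypothetical
    example, not the dataset's real schema.
    """
    received = [p for p in packets if p[2] is not None]
    latencies_ms = [(recv - send) * 1e3 for _, send, recv, _ in received]
    # Observation window: first transmission to last reception.
    duration_s = max(p[2] for p in received) - min(p[1] for p in packets)
    received_bits = sum(p[3] for p in received) * 8
    # Assumed jitter definition: mean absolute latency change between
    # consecutive received packets.
    jitter_ms = mean(abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:]))
    return {
        "latency_ms": mean(latencies_ms),
        "throughput_mbps": received_bits / duration_s / 1e6,
        "loss_rate": 1 - len(received) / len(packets),
        "jitter_ms": jitter_ms,
    }

# Toy trace: five 1000-byte packets sent every 100 ms, one of them lost.
TRACE = [
    (0, 0.000, 0.020, 1000),
    (1, 0.100, 0.130, 1000),
    (2, 0.200, None, 1000),
    (3, 0.300, 0.320, 1000),
    (4, 0.400, 0.450, 1000),
]
SUMMARY = summarize_trace(TRACE)
for name, value in SUMMARY.items():
    print(f"{name}: {value:.3f}")
```

On this toy trace the mean latency is about 30 ms and the loss rate about 0.2, since one of five packets never arrives.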

Benchmark

We benchmark state-of-the-art cooperative perception methods under five cooperative settings with varying numbers of agents. All models are retrained on CooperScene.

Agent Configurations

V+I: 1 vehicle + infrastructure
V+V: 1 vehicle + 1 peer vehicle
V+V+I: 1 vehicle + 1 peer vehicle + infrastructure
V+2V: 1 vehicle + 2 peer vehicles
V+2V+I: all 3 vehicles + infrastructure

Cooperative 3D Object Detection

Benchmark of cooperative perception under C-V2X and an unlimited (infeasible) network with an increasing number of agents, reporting accuracy (mAP), total data sharing size per frame, and the latency incurred if all data were transmitted over C-V2X. Bold marks the best result per column within each method.

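| Model | Agents | mAP@0.3 (C-V2X) | mAP@0.5 (C-V2X) | mAP@0.7 (C-V2X) | mAP@0.3 (Unlimited) | mAP@0.5 (Unlimited) | mAP@0.7 (Unlimited) | Sharing Size (MB) | Latency under C-V2X (ms) |
|---|---|---|---|---|---|---|---|---|---|
| V2VNet | V+I | 0.36 | 0.22 | 0.10 | 0.45 | 0.30 | 0.14 | 10.0 | 64,349 |
| V2VNet | V+V | 0.27 | 0.16 | 0.06 | 0.38 | 0.28 | 0.13 | 10.0 | 64,038 |
| V2VNet | V+V+I | 0.33 | 0.20 | 0.10 | 0.48 | 0.39 | 0.19 | 20.0 | 76,879 |
| V2VNet | V+2V | 0.27 | 0.16 | 0.06 | 0.50 | 0.50 | 0.26 | 20.0 | 72,773 |
| V2VNet | V+2V+I | 0.33 | 0.19 | 0.10 | 0.63 | 0.56 | 0.29 | 30.0 | 107,432 |
| V2X-ViT | V+I | 0.36 | 0.24 | 0.11 | 0.38 | 0.24 | 0.11 | 0.338 | 2,211 |
| V2X-ViT | V+V | 0.28 | 0.23 | 0.10 | 0.34 | 0.28 | 0.12 | 0.338 | 1,305 |
| V2X-ViT | V+V+I | 0.33 | 0.28 | 0.11 | 0.40 | 0.33 | 0.14 | 0.667 | 9,215 |
| V2X-ViT | V+2V | 0.39 | 0.31 | 0.13 | 0.52 | 0.41 | 0.17 | 0.667 | 7,022 |
| V2X-ViT | V+2V+I | 0.37 | 0.30 | 0.12 | 0.53 | 0.42 | 0.16 | 1.014 | 5,976 |
| V2VAM | V+I | 0.37 | 0.24 | 0.13 | 0.40 | 0.27 | 0.14 | 0.423 | 2,264 |
| V2VAM | V+V | 0.36 | 0.28 | 0.13 | 0.44 | 0.37 | 0.18 | 0.423 | 2,253 |
| V2VAM | V+V+I | 0.36 | 0.28 | 0.13 | 0.46 | 0.38 | 0.19 | 0.846 | 10,301 |
| V2VAM | V+2V | 0.43 | 0.35 | 0.18 | 0.63 | 0.57 | 0.33 | 0.846 | 8,117 |
| V2VAM | V+2V+I | 0.42 | 0.32 | 0.17 | 0.63 | 0.56 | 0.34 | 1.269 | 11,302 |
| CoBEVT | V+I | 0.33 | 0.23 | 0.12 | 0.37 | 0.26 | 0.14 | 0.500 | 2,707 |
| CoBEVT | V+V | 0.32 | 0.24 | 0.13 | 0.40 | 0.33 | 0.18 | 0.500 | 2,393 |
| CoBEVT | V+V+I | 0.34 | 0.27 | 0.14 | 0.44 | 0.38 | 0.21 | 1.000 | 11,059 |
| CoBEVT | V+2V | 0.37 | 0.31 | 0.17 | 0.56 | 0.51 | 0.32 | 1.000 | 8,987 |
| CoBEVT | V+2V+I | 0.36 | 0.30 | 0.17 | 0.57 | 0.53 | 0.33 | 1.500 | 12,381 |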
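To relate the sharing size and latency columns, the effective link rate implied by a row can be estimated as shared bits divided by transmission time. The sketch below does this for the V+I rows, using values copied from the detection results above. It assumes 1 MB means 10^6 bytes, which the source does not specify, and the function name `implied_throughput_mbps` is chosen for illustration.

```python
# (sharing size in MB, latency under C-V2X in ms) for the V+I rows,
# copied from the detection benchmark results.
ROWS = {
    "V2VNet": (10.0, 64349),
    "V2X-ViT": (0.338, 2211),
    "V2VAM": (0.423, 2264),
    "CoBEVT": (0.500, 2707),
}

def implied_throughput_mbps(size_mb: float, latency_ms: float) -> float:
    """Effective rate in Mbit/s, assuming 1 MB = 10**6 bytes (an assumption)."""
    bits = size_mb * 1e6 * 8
    seconds = latency_ms / 1e3
    return bits / seconds / 1e6

THROUGHPUT = {model: implied_throughput_mbps(size, latency)
              for model, (size, latency) in ROWS.items()}

for model, mbps in THROUGHPUT.items():
    print(f"{model}: {mbps:.2f} Mbit/s")
```

Under that assumption the four V+I rows imply roughly 1.2 to 1.5 Mbit/s. Every latency reported in the benchmark exceeds one second, so even the most compact feature maps cannot be exchanged in full at typical sensor frame rates.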

Multi-Modality Evaluation (BEVFusion Backbone)

Impact of sensing modalities under different cooperative agent configurations, built on the BEVFusion framework with LiDAR-only and LiDAR+Camera configurations.

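| Agent | Modality | mAP@0.3 (BEV) | mAP@0.5 (BEV) | mAP@0.7 (BEV) | mAP@0.3 (3D) | mAP@0.5 (3D) | mAP@0.7 (3D) |
|---|---|---|---|---|---|---|---|
| V | LiDAR | 0.63 | 0.46 | 0.29 | 0.58 | 0.36 | 0.17 |
| V | LiDAR+Camera | 0.64 | 0.46 | 0.30 | 0.58 | 0.36 | 0.18 |
| V+V | LiDAR | 0.68 | 0.57 | 0.41 | 0.62 | 0.50 | 0.19 |
| V+V | LiDAR+Camera | 0.72 | 0.58 | 0.43 | 0.64 | 0.53 | 0.22 |
| V+2V | LiDAR | 0.78 | 0.69 | 0.51 | 0.75 | 0.61 | 0.20 |
| V+2V | LiDAR+Camera | 0.79 | 0.70 | 0.54 | 0.77 | 0.64 | 0.32 |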

Multi-modal feature fusion (LiDAR+Camera) matches or improves on LiDAR-only mAP in every configuration, and the largest gain appears with the most cooperative agents: at V+2V, 3D mAP@0.7 rises from 0.20 to 0.32.
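The fusion gain can be checked directly from the reported numbers. The following sketch subtracts the LiDAR-only mAP from the LiDAR+Camera mAP for each configuration. The values are copied from the multi-modality results above, and the names `RESULTS` and `fusion_gain` are illustrative.

```python
from statistics import mean

# Per configuration and modality:
# [BEV@0.3, BEV@0.5, BEV@0.7, 3D@0.3, 3D@0.5, 3D@0.7]
RESULTS = {
    "V": {
        "LiDAR": [0.63, 0.46, 0.29, 0.58, 0.36, 0.17],
        "LiDAR+Camera": [0.64, 0.46, 0.30, 0.58, 0.36, 0.18],
    },
    "V+V": {
        "LiDAR": [0.68, 0.57, 0.41, 0.62, 0.50, 0.19],
        "LiDAR+Camera": [0.72, 0.58, 0.43, 0.64, 0.53, 0.22],
    },
    "V+2V": {
        "LiDAR": [0.78, 0.69, 0.51, 0.75, 0.61, 0.20],
        "LiDAR+Camera": [0.79, 0.70, 0.54, 0.77, 0.64, 0.32],
    },
}

def fusion_gain(config: str) -> list[float]:
    """mAP gain of LiDAR+Camera over LiDAR-only, per metric column."""
    lidar = RESULTS[config]["LiDAR"]
    fused = RESULTS[config]["LiDAR+Camera"]
    # Round to two decimals to match the precision of the reported values.
    return [round(f - l, 2) for l, f in zip(lidar, fused)]

GAINS = {config: fusion_gain(config) for config in RESULTS}
for config, gains in GAINS.items():
    print(f"{config}: gains={gains}, mean={mean(gains):.3f}")
```

No gain is negative, and the mean gain across the six metrics increases from V to V+V to V+2V.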

Cooperative Motion Prediction (CMP)

Motion prediction performance using the perception-prediction (P&P) setting with the CMP framework.

| Model | Agent | minADE@1s ↓ | minADE@3s ↓ | minADE@5s ↓ | minFDE@1s ↓ | minFDE@3s ↓ | minFDE@5s ↓ |
|---|---|---|---|---|---|---|---|
| V2VNet | V | 0.4895 | 2.1648 | 4.4538 | 0.9307 | 5.7344 | 10.6390 |
| V2VNet | V+V | 0.5424 | 1.5611 | 3.5591 | 0.7528 | 4.4820 | 9.1667 |
| V2VNet | V+2V | 0.7610 | 1.3796 | 2.9096 | 1.0079 | 3.4762 | 7.3660 |
| CMP | V | 0.3851 | 0.8421 | 1.4022 | 0.5786 | 1.6125 | 3.0712 |
| CMP | V+V | 0.4076 | 0.9330 | 1.6416 | 0.6143 | 1.8492 | 3.7389 |
| CMP | V+2V | 0.3214 | 0.7252 | 1.2893 | 0.4723 | 1.4551 | 2.9716 |

Short-term predictions (1s) achieve strong accuracy, while long-horizon reasoning (5s) remains an open research challenge.
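For readers unfamiliar with the metrics, minADE and minFDE are commonly defined over K candidate trajectories per agent: minADE is the lowest average displacement error among the candidates up to the horizon, and minFDE is the lowest displacement error at the horizon's final step. The sketch below follows that common definition. It is not taken from the CooperScene evaluation code, and the source does not state the number of candidates or the sampling rate, so the toy values are assumptions.

```python
import math

def min_ade(preds, gt, horizon_steps):
    """Lowest mean L2 error over the first `horizon_steps` steps among candidates."""
    return min(
        sum(math.dist(p, g) for p, g in zip(pred[:horizon_steps], gt[:horizon_steps]))
        / horizon_steps
        for pred in preds
    )

def min_fde(preds, gt, horizon_steps):
    """Lowest L2 error at the last step of the horizon among candidates."""
    last = horizon_steps - 1
    return min(math.dist(pred[last], gt[last]) for pred in preds)

# Toy example: one agent, four future steps, two candidate trajectories.
GT = [(1, 0), (2, 0), (3, 0), (4, 0)]
PREDS = [
    [(1, 0), (2, 0), (3, 1), (4, 2)],  # accurate early, drifts later
    [(1, 1), (2, 1), (3, 1), (4, 1)],  # constant 1-unit lateral offset
]

print(min_ade(PREDS, GT, 4))  # 0.75, from the first candidate
print(min_fde(PREDS, GT, 4))  # 1.0, from the second candidate
```

As the toy case shows, the two metrics may be minimized by different candidates, and a trajectory that is accurate over a short horizon can still drift substantially by the final step.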

Download

CooperScene is hosted on data.ucr.edu. We provide a mini set for quick exploration and the full dataset in manageable chunks.

Mini Set

A curated mini set of scenes for quick exploration, with multi-agent LiDAR, camera, and 3D annotations.

Download Mini Set

~2.9 GB

Full Dataset

Complete CooperScene dataset split into downloadable chunks. Includes train/val/test splits with all modalities.

Download Full Dataset

~94.3 GB (in chunks)

MCAP Visualizations

Pre-packaged MCAP files for Foxglove Studio. Browse data interactively without processing raw files.

Download MCAP Files

~90.9 GB