CooperScene: Multi-Modal Cooperative Autonomy Benchmark with C-V2X

Overview

CooperScene is the first real-world, multi-agent, multi-modal cooperative autonomy dataset with C-V2X communication characterization. It is organized into diverse real-world scenes (urban intersection, freeway ramp, local street, and parking lot) that introduce different traffic and occlusion patterns. In the urban intersection scenario, our data involves three connected autonomous vehicles and one instrumented infrastructure roadside unit, all equipped with multi-modal sensors and commercial C-V2X communication radios.

The dataset provides tightly aligned LiDAR, camera, GNSS/IMU, and C-V2X network data, enabling the community to jointly evaluate perception performance and the practical constraints under real-world vehicular networks. CooperScene supports 3D object detection and motion prediction benchmarks with synchronized pair-wise C-V2X throughput measurements.

59K Synchronized LiDAR Frames

53K Image Frames

344K 3D Bounding Box Labels

4 Cooperative Agents

Key Features

Multi-Agent Cooperative Scenes

High-quality interactive scenes from three vehicles and one infrastructure agent across diverse real-world urban intersection scenarios.

C-V2X Communication Traces

Real-world C-V2X network traces including latency, throughput, packet loss rate and jitter — bridging wireless networking and autonomous driving research.

Multi-Modal Sensing

LiDAR, camera, and GNSS/IMU data streams per agent, well calibrated and temporally synchronized with sub-millisecond accuracy.

Centimeter-Level Localization

GNSS-RTK positioning with spatial-temporal ICP alignment, achieving an average RMSE of 0.2 m across all agents.

Precise Synchronization

PTP-based time synchronization and hardware triggering ensure globally consistent timestamps across all platforms and sensors.

Unified Benchmark

Comprehensive benchmarks for detection and motion prediction evaluating both perception accuracy and communication efficiency.

Benchmark

We benchmark state-of-the-art cooperative perception methods under five cooperative settings with varying numbers of agents. All models are retrained on CooperScene.

Agent Configurations

V+I

1 vehicle + infrastructure

V+V

1 vehicle + 1 peer vehicle

V+V+I

1 vehicle + 1 peer + infra

V+2V

1 vehicle + 2 peer vehicles

V+2V+I

All 3 vehicles + infra
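
When looping over these settings in an evaluation script, it can help to write them down as a small lookup table. The following is a minimal sketch of one such encoding; the names `AGENT_CONFIGS` and `num_agents` are hypothetical and not part of any CooperScene tooling.

```python
# Hypothetical encoding of the five cooperative settings listed above.
# Each entry is (number of vehicles including the ego vehicle, infrastructure present).
AGENT_CONFIGS = {
    "V+I": (1, True),
    "V+V": (2, False),
    "V+V+I": (2, True),
    "V+2V": (3, False),
    "V+2V+I": (3, True),
}


def num_agents(config: str) -> int:
    """Total number of cooperating agents (vehicles plus infrastructure)."""
    vehicles, has_infra = AGENT_CONFIGS[config]
    return vehicles + int(has_infra)


for name in AGENT_CONFIGS:
    print(f"{name:7s} -> {num_agents(name)} agents")
```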


Cooperative 3D Object Detection

Cooperative perception under the C-V2X network and under an unlimited (infeasible) network, with an increasing number of agents. Results are reported as accuracy (mAP), total shared data size per frame, and the latency that would result if all data were transmitted over C-V2X. Bold marks the best result per column within each method.

Model	Agents	mAP@0.3 (C-V2X)	mAP@0.5 (C-V2X)	mAP@0.7 (C-V2X)	mAP@0.3 (Unlimited)	mAP@0.5 (Unlimited)	mAP@0.7 (Unlimited)	Sharing Size (MB)	Latency under C-V2X (ms)
V2VNet	V+I	0.36	0.22	0.10	0.45	0.30	0.14	10.0	64,349
	V+V	0.27	0.16	0.06	0.38	0.28	0.13	10.0	64,038
	V+V+I	0.33	0.20	0.10	0.48	0.39	0.19	20.0	76,879
	V+2V	0.27	0.16	0.06	0.50	0.50	0.26	20.0	72,773
	V+2V+I	0.33	0.19	0.10	0.63	0.56	0.29	30.0	107,432
V2X-ViT	V+I	0.36	0.24	0.11	0.38	0.24	0.11	0.338	2,211
	V+V	0.28	0.23	0.10	0.34	0.28	0.12	0.338	1,305
	V+V+I	0.33	0.28	0.11	0.40	0.33	0.14	0.667	9,215
	V+2V	0.39	0.31	0.13	0.52	0.41	0.17	0.667	7,022
	V+2V+I	0.37	0.30	0.12	0.53	0.42	0.16	1.014	5,976
V2VAM	V+I	0.37	0.24	0.13	0.40	0.27	0.14	0.423	2,264
	V+V	0.36	0.28	0.13	0.44	0.37	0.18	0.423	2,253
	V+V+I	0.36	0.28	0.13	0.46	0.38	0.19	0.846	10,301
	V+2V	0.43	0.35	0.18	0.63	0.57	0.33	0.846	8,117
	V+2V+I	0.42	0.32	0.17	0.63	0.56	0.34	1.269	11,302
CoBEVT	V+I	0.33	0.23	0.12	0.37	0.26	0.14	0.500	2,707
	V+V	0.32	0.24	0.13	0.40	0.33	0.18	0.500	2,393
	V+V+I	0.34	0.27	0.14	0.44	0.38	0.21	1.000	11,059
	V+2V	0.37	0.31	0.17	0.56	0.51	0.32	1.000	8,987
	V+2V+I	0.36	0.30	0.17	0.57	0.53	0.33	1.500	12,381
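
One way to read the last two columns together is to divide sharing size by latency, which gives the effective throughput implied by each row. The sketch below does this for the V+I rows copied from the table. It assumes 1 MB = 10^6 bytes (the source does not say which convention is used), and the function name `implied_throughput_mbps` is illustrative only.

```python
def implied_throughput_mbps(size_mb: float, latency_ms: float) -> float:
    """Effective throughput in Mbit/s implied by a sharing size and a latency.

    Assumes 1 MB = 1e6 bytes; the source does not state the convention.
    """
    bits = size_mb * 1e6 * 8
    seconds = latency_ms / 1000.0
    return bits / seconds / 1e6


# (model, agents, sharing size in MB, latency in ms), copied from the table above.
rows = [
    ("V2VNet", "V+I", 10.0, 64349),
    ("V2X-ViT", "V+I", 0.338, 2211),
    ("V2VAM", "V+I", 0.423, 2264),
    ("CoBEVT", "V+I", 0.500, 2707),
]

for model, agents, size_mb, latency_ms in rows:
    rate = implied_throughput_mbps(size_mb, latency_ms)
    print(f"{model:8s} {agents:4s} {rate:.2f} Mbit/s")
```

Under that assumption, these four rows all imply roughly 1.2 to 1.5 Mbit/s, which is why the 10 MB V2VNet payload takes about a minute while the sub-megabyte feature maps take a few seconds.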

Multi-Modality Evaluation (BEVFusion Backbone)

Impact of sensing modalities under different cooperative agent configurations, built on the BEVFusion framework with LiDAR-only and LiDAR+Camera configurations.

Agent	Modality	BEV mAP@0.3	BEV mAP@0.5	BEV mAP@0.7	3D mAP@0.3	3D mAP@0.5	3D mAP@0.7
V	LiDAR	0.63	0.46	0.29	0.58	0.36	0.17
V	LiDAR+Camera	0.64	0.46	0.30	0.58	0.36	0.18
V+V	LiDAR	0.68	0.57	0.41	0.62	0.50	0.19
V+V	LiDAR+Camera	0.72	0.58	0.43	0.64	0.53	0.22
V+2V	LiDAR	0.78	0.69	0.51	0.75	0.61	0.20
V+2V	LiDAR+Camera	0.79	0.70	0.54	0.77	0.64	0.32

Multi-modal feature fusion (LiDAR+Camera) matches or improves on LiDAR-only performance in every column, with the largest gain in the V+2V configuration (3D mAP@0.7 rises from 0.20 to 0.32).
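
The per-column effect of adding the camera can be checked directly from the table. The sketch below copies the reported values and subtracts LiDAR-only from LiDAR+Camera; the names `RESULTS` and `fusion_gain` are hypothetical helpers for illustration.

```python
# Column order of the values below, matching the table above.
COLUMNS = ["BEV@0.3", "BEV@0.5", "BEV@0.7", "3D@0.3", "3D@0.5", "3D@0.7"]

# mAP values copied from the table above.
RESULTS = {
    "V": {
        "LiDAR": [0.63, 0.46, 0.29, 0.58, 0.36, 0.17],
        "LiDAR+Camera": [0.64, 0.46, 0.30, 0.58, 0.36, 0.18],
    },
    "V+V": {
        "LiDAR": [0.68, 0.57, 0.41, 0.62, 0.50, 0.19],
        "LiDAR+Camera": [0.72, 0.58, 0.43, 0.64, 0.53, 0.22],
    },
    "V+2V": {
        "LiDAR": [0.78, 0.69, 0.51, 0.75, 0.61, 0.20],
        "LiDAR+Camera": [0.79, 0.70, 0.54, 0.77, 0.64, 0.32],
    },
}


def fusion_gain(agent: str) -> dict:
    """mAP gain from adding the camera, per column, rounded to two decimals."""
    lidar = RESULTS[agent]["LiDAR"]
    fused = RESULTS[agent]["LiDAR+Camera"]
    return {col: round(f - l, 2) for col, l, f in zip(COLUMNS, lidar, fused)}


for agent in RESULTS:
    print(agent, fusion_gain(agent))
```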

Cooperative Motion Prediction (CMP)

Motion prediction performance using the perception-prediction (P&P) setting with the CMP framework.

Model	Agent	minADE@1s ↓	minADE@3s ↓	minADE@5s ↓	minFDE@1s ↓	minFDE@3s ↓	minFDE@5s ↓
V2VNet	V	0.4895	2.1648	4.4538	0.9307	5.7344	10.6390
	V+V	0.5424	1.5611	3.5591	0.7528	4.4820	9.1667
	V+2V	0.7610	1.3796	2.9096	1.0079	3.4762	7.3660
CMP	V	0.3851	0.8421	1.4022	0.5786	1.6125	3.0712
	V+V	0.4076	0.9330	1.6416	0.6143	1.8492	3.7389
	V+2V	0.3214	0.7252	1.2893	0.4723	1.4551	2.9716

Short-term predictions (1s) achieve strong accuracy, while long-horizon reasoning (5s) remains an open research challenge.
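
For readers unfamiliar with the metrics in the table, the sketch below computes minADE and minFDE in their commonly used form: the average and the final L2 displacement of the best of K predicted trajectories. This is a generic definition, not the CMP evaluation code. Whether the benchmark uses exactly this variant (for example, how the best mode is selected) is an assumption, and the function names are illustrative.

```python
import math


def displacement_errors(pred, gt):
    """Per-timestep L2 distance between a predicted and a ground-truth trajectory."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def min_ade_fde(modes, gt):
    """Return (minADE, minFDE) over K candidate trajectories.

    modes: list of K trajectories, each a list of (x, y) points.
    gt:    ground-truth trajectory with the same number of points.
    Each minimum is taken independently over the K modes (an assumption;
    some benchmarks report both values for a single selected mode).
    """
    ades, fdes = [], []
    for pred in modes:
        errs = displacement_errors(pred, gt)
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    return min(ades), min(fdes)


# Toy example with two candidate modes and three timesteps.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the ground truth
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # stays close to the ground truth
]
min_ade, min_fde = min_ade_fde(modes, gt)
print(f"minADE={min_ade:.4f} minFDE={min_fde:.4f}")
```

In the table, the @1s, @3s, and @5s columns correspond to evaluating such errors over prediction horizons of increasing length, which is why the values grow from left to right.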

Download

CooperScene is hosted on data.ucr.edu. We provide a mini set for quick exploration and the full dataset in manageable chunks.

Mini Set

A curated mini set of scenes for quick exploration, with multi-agent LiDAR, camera, and 3D annotations.

Download Mini Set

~2.9 GB

Full Dataset

Complete CooperScene dataset split into downloadable chunks. Includes train/val/test splits with all modalities.

Download Full Dataset

~94.3 GB (in chunks)

MCAP Visualizations

Pre-packaged MCAP files for Foxglove Studio. Browse data interactively without processing raw files.

Download MCAP Files

~90.9 GB
