Ctrl-V: Higher Fidelity Autonomous Vehicle Video Generation with Bounding-Box Controlled Object Motion

Ge Ya Luo · ZhiHao Luo · Anthony Gosselin · Alexia Jolicoeur-Martineau · Christopher Pal

Abstract

Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, accurately predicting object motion is essential. This paper addresses the key challenge of enabling fine-grained control over object motion in driving video synthesis. To accomplish this, we 1) employ a distinct, specialized model to forecast the trajectories of object bounding boxes, 2) adapt and enhance a separate video diffusion network to create video content conditioned on these high-quality trajectory forecasts, and 3) exert precise control over object positions and movements using bounding boxes in both 2D and 3D spaces. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable video generation. Project page: https://oooolga.github.io/ctrl-v.github.io/
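
To make the two-stage idea concrete, the following is a minimal illustrative sketch of what a bounding-box trajectory turned into a per-frame conditioning signal could look like. It is not the Ctrl-V implementation: the learned trajectory model is replaced here by plain linear interpolation between a first-frame and a last-frame box, and the names `interpolate_boxes`, `rasterize_box`, and `bounding_box_frames` are hypothetical. The box format (pixel-space corner coordinates) and the binary outline rendering are likewise assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def interpolate_boxes(first: Box, last: Box, num_frames: int) -> List[Box]:
    """Linearly interpolate a box between two keyframes.

    Stand-in for a learned trajectory predictor; Ctrl-V itself uses a
    fine-tuned diffusion model for this stage, not interpolation.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    trajectory = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        trajectory.append(
            tuple((1 - alpha) * f + alpha * l for f, l in zip(first, last))
        )
    return trajectory


def rasterize_box(box: Box, height: int, width: int) -> List[List[int]]:
    """Draw the box outline as 1s on a height x width grid of 0s."""
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    # Clamp to the image bounds so partially visible boxes do not fail.
    x0, x1 = max(0, min(width - 1, x0)), max(0, min(width - 1, x1))
    y0, y1 = max(0, min(height - 1, y0)), max(0, min(height - 1, y1))
    frame = [[0] * width for _ in range(height)]
    for x in range(x0, x1 + 1):  # top and bottom edges
        frame[y0][x] = 1
        frame[y1][x] = 1
    for y in range(y0, y1 + 1):  # left and right edges
        frame[y][x0] = 1
        frame[y][x1] = 1
    return frame


def bounding_box_frames(
    first: Box, last: Box, num_frames: int, height: int, width: int
) -> List[List[List[int]]]:
    """Produce one rasterized box frame per time step.

    A video generator could take such frames as conditioning input;
    how the conditioning is consumed is outside this sketch.
    """
    return [
        rasterize_box(box, height, width)
        for box in interpolate_boxes(first, last, num_frames)
    ]


if __name__ == "__main__":
    frames = bounding_box_frames((0, 0, 2, 2), (4, 4, 6, 6), 3, 8, 8)
    for row in frames[1]:
        print("".join("#" if v else "." for v in row))
```

Rendering trajectories as image-like frames is one simple way to hand box positions to an image-based generator; other encodings (for example, coordinate tokens) are equally conceivable and are not ruled out by the abstract.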