CamC2V: Context-aware Controllable Video Generation

¹University of Bonn, ²Lamarr Institute for Machine Learning and Artificial Intelligence
International Conference on 3D Vision 2026

Abstract

Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrade visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamC2V, a context-to-video (C2V) model that integrates multiple image conditions as context with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability.

Problem

Problem Illustration - Limited context from single reference frame

As illustrated in the figure, the initial reference frame alone provides only limited context for the diffusion process. Once the camera pans beyond it, visual quality degrades and the diffusion model's arbitrary interpretations of the scene become evident. To address this, we introduce CamC2V, a novel conditioning mechanism that allows users to supply multiple context views, ensuring a comprehensive definition of the scene in which the video is generated.

Method

Method overview

Image-to-video diffusion models generate videos from a single reference frame and an optional text condition. Camera-controlled diffusion models are additionally conditioned on a camera trajectory, allowing precise control of the camera view at each timestep. However, the reference frame does not always provide the context required by the camera trajectory, which can lead to insufficient visual quality in the generated frames. We therefore propose a new scheme, coined context-to-video, which enriches the generation process with additional context frames and their poses. Our Context-aware Encoder, shown in the method overview figure, extends DynamiCrafter's Dual-stream Image Injection to support multiple image conditions. Natively, DynamiCrafter conditions the model at the pixel level by concatenating reference latents with noisy latents along the channel dimension, which restricts generation to the narrow context provided by the reference image. Additionally, to better guide the diffusion process, semantic features aggregated from CLIP-embedded image and text conditions are injected layer-wise through spatial cross-attention.
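The two conditioning streams described above can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the authors' implementation: all module names, tensor shapes, and the simplified single-head attention are assumptions; the actual model uses DynamiCrafter's architecture.

```python
# Hedged sketch of dual-stream image injection: a pixel-level stream that
# concatenates reference latents with noisy latents along the channel
# dimension, and a semantic stream that injects CLIP-embedded features
# via spatial cross-attention. Names and shapes are illustrative.
import torch
import torch.nn as nn


class SpatialCrossAttention(nn.Module):
    """Layer-wise injection of CLIP-embedded semantic features (simplified, single-head)."""

    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial tokens of a UNet layer
        # context: (B, M, ctx_dim) CLIP tokens aggregated from image/text conditions
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v  # residual semantic injection


# Pixel-level stream: the reference-frame latent (repeated over time) is
# concatenated with the noisy video latents along the channel dimension
# before entering the denoising network.
B, C, T, H, W = 1, 4, 16, 32, 32
noisy_latents = torch.randn(B, C, T, H, W)
ref_latents = torch.randn(B, C, 1, H, W).expand(B, C, T, H, W)
unet_input = torch.cat([noisy_latents, ref_latents], dim=1)  # (B, 2C, T, H, W)
```

The channel concatenation ties generation tightly to the reference pixels, which is exactly the restriction the Context-aware Encoder relaxes by admitting multiple posed context frames.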

Results


Our method is provided with additional posed frames that define a general scene context. As shown in the quantitative results, it outperforms prior work by 24.10% in FVD, indicating the importance of supplying the diffusion model with the missing context. Moreover, we improve CamMC by 11.24% without fine-tuning the camera conditioning, highlighting the improved 3D consistency achieved through the enriched context.

BibTeX

@inproceedings{denninger2026camc2v,
  title={CamC2V: Context-aware Controllable Video Generation},
  author={Luis Denninger and Sina Mokhtarzadeh Azar and Juergen Gall},
  booktitle={International Conference on 3D Vision},
  year={2026},
}