Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

Stanford University

Image diffusion models encode 3D world knowledge in their latent space, which our method, ViewNeTI, leverages to do novel view synthesis from few input views.

ViewNeTI pull figure and sample novel view synthesis results.

Abstract

Text-to-image diffusion models understand spatial relationships between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in images generated by frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint.

ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view NVS. Our single-view NVS predictions have good semantic detail and photorealism compared to prior methods. Our approach is well suited to modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera viewpoint in images generated from user-defined prompts.

3D Understanding in Diffusion Models

Text-to-image diffusion models are trained only on unposed 2D image data, yet they seem able to do 3D reasoning. Here, we ask a Stable Diffusion model to infill the background around a car, and find that it generates 3D-consistent shadows and reflections.

This motivates us to use frozen diffusion models for novel view synthesis (NVS).

Infilling StableDiffusion example
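For readers who want to try this kind of probe themselves, below is a minimal sketch using the Hugging Face diffusers inpainting pipeline. The checkpoint name, prompt, and mask files are illustrative assumptions, not the exact setup used here.

```python
# Minimal sketch of the infilling probe with Hugging Face diffusers. The
# checkpoint, prompt, and file paths are illustrative assumptions.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("car.png").convert("RGB")            # photo of the car to keep
mask = Image.open("background_mask.png").convert("L")   # white = background region to infill

result = pipe(
    prompt="a car parked on a city street",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("infilled.png")  # inspect whether shadows and reflections match the car's geometry
```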

Viewpoint Neural Textual Inversion (ViewNeTI) for novel view synthesis

ViewNeTI system figure.

We adapt textual inversion to do novel view synthesis. Given a multi-view dataset, we generate captions of the form "$S_{\mathbf{R}_i}$. A photo of a $S_o$". Here, $S_{\mathbf{R}_i}$ is a token for camera view $\mathbf{R}_i$ and $S_o$ is a token for the object (as in standard textual inversion).

$S_{\mathbf{R}_i}$ and $S_o$ each have a small neural mapper, $ℳ_v$ and $ℳ_o$ respectively (red in the figure), that predicts a point in the text space of the frozen CLIP text encoder. That text encoding then conditions the diffusion model to make an image of the object $S_o$ in pose $\mathbf{R}_i$.

Our experiments do novel view synthesis with a frozen Stable Diffusion 2.1 model on the DTU multi-view dataset.
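As a rough illustration, here is a minimal sketch of what a view mapper $ℳ_v$ can look like: a small MLP that maps camera parameters (and, following NeTI, the diffusion timestep and U-Net layer index) to a vector in the text space of the frozen encoder. The camera parameterization, hidden width, and output size are assumptions, not the exact design used by ViewNeTI.

```python
# Sketch of a view mapper M_v: camera parameters (plus timestep and U-Net layer
# index, following NeTI) -> a vector in the frozen CLIP text-embedding space,
# which stands in for the S_R token. Sizes and parameterization are assumptions.
import torch
import torch.nn as nn


class ViewMapper(nn.Module):
    def __init__(self, cam_dim: int = 12, text_dim: int = 1024, hidden: int = 128):
        super().__init__()
        # cam_dim: flattened camera pose, e.g. a 3x4 world-to-camera matrix.
        # text_dim: width of the frozen text encoder (1024 for SD 2.1's OpenCLIP).
        self.net = nn.Sequential(
            nn.Linear(cam_dim + 2, hidden),  # +2 for timestep and layer index
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, text_dim),
        )

    def forward(self, cam: torch.Tensor, t: torch.Tensor, layer: torch.Tensor) -> torch.Tensor:
        # cam: (B, cam_dim), t: (B, 1) normalized timestep, layer: (B, 1) layer index.
        return self.net(torch.cat([cam, t, layer], dim=-1))


# The predicted vector conditions the frozen diffusion model in place of the word
# embedding for the view token S_R (an object mapper M_o plays the analogous role for S_o).
m_v = ViewMapper()
view_token = m_v(torch.randn(1, 12), torch.rand(1, 1), torch.zeros(1, 1))
```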

Novel View Synthesis Optimized on One Scene

We can train ViewNeTI's neural mappers, $ℳ_v$ and $ℳ_o$, on a single scene. Here, there are six input views (the blue markers). We can generate novel interpolated views (the green markers), but we cannot generate novel extrapolated views (the red markers).

To enable extrapolation, we pretrain $ℳ_v$ on a small multi-view dataset (fewer than 100 scenes). The pretraining dataset contains object classes different from those in our test scenes.

Single scene novel view synthesis
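A minimal sketch of the per-scene optimization described above, assuming the ViewMapper sketch from earlier and the standard diffusion training objective: only the mappers receive gradients, while the Stable Diffusion 2.1 VAE, U-Net, and text encoder stay frozen. The data loader (sample_posed_view) and the prompt-embedding helper (encode_prompt_with_mappers) are hypothetical placeholders, not part of any released API.

```python
# Sketch of single-scene optimization: train only M_v and M_o with the usual
# denoising loss; the diffusion model stays frozen. `sample_posed_view` and
# `encode_prompt_with_mappers` are hypothetical helpers.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

repo = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").requires_grad_(False)
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").requires_grad_(False)
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

m_v = ViewMapper()                       # view mapper from the sketch above
m_o = ViewMapper(cam_dim=0)              # object mapper: conditioned only on (t, layer)
optimizer = torch.optim.AdamW(list(m_v.parameters()) + list(m_o.parameters()), lr=1e-3)

for step in range(3000):
    image, cam = sample_posed_view()     # hypothetical: one posed, preprocessed input view
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
    noisy = scheduler.add_noise(latents, noise, t)

    # Hypothetical helper: embeds "S_R. A photo of a S_o" with the frozen CLIP
    # text encoder, injecting m_v(cam, ...) and m_o(...) at the S_R / S_o slots.
    cond = encode_prompt_with_mappers(m_v, m_o, cam, t)

    target = noise
    if scheduler.config.prediction_type == "v_prediction":  # SD 2.1 uses v-prediction
        target = scheduler.get_velocity(latents, noise, t)
    loss = F.mse_loss(unet(noisy, t, encoder_hidden_states=cond).sample, target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```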

Single-view novel view synthesis

We do novel view synthesis from one input view by taking the pretrained $ℳ_v$ and training a new object mapper, $ℳ_o$. Compared to NeRF-based baselines, ViewNeTI's generations are photorealistic, preserve good semantic detail, and give reasonable completions under ambiguity.

Baseline results comparisons.
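Concretely, and under the same assumptions as the sketches above, the only change from the per-scene loop is which parameters receive gradients:

```python
# Sketch: single-view NVS reuses the pretrained view mapper and optimizes only
# a fresh object mapper on the single input view (same denoising-loss loop as above).
m_v.requires_grad_(False)                 # pretrained view mapper stays frozen
m_o = ViewMapper(cam_dim=0)               # new object mapper for this scene
optimizer = torch.optim.AdamW(m_o.parameters(), lr=1e-3)
# ...run the training loop on the one posed view, then render novel views by
# conditioning on unseen camera parameters at inference time.
```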

View-control in text-to-image generation

We can take the pretrained view mapper, $ℳ_v$, and add its view control token to novel text prompts. This enables controlling the 3D viewpoint in text-to-image content creation.

View controllable text generation
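A sketch of how this could look with diffusers, assuming a hypothetical helper embed_prompt_with_view_token that tokenizes "$S_{\mathbf{R}}$. <prompt>", swaps the $S_{\mathbf{R}}$ slot for the pretrained mapper's output, and runs the frozen CLIP text encoder; the camera sampler and prompt are also illustrative.

```python
# Sketch: view-controlled text-to-image generation with the pretrained view mapper.
# `sample_camera_pose` and `embed_prompt_with_view_token` are hypothetical helpers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1").to("cuda")

cam = sample_camera_pose()  # hypothetical: a pose from the pretraining camera distribution
prompt_embeds = embed_prompt_with_view_token(
    pipe, m_v, cam, "a photo of a vintage teapot"  # m_v: the pretrained view mapper
)

# Passing precomputed embeddings lets the view token steer the generated viewpoint.
image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=50).images[0]
image.save("teapot_from_chosen_viewpoint.png")
```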

Related Links

DreamSparse is concurrent work on novel view synthesis from two or more views. Like us, they learn to condition a frozen diffusion model, which lets them exploit the prior knowledge in its massive pretraining dataset.

Our work builds on textual inversion, which was developed for diffusion model personalization, and adapts it to 3D novel view synthesis. We use ideas and code from the recent Neural Textual Inversion (NeTI) model.

BibTeX

@article{burgess2023viewneti,
  author    = {Burgess, James and Wang, Kuan-Chieh and Yeung, Serena},
  title     = {Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models},
  journal   = {arXiv preprint arXiv:2309.07986},
  year      = {2023},
}