Object-level Scene Deocclusion

A foundation model for category-agnostic object deocclusion
SIGGRAPH 2024
Zhengzhe Liu1, Qing Liu2, Chirui Chang3, Jianming Zhang2, Daniil Pakhomov2,
Haitian Zheng2, Zhe Lin2, Daniel Cohen-Or4, Chi-Wing Fu1
1The Chinese University of Hong Kong 2Adobe 3The University of Hong Kong 4Tel-Aviv University


Abstract

Deoccluding the hidden portions of objects in a scene is a formidable task, particularly for real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, as a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and then the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from the partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset of 500k samples to enable self-supervised learning, avoiding the tedious annotation of amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining the deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories not covered by the training set. Further, we demonstrate the applicability of PACO to single-view 3D scene reconstruction and object recomposition.

Method

Overview of our PACO framework. (a) In the first training stage, we train the Parallel Variational Autoencoder {E_1, D_1}: the encoder E_1 learns to encode a stack of complete (full-view) objects {O_i} into the full-view feature map f̂, and the decoder D_1 learns to reconstruct the specific object O_i queried by the partial mask m_i. (b) In the second training stage, we train the Visible-to-Complete Latent Generator to generate the full-view feature map f conditioned on the partial-view feature map f_p encoded from only the segmented visible objects. (c) At inference, we employ the visible-to-complete latent generator to generate the full-view feature map f conditioned on the partial-view feature map f_p encoded from the partial objects, then use D_1 to recover the amodal appearance Õ_i with the partial mask m_i as the query.
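To make the inference flow in (c) concrete, below is a minimal PyTorch-style sketch of the pipeline. This is a conceptual illustration under stated assumptions, not the authors' implementation: the module bodies are placeholder convolutions, the names ParallelEncoder, MaskQueriedDecoder, VisibleToCompleteGenerator, and deocclude are hypothetical, and the generator's single forward pass stands in for the iterative latent diffusion sampling.

import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Stand-in for E_1: maps an image to a feature map (placeholder body)."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Conv2d(3, ch, 3, padding=1)
    def forward(self, image):
        return self.net(image)

class MaskQueriedDecoder(nn.Module):
    """Stand-in for D_1: decodes one object from the full-view feature map,
    using a partial mask m_i as the query."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Conv2d(ch + 1, 3, 3, padding=1)
    def forward(self, f, mask):
        return self.net(torch.cat([f, mask], dim=1))

class VisibleToCompleteGenerator(nn.Module):
    """Stand-in for the latent diffusion model mapping the partial-view
    feature map f_p to the full-view feature map f. A real diffusion model
    would denoise iteratively; a single conv keeps the sketch runnable."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, f_p):
        return self.net(f_p)

def deocclude(image, visible_masks, E1, D1, generator):
    """image: (1, 3, H, W); visible_masks: list of (1, 1, H, W) partial masks m_i."""
    visible = torch.clamp(sum(visible_masks), 0, 1)
    f_p = E1(image * visible)        # encode only the segmented visible objects
    f = generator(f_p)               # predict the full-view feature map
    # Query D_1 once per object: each partial mask m_i recovers its amodal O_i.
    return [D1(f, m_i) for m_i in visible_masks]

# Toy usage with random inputs:
E1, D1, G = ParallelEncoder(), MaskQueriedDecoder(), VisibleToCompleteGenerator()
image = torch.rand(1, 3, 64, 64)
masks = [torch.zeros(1, 1, 64, 64) for _ in range(2)]
masks[0][..., :32, :32] = 1
masks[1][..., 32:, 32:] = 1
amodal_objects = deocclude(image, masks, E1, D1, G)

Note how a single generation step produces one shared full-view feature map, which is then queried once per partial mask; this is what lets PACO deocclude multiple objects in parallel rather than running the generator per object.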

Results

Given an input image (a), our PACO deoccludes objects in it (b), enabling downstream applications including image recomposition (c), single-view 3D scene reconstruction (d), and 3D recomposition (e).

Additional Results

Reference

@inproceedings{liu2023object,
  title={Object-level Scene Deocclusion},
  author={Liu, Zhengzhe and Liu, Qing and Chang, Chirui and Zhang, Jianming and Pakhomov, Daniil and Zheng, Haitian and Lin, Zhe and Cohen-Or, Daniel and Fu, Chi-Wing},
  booktitle={ACM SIGGRAPH 2024 Conference Papers},
  year={2024}
}