Master's Thesis - Compositional Zero-shot Point Cloud Classification with Vision-Language Model (VLM) Embeddings

Idil Sulo

Existing 3D recognition methods demonstrate a remarkable ability to generalize part knowledge from seen to unseen object classes for semantic segmentation. However, it remains unexplored whether vision-language models such as Contrastive Language-Image Pre-training (CLIP) can provide additional cues for compositional zero-shot recognition in 3D. We present a method to derive multimodal embeddings that enable 2D-to-3D zero-shot transfer and aggregate view-wise CLIP embeddings directly on the point cloud. Since CLIP operates on RGB images, these must be rendered from the point clouds; we therefore provide photo-realistic renderings of the Compositional-PartNet (C-PartNet) dataset that include additional information such as texture and color. We show that photo-realistic multi-view shape renders are more effective with CLIP than texture-less depth map projections. We surpass PointCLIP by +44.7% in top-1 classification accuracy on unseen object classes of C-PartNet. Our method yields promising results against compositional zero-shot segmentation baselines without relying on 3D labeled data, improving mIoU by +30.4% on Scissors and +14.8% on Door, two object classes that closed-set methods such as DCC struggle to generalize to.
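
The sketch below is a rough, simplified illustration of the zero-shot transfer idea summarized above, not the thesis implementation: it encodes pre-rendered multi-view RGB images of a single shape with the openai `clip` package, mean-pools the view embeddings into one shape embedding (standing in for the thesis's point-level aggregation), and classifies by cosine similarity against text embeddings of hypothetical class prompts. All file paths, class names, and the prompt template are illustrative assumptions.

```python
# Minimal sketch, assuming: pre-rendered multi-view RGB images of one shape,
# the openai "clip" package, and hypothetical class names / prompts.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical paths to photo-realistic renders of a single shape.
view_paths = [f"renders/shape_0001/view_{i:02d}.png" for i in range(10)]

# Hypothetical prompts for unseen object classes (zero-shot setting).
class_names = ["scissors", "door", "lamp", "chair"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    # Encode each rendered view and L2-normalize the embeddings.
    images = torch.stack([preprocess(Image.open(p)) for p in view_paths]).to(device)
    view_emb = model.encode_image(images)
    view_emb = view_emb / view_emb.norm(dim=-1, keepdim=True)

    # Aggregate view-wise embeddings into one shape embedding (mean pooling
    # here; the thesis aggregates view-wise embeddings on the point cloud).
    shape_emb = view_emb.mean(dim=0, keepdim=True)
    shape_emb = shape_emb / shape_emb.norm(dim=-1, keepdim=True)

    # Zero-shot classification: cosine similarity against text embeddings.
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * shape_emb @ text_emb.T
    pred = class_names[logits.argmax(dim=-1).item()]

print(f"Predicted class: {pred}")
```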