Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding


University of Illinois Urbana-Champaign · Carnegie Mellon University
Figure: Evaluation settings (left) and major results (right) of different vision foundation models for complex 3D scene understanding. We probe visual foundation models of different input modalities and pretraining objectives, assessing their performance on multi-modal scene reasoning, grounding, segmentation, and registration tasks.
Figure: Details of the seven evaluated visual encoding models, including their input modalities, pretraining objectives, architectures, and the training datasets used.

Overview

Complex 3D scene understanding has gained increasing attention, with the scene encoding strategy playing a crucial role in its success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We assess these models across four tasks: Vision-Language Scene Reasoning, Visual Grounding, Semantic Segmentation, and Registration, each focusing on a different aspect of scene understanding. We observe the following key findings:

  • Image or video foundation models achieve promising results for 3D scene understanding. Among them, DINOv2 demonstrates the best overall performance, showing strong generalizability and flexibility, consistent with observations in 2D. Our evaluation further verifies its capability in global and object-level 3D vision-language tasks. It can serve as a general backbone for 3D scene understanding.
  • Video models, benefiting from temporally continuous input frames, excel in object-level and geometric understanding tasks by distinguishing between instances of the same semantic class in a scene.
  • Visual encoders pretrained with language guidance DO NOT necessarily perform well on language-related evaluation tasks, challenging the common practice of using such models as default encoders for vision-language reasoning tasks.
  • Generative pretrained models, beyond their well-known semantic capacity, also excel at geometric understanding, offering new possibilities for scene understanding.

A Unified Probing Framework

Figure: Our unified probing framework to evaluate visual encoding models on various tasks.

We design a unified framework, as shown in the figure above, to extract features from different foundation models, construct 3D feature embeddings as scene representations, and evaluate them on multiple downstream tasks. Existing work usually represents a complex indoor scene with a combination of 2D and 3D modalities. Given a complex scene represented as posed images, videos, and 3D point clouds, we extract feature embeddings with a collection of vision foundation models. For image- and video-based models, we lift their features into 3D space with a multi-view 3D projection module for the subsequent 3D scene evaluation tasks, as sketched below.
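As a minimal sketch of the multi-view 3D projection step described above, the snippet below back-projects per-pixel 2D features from each posed view onto the scene point cloud and averages them across views. The function name, argument layout, and the depth-based visibility check are illustrative assumptions, not the exact released implementation.

import numpy as np

def lift_multiview_features(points, feats_2d, intrinsics, extrinsics, depths, depth_tol=0.05):
    """Aggregate per-pixel 2D features onto 3D points by averaging over views.

    points:     (N, 3) scene points in world coordinates
    feats_2d:   list of (H, W, C) per-view feature maps (upsampled to image resolution)
    intrinsics: list of (3, 3) camera intrinsic matrices
    extrinsics: list of (4, 4) world-to-camera matrices
    depths:     list of (H, W) depth maps used for the visibility check
    """
    N, C = points.shape[0], feats_2d[0].shape[-1]
    feat_sum = np.zeros((N, C), dtype=np.float32)
    view_cnt = np.zeros((N, 1), dtype=np.float32)

    for feat, K, T, depth in zip(feats_2d, intrinsics, extrinsics, depths):
        H, W, _ = feat.shape
        # World -> camera coordinates.
        pts_h = np.concatenate([points, np.ones((N, 1))], axis=1)  # (N, 4)
        cam = (T @ pts_h.T).T[:, :3]                               # (N, 3)
        z = cam[:, 2]
        # Camera -> pixel coordinates.
        uv = (K @ cam.T).T
        u = uv[:, 0] / np.clip(z, 1e-6, None)
        v = uv[:, 1] / np.clip(z, 1e-6, None)
        ui, vi = np.round(u).astype(int), np.round(v).astype(int)
        # Keep points in front of the camera, inside the image, and
        # consistent with the rendered depth (occlusion handling).
        valid = (z > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
        idx = np.where(valid)[0]
        visible = np.abs(depth[vi[idx], ui[idx]] - z[idx]) < depth_tol
        idx = idx[visible]
        feat_sum[idx] += feat[vi[idx], ui[idx]]
        view_cnt[idx] += 1.0

    # Point-level scene embedding: per-point mean over the views that observed it.
    return feat_sum / np.clip(view_cnt, 1.0, None)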

We visualize the scene features extracted by the vision foundation models in the figure below.

Figure: Visualizations of extracted scene features from different visual encoders using PCA. The clear distinction between colors and patterns demonstrates the behaviors of different models.

The visualizations reveal several intuitive findings. The image models, DINOv2 and LSeg, demonstrate strong semantic understanding, with LSeg exhibiting clearer discrimination due to its pixel-level language semantic guidance. The diffusion-based models, SD and SVD, in addition to their semantic modeling, excel at preserving the local geometry and textures of the scenes, owing to their generation-guided pretraining. The video models, SVD and V-JEPA, showcase a unique ability to distinguish different instances of the same semantic concept, such as the two trees in the first scene and the chairs in both scenes. The 3D model, Swin3D, also exhibits strong semantic understanding; however, due to limited training data and domain shift, its feature quality is not on par with that of the image foundation models, despite being pretrained on perfect semantic annotations.
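The PCA visualizations above follow a common recipe: project each scene's per-point features onto their first three principal components and rescale the components to [0, 1] so they can be rendered as RGB colors. The sketch below illustrates that practice; the exact normalization used for the paper's figures may differ.

import numpy as np
from sklearn.decomposition import PCA

def features_to_rgb(point_features):
    """Map high-dimensional per-point features to RGB colors via PCA.

    point_features: (N, C) array of lifted scene features for one scene.
    Returns an (N, 3) array of colors in [0, 1] for point cloud visualization.
    """
    comps = PCA(n_components=3).fit_transform(point_features)  # (N, 3)
    # Normalize each principal component to [0, 1] so it can serve as a color channel.
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    return (comps - lo) / np.maximum(hi - lo, 1e-8)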

Complexity Analysis

Figure: (Left) Complexity analysis of visual foundation models. (Right) Memory usage of different encoders. An ideal model should appear as a small circle positioned toward the upper left.

We compare memory usage, computation time, and model performance (vision-language reasoning on ScanQA) in the figure above. Our findings show that image encoders generally require less time to process a single sample than video and 3D encoders, while diffusion-based models, when used for feature extraction, demand significantly more memory than the discriminative models. Notably, the drawback in running time becomes evident for 2D backbones, especially image encoders, when obtaining a scene embedding by aggregating multi-view image embeddings; in contrast, a 3D point encoder requires significantly less time to process a scene. Nevertheless, 3D encoders exhibit relatively poor performance, which can be attributed to the scarcity of training data. To fully realize their potential in scene understanding tasks, efforts should be directed toward improving the generalizability of 3D foundation models. All analyses and computations were conducted on an NVIDIA A100 GPU.
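For reference, per-sample running time and peak GPU memory of an encoder can be measured with a small profiling helper like the one below. This is a minimal sketch assuming a PyTorch encoder on a CUDA device; `encoder` and `sample` are placeholders, and the paper's exact profiling protocol may differ.

import time
import torch

@torch.no_grad()
def profile_encoder(encoder, sample, device="cuda"):
    """Measure wall-clock time and peak GPU memory for one feature-extraction pass."""
    encoder = encoder.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    _ = encoder(sample.to(device))          # sample: an input tensor for this encoder
    torch.cuda.synchronize(device)          # wait for all kernels before stopping the clock
    elapsed = time.perf_counter() - start
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return elapsed, peak_mem_gb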

For more visualizations, analyses, and examples of our method, please refer to our paper.

BibTeX

If you find our code and paper helpful, please consider citing our work:

@article{man2024lexicon3d,
  title={Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding},
  author={Man, Yunze and Zheng, Shuhong and Bao, Zhipeng and Hebert, Martial and Gui, Liang-Yan and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2409.03757},
  year={2024}
}