Complex 3D scene understanding has gained increasing attention, with the scene encoding strategy playing a crucial role in its success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We assess these models on four tasks: Vision-Language Scene Reasoning, Visual Grounding, Semantic Segmentation, and Registration, each focusing on a different aspect of scene understanding. Our study yields several key findings, which we summarize below.
We design a unified framework, shown in the figure above, to extract features from different foundation models, construct 3D feature embeddings as scene representations, and evaluate them on multiple downstream tasks. Existing work usually represents a complex indoor scene with a combination of 2D and 3D modalities. Given a scene represented as posed images, videos, and 3D point clouds, we extract feature embeddings with a collection of vision foundation models. For image- and video-based models, we project their features into 3D space with a multi-view 3D projection module before the subsequent 3D scene evaluation tasks, as sketched below.
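To make the projection step concrete, here is a minimal sketch of how multi-view 2D features can be lifted onto a scene point cloud. The function name, tensor layouts, and depth-based visibility test are illustrative assumptions for exposition, not the exact module in our codebase.

import torch

def lift_multiview_features(points, feats_2d, depths, intrinsics, extrinsics, depth_tol=0.05):
    # points:     (N, 3)       scene point cloud in world coordinates
    # feats_2d:   (V, C, H, W) per-view feature maps, resized to the image resolution
    # depths:     (V, H, W)    per-view depth maps used for the visibility check
    # intrinsics: (V, 3, 3)    camera intrinsics
    # extrinsics: (V, 4, 4)    world-to-camera transforms
    # returns:    (N, C) view-averaged point features and (N,) per-point view counts
    V, C, H, W = feats_2d.shape
    N = points.shape[0]
    accum = torch.zeros(N, C)
    count = torch.zeros(N)

    homo = torch.cat([points, torch.ones(N, 1)], dim=1)            # (N, 4) homogeneous coords
    for view in range(V):
        cam = (extrinsics[view] @ homo.T).T[:, :3]                 # world -> camera frame
        z = cam[:, 2].clamp(min=1e-6)
        pix = (intrinsics[view] @ cam.T).T                         # camera -> pixel coordinates
        u, v = pix[:, 0] / z, pix[:, 1] / z
        in_view = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

        ui, vi = u.long().clamp(0, W - 1), v.long().clamp(0, H - 1)
        # a point counts as visible only if its depth agrees with the view's depth map
        visible = in_view & ((cam[:, 2] - depths[view, vi, ui]).abs() < depth_tol)

        accum[visible] += feats_2d[view, :, vi[visible], ui[visible]].T
        count[visible] += 1

    return accum / count.clamp(min=1).unsqueeze(1), count

Averaging each point's features over all views that see it yields a per-point scene embedding that the downstream task heads can consume directly.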
We visualize the scene features extracted by the vision foundation models in the figure below.
The visualizations reveal several intuitive findings. The image models, DINOv2 and LSeg, demonstrate strong semantic understanding, with LSeg exhibiting clearer discrimination due to its pixel-level language semantic guidance. The diffusion-based models, SD and SVD, excel not only at semantic modeling but also at preserving the local geometry and textures of the scenes, owing to their generation-guided pretraining. The video models, SVD and V-JEPA, showcase a unique ability to distinguish different instances of the same semantic concept, such as the two trees in the first scene and the chairs in both scenes. The 3D model, Swin3D, also exhibits strong semantic understanding. However, due to limited training data and domain shift, its feature quality is not on par with that of the image foundation models, despite being pretrained with ground-truth semantic annotations.
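As an aside, dense features like these are commonly visualized by projecting them onto three principal components and mapping the result to RGB. The sketch below illustrates this generic recipe; the function name and the percentile-based normalization are our assumptions and may differ from the exact setup behind the figures.

import numpy as np
from sklearn.decomposition import PCA

def features_to_rgb(point_feats):
    # point_feats: (N, C) per-point features from a foundation model
    # returns:     (N, 3) colors in [0, 1] that can be painted onto the point cloud
    proj = PCA(n_components=3).fit_transform(point_feats)          # (N, 3) principal components
    lo = np.percentile(proj, 1, axis=0)                            # robust range to suppress outliers
    hi = np.percentile(proj, 99, axis=0)
    return np.clip((proj - lo) / (hi - lo + 1e-8), 0.0, 1.0)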
We compare the memory usage, computation time, and model performance (vision-language reasoning on ScanQA) in the figure above. Our findings show that image encoders generally require less time to process a single sample than video and 3D encoders, while diffusion-based models, when used for feature extraction, demand significantly more memory than discriminative models. Notably, the drawbacks in running time become evident for 2D backbones, especially image encoders, when a scene embedding is obtained by aggregating multi-view image embeddings. In contrast, a 3D point encoder requires significantly less time to process a scene. Nevertheless, 3D encoders exhibit relatively poor model performance, which can be attributed to the scarcity of training data. To fully realize their potential in scene understanding tasks, efforts should be directed toward improving the generalizability of 3D foundation models. All analyses and computations were conducted on an NVIDIA A100 GPU.
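For reference, below is a minimal sketch of the kind of per-sample profiling behind such a comparison. The helper name, warm-up scheme, and iteration counts are illustrative assumptions; it simply reports averaged wall-clock time and peak GPU memory using standard PyTorch utilities.

import time
import torch

@torch.no_grad()
def profile_encoder(encoder, sample, device="cuda", warmup=3, iters=10):
    # encoder: any callable feature extractor (image, video, or point-cloud model)
    # sample:  one preprocessed input batch for that encoder
    # returns: (seconds per sample, peak GPU memory in GB)
    encoder = encoder.to(device).eval()
    sample = sample.to(device)

    for _ in range(warmup):                          # warm up kernels and caches
        encoder(sample)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats(device)

    start = time.perf_counter()
    for _ in range(iters):
        encoder(sample)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters  # average wall-clock time per sample

    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return elapsed, peak_gb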
For more visualizations, analyses, and examples of our method, please refer to our paper.
If you find our code and paper helpful, please consider citing our work:
@inproceedings{man2024lexicon3d,
  title={Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding},
  author={Man, Yunze and Zheng, Shuhong and Bao, Zhipeng and Hebert, Martial and Gui, Liang-Yan and Wang, Yu-Xiong},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}