Humans learn efficiently through interactions with the 3D world and the integration of multi-modal information, such as verbal guidance or instructions. However, despite considerable advances in linguistic understanding and vision-language integration, current machine learning models still struggle to accurately perceive and reason within real-world 3D environments, largely due to their lack of 3D situational reasoning capabilities. In this work, we address this gap with explicit situation modeling and situation-guided re-encoding of 3D visual tokens, grounding vision-language reasoning in the agent's position and orientation within the scene.
The ability to carry out complex vision-language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. Unlike machine learning models, humans situate themselves within the 3D world and perceive and interact with the surrounding environment from an egocentric perspective. Several existing methods recognize the lack of positional understanding in 3D and propose new benchmarks and joint optimization objectives (Ma et al.) or positional embedding methods (Hong et al.) to improve overall reasoning performance.
However, the lack of explicit situation modeling and a situation-grounded 3D reasoning method prevents them from obtaining a generalizable and consistent 3D vision-language (VL) representation, as shown in the figure above.
We achieve leading results on the ScanQA and SQA3D datasets, and our method significantly outperforms the state of the art on both localization and orientation estimation.
For more details, please refer to our paper.
In the figure above, we visualize the activation changes in 3D visual tokens before and after our situation-guided visual re-encoding. The visualization uses the viridis colormap, where a brighter token indicates a higher activation value. The effectiveness of situational guidance in amplifying the relevance of crucial tokens is evident from this depiction.
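For readers who want to produce a similar heatmap, the snippet below is a minimal sketch of how per-token activations can be mapped to viridis colors. It is illustrative only and not our exact plotting code; the input tensor shape and the helper name `activation_colors` are assumptions.

```python
import numpy as np
import matplotlib.cm as cm

def activation_colors(token_feats: np.ndarray) -> np.ndarray:
    """Map per-token activation magnitudes to viridis RGB colors.

    token_feats: (N, C) array of 3D visual token features (hypothetical input).
    Returns an (N, 3) array of RGB colors; brighter means higher activation.
    """
    # Use the L2 norm of each token feature as its activation value.
    act = np.linalg.norm(token_feats, axis=-1)
    # Normalize to [0, 1] so the colormap spans its full range.
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    # viridis maps low values to dark purple and high values to bright yellow.
    return cm.viridis(act)[:, :3]

# Example: color the same tokens before and after re-encoding to compare attention shifts.
# colors_before = activation_colors(feats_before)  # feats_before: (N, C) tokens pre re-encoding
# colors_after = activation_colors(feats_after)    # feats_after: (N, C) tokens post re-encoding
```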
For instance, the second row reveals a notable shift in focus: initially, the tokens concentrate predominantly on the bed area, but after re-encoding, attention shifts toward regions closely aligned with the situation vector and directly related to the query. Similarly, in the third row, the situational re-encoding places increased emphasis on the window region "on the left." In the fourth row, attention initially focuses on the vanity region and then shifts to the toilet on the left of the agent, as suggested by the situation vector and the question prompt. These examples demonstrate how our method, by harnessing enhanced situational awareness, improves performance on downstream reasoning tasks in an interpretable manner. The ability of the model to dynamically adjust its focus in response to situational cues is a key factor in its enhanced reasoning capabilities.
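To make the idea of situation-guided re-encoding concrete, here is a minimal sketch of one way such a step could be implemented: a cross-attention layer in which 3D visual tokens attend to a situation embedding derived from the agent's position and orientation. This is an illustrative assumption, not our exact architecture; the module name, the 7-dimensional situation encoding, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class SituationGuidedReEncoder(nn.Module):
    """Illustrative sketch: re-encode 3D visual tokens conditioned on a situation vector."""

    def __init__(self, token_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Assumed situation encoding: 3D position + orientation quaternion -> 7 dims.
        self.situation_mlp = nn.Sequential(
            nn.Linear(7, token_dim), nn.ReLU(), nn.Linear(token_dim, token_dim)
        )
        # Cross-attention: visual tokens query the situation embedding.
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, visual_tokens: torch.Tensor, situation: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) 3D visual tokens; situation: (B, 7) position + quaternion.
        sit = self.situation_mlp(situation).unsqueeze(1)        # (B, 1, D)
        attended, _ = self.cross_attn(visual_tokens, sit, sit)  # tokens attend to the situation
        return self.norm(visual_tokens + attended)              # residual re-encoding

# Usage sketch:
# tokens = SituationGuidedReEncoder()(tokens, situation_vec)
```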
For more visualizations and examples of our method, please refer to our paper.
If you find our code and paper helpful, please consider citing our work:
@inproceedings{man2024situation,
title={Situational Awareness Matters in 3D Vision Language Reasoning},
author={Man, Yunze and Gui, Liang-Yan and Wang, Yu-Xiong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}