Humans learn efficiently through interactions with the 3D world and the integration of multi-modal information, such as verbal guidance or instructions. However, despite considerable advances in linguistic understanding and vision-language integration, current machine learning models still struggle to accurately perceive and reason within real-world 3D environments, largely due to their lack of 3D situational reasoning capabilities. In this work, we address this gap with explicit situation modeling and situation-guided re-encoding of 3D visual tokens, grounding vision-language reasoning in the agent's position and orientation within the scene.
The ability to carry out complex vision-language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. Unlike machine learning models, humans situate themselves within the 3D world and perceive and interact with the surrounding environment from an egocentric perspective. Several existing methods recognize the lack of positional understanding in 3D and propose new benchmarks and joint optimization objectives (Ma et al.) or positional embedding methods (Hong et al.) to improve overall reasoning performance.
However, the lack of explicit situation modeling and a situation-grounded 3D reasoning method prevents them from obtaining a generalizable and consistent 3D vision-language (VL) representation, as shown in the figure above.
We achieve leading results on the ScanQA and SQA3D datasets, and our method significantly outperforms the state of the art on both localization and orientation estimation.
For more details, please refer to our paper.
In the figure above, we visualize the activation changes in 3D visual tokens before and after our situation-guided visual re-encoding. The visualization uses the viridis colormap, where a brighter token indicates a higher activation value. The effectiveness of situational guidance in amplifying the relevance of crucial tokens is evident from this depiction.
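For readers who want to produce a similar heatmap, the snippet below is a minimal sketch of how per-token activations can be mapped to viridis colors. It is illustrative only and not our exact plotting code; the input tensor shape and the helper name `activation_colors` are assumptions.

```python
import numpy as np
import matplotlib.cm as cm

def activation_colors(token_feats: np.ndarray) -> np.ndarray:
    """Map per-token activation magnitudes to viridis RGB colors.

    token_feats: (N, C) array of 3D visual token features (hypothetical input).
    Returns an (N, 3) array of RGB colors; brighter means higher activation.
    """
    # Use the L2 norm of each token feature as its activation value.
    act = np.linalg.norm(token_feats, axis=-1)
    # Normalize to [0, 1] so the colormap spans its full range.
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    # viridis maps low values to dark purple and high values to bright yellow.
    return cm.viridis(act)[:, :3]

# Example: color the same tokens before and after re-encoding to compare attention shifts.
# colors_before = activation_colors(feats_before)  # feats_before: (N, C) tokens pre re-encoding
# colors_after = activation_colors(feats_after)    # feats_after: (N, C) tokens post re-encoding
```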
For instance, the second row reveals a notable shift in focus: initially, the tokens concentrate predominantly on the bed area, but after re-encoding, attention shifts toward regions closely aligned with the situation vector and directly related to the query. Similarly, in the third row, the situational re-encoding places increased emphasis on the window region "on the left." In the fourth row, attention initially focuses on the vanity region and then shifts to the toilet on the left of the agent, as suggested by the situation vector and the question prompt. These examples demonstrate how our method, by harnessing enhanced situational awareness, improves performance on downstream reasoning tasks in an interpretable manner. The ability of the model to dynamically adjust its focus in response to situational cues is a key factor in its enhanced reasoning capabilities.
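To make the idea of situation-guided re-encoding concrete, here is a minimal sketch of one way such a step could be implemented: a cross-attention layer in which 3D visual tokens attend to a situation embedding derived from the agent's position and orientation. This is an illustrative assumption, not our exact architecture; the module name, the 7-dimensional situation encoding, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class SituationGuidedReEncoder(nn.Module):
    """Illustrative sketch: re-encode 3D visual tokens conditioned on a situation vector."""

    def __init__(self, token_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Assumed situation encoding: 3D position + orientation quaternion -> 7 dims.
        self.situation_mlp = nn.Sequential(
            nn.Linear(7, token_dim), nn.ReLU(), nn.Linear(token_dim, token_dim)
        )
        # Cross-attention: visual tokens query the situation embedding.
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, visual_tokens: torch.Tensor, situation: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) 3D visual tokens; situation: (B, 7) position + quaternion.
        sit = self.situation_mlp(situation).unsqueeze(1)        # (B, 1, D)
        attended, _ = self.cross_attn(visual_tokens, sit, sit)  # tokens attend to the situation
        return self.norm(visual_tokens + attended)              # residual re-encoding

# Usage sketch:
# tokens = SituationGuidedReEncoder()(tokens, situation_vec)
```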
For more visualizations and examples of our method, please refer to our paper.
If you find our code and paper helpful, please consider citing our work:
@inproceedings{man2024situation,
title={Situational Awareness Matters in 3D Vision Language Reasoning},
author={Man, Yunze and Gui, Liang-Yan and Wang, Yu-Xiong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}