Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

¹University of Illinois at Urbana-Champaign, ²NVIDIA

† Equal supervision, yman@nvidia.com

CVPR 2025

Overview

Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective.

  • Unifying visual chain-of-thought (CoT) reasoning, visual question answering, and visual grounding in a single VLM architecture.
  • A systematic study of re-encoding and re-sampling of image tokens for text-guided understanding and detailed reasoning in high-resolution images.
  • Strong results and improvements on vision-centric reasoning tasks, visually-guided search, and document intelligence.

Visual-CoT Designs

In addition to standard unconditioned visual tokenization, our method incorporates an additional goal-directed visual tokenization procedure. The model grounds the most relevant region of interest (RoI) conditioned on the multimodal input instructions. The visual RoI is then sampled from the input image and fed to the RoI re-engagement module, which extracts a second set of visual tokens that serve as CoT context for reasoning.
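A minimal sketch of this flow is given below, assuming a generic encoder, grounder, RoI re-engagement module, and LLM; all function names, shapes, and stand-in callables are illustrative assumptions rather than the actual Argus implementation.

# Hypothetical sketch of the goal-directed visual CoT flow described above.
# Every name, shape, and callable here is an illustrative stand-in.
import torch

def visual_cot_step(image, instruction, encode, ground, reengage, llm):
    """One round of grounded visual chain-of-thought."""
    global_tokens = encode(image)                 # standard, unconditioned tokens
    roi_box = ground(global_tokens, instruction)  # RoI conditioned on the instruction
    cot_tokens = reengage(image, global_tokens, roi_box)  # second set of visual tokens
    # The LLM reasons over the original tokens, the instruction, and the CoT context.
    return llm(torch.cat([global_tokens, cot_tokens]), instruction)

# Toy stand-ins so the flow above runs end to end (placeholder tensors only).
encode = lambda img: torch.randn(1024, 4096)
ground = lambda toks, txt: (0.2, 0.2, 0.8, 0.8)           # normalized (x1, y1, x2, y2)
reengage = lambda img, toks, box: torch.randn(256, 4096)
llm = lambda toks, txt: f"answer conditioned on {toks.shape[0]} visual tokens"

print(visual_cot_step(torch.randn(3, 448, 448), "Read the small street sign.",
                      encode, ground, reengage, llm))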


Re-encoding vs. Re-sampling for Visual CoT: We compare two designs for visual token re-engagement in the visual CoT pipeline. The re-encoding strategy treats the sampled image RoI as a new image and tokenizes it before appending the resulting tokens to the input sequence, introducing supplementary visual context that guides reasoning through explicit, context-specific signals. The re-sampling strategy reuses visual embeddings from a memory bank, selecting the patch tokens that intersect the RoI boxes as context tokens for re-engagement. The two strategies have distinct advantages and drawbacks, as sketched below.
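The sketch below contrasts the two re-engagement strategies under assumed settings (a 448x448 input, 14x14 patches, and a stand-in encoder that returns one placeholder embedding per patch); it illustrates the mechanics only and does not reproduce the Argus implementation.

# Hedged sketch of the two RoI re-engagement strategies compared above.
# Patch size, resolution, and the stand-in encoder are assumptions for illustration.
import torch

PATCH, IMG = 14, 448
GRID = IMG // PATCH          # 32 x 32 patch grid -> 1024 cached patch tokens

def encoder(image):
    """Stand-in ViT: one placeholder embedding per 14x14 patch."""
    _, h, w = image.shape
    return torch.randn((h // PATCH) * (w // PATCH), 1024)

def reencode(image, box):
    """Re-encoding: crop the RoI and tokenize it as a new image.
    Costs an extra encoder pass, but does not depend on cached feature quality."""
    x1, y1, x2, y2 = [int(v * IMG) for v in box]
    return encoder(image[:, y1:y2, x1:x2])

def resample(cached_tokens, box):
    """Re-sampling: reuse the memory bank of patch embeddings, keeping only the
    patch tokens whose cells intersect the RoI box. No extra encoding pass."""
    x1, y1, x2, y2 = box
    ys, xs = torch.meshgrid(torch.arange(GRID), torch.arange(GRID), indexing="ij")
    keep = ((xs + 1) * PATCH > x1 * IMG) & (xs * PATCH < x2 * IMG) \
         & ((ys + 1) * PATCH > y1 * IMG) & (ys * PATCH < y2 * IMG)
    return cached_tokens[keep.flatten()]

image = torch.randn(3, IMG, IMG)
cached = encoder(image)            # tokens from the initial, unconditioned pass
box = (0.30, 0.40, 0.55, 0.70)     # normalized RoI predicted by the grounder
print(reencode(image, box).shape, resample(cached, box).shape)

Under this framing, re-encoding pays for a second encoder forward pass over the crop, while re-sampling only indexes into features that were already computed, which mirrors the efficiency trade-off summarized in the findings below.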


Findings

  • Visual CoT is important for complex vision-centric reasoning tasks:
    • Incorporating visual CoT reasoning consistently improves performance.
    • Image token re-engagement, an object-centric representation and reasoning pipeline, benefits vision-language alignment and downstream tasks.
  • The base vision foundation model is important: higher-capacity vision encoders consistently yield better performance.
  • Design considerations:
    • There are two major types of RoI samplers: image re-encoding and feature re-sampling.
    • Re-encoding depends less on high-quality initial features than re-sampling.
    • Re-sampling has a major efficiency advantage in visual encoding operations.

Main Results

MLLM Reasoning Tasks

Argus excels in vision-centric multimodal understanding, including V-Star, CV-Bench-2D, CV-Bench-3D, text understanding, and general multimodal VQA tasks, as shown below. At a comparable model size (8B), Argus outperforms state-of-the-art open models.


Referring Object Grounding

Joint training with the referring grounding task is also observed to benefit the multimodal QA task. Argus achieves strong performance on the RefCOCO benchmarks, highlighting its effectiveness in general-purpose reasoning with precise visual grounding.


Visualization


More Results

For more comprehensive results and detailed analysis, including ablation studies, qualitative examples, and additional benchmarks, please refer to our full paper.

BibTeX

@inproceedings{man2025argus,
  title={Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought},
  author={Yunze Man and De-An Huang and Guilin Liu and Shiwei Sheng and Shilong Liu and Liang-Yan Gui and Jan Kautz and Yu-Xiong Wang and Zhiding Yu},
  booktitle={CVPR},
  year={2025}
}