Wufei Ma1
Luoxin Ye1
Celso M de Melo2
Jieneng Chen1
Alan Yuille1
1Johns Hopkins University
2DEVCOM Army Research Laboratory
Recent 3D-aware VQA benchmarks have demonstrated the limited 3D spatial reasoning abilities of multi-modal large language models. This limitation stems from the scarcity of 3D training data and the bias of current model designs toward 2D data. We systematically study the impact of 3D-informed data, architecture, and training setups, and present SpatialLLM, a multi-modal LLM with advanced 3D spatial reasoning abilities.
Our goal is to identify the best training and data recipe for building spatially-intelligent LLMs that can answer challenging 3D spatial reasoning questions about the 3D locations and poses of objects. To this end, we build the SpatialLLM framework with various training and data setups, as well as a SpatialVQA benchmark for thorough evaluation.
Evaluation data. To analyze the limitations of current multi-modal LLMs and to study the best recipe towards spatially-intelligent LLMs, we follow
SpatialLLM framework. We systematically study the impact of 3D-informed data, architecture, and training setups towards spatially-intelligent LLMs. Specifically, we consider the following:
Our final SpatialLLM model outperforms all previous open-source and proprietary models on 3D spatial reasoning questions in SpatialVQA.
The best recipe towards spatially-intelligent LLMs is a combination of 3D-informed multimodal alignment using probing data and 3D-informed visual instruction tuning on synthetic spatial VQAs. => 3D awareness of the visual encoder is important!
Finetuning CLIP with our 3D-informed data hurts final performance, revealing a trade-off between 3D awareness and generalization abilities.
The data and code are released alongside our latest follow-up work, SpatialReasoner.
License. Our data and code are released under the Creative Commons Attribution 4.0 license. By accessing and using our data and code, you agree to follow the terms of access specified here.
@inproceedings{ma2025spatialllm,
title={SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models},
author={Ma, Wufei and Ye, Luoxin and de Melo, Celso and Yuille, Alan L and Chen, Jieneng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
This website template is adapted from Image Sculpting.