Wufei Ma1
Luoxin Ye1
Celso M de Melo2
Jieneng Chen1
Alan Yuille1
1Johns Hopkins University
2DEVCOM Army Research Laboratory
Recent 3D-aware VQA benchmarks have demonstrated the limited 3D spatial reasoning abilities of multi-modal large language models. This limitation stems from the scarcity of 3D training data and the bias of current model designs toward 2D data. We systematically study the impact of 3D-informed data, architecture, and training setups, and present SpatialLLM, a multi-modal LLM with advanced 3D spatial reasoning abilities.
Our goal is to identify the best training and data recipe for building spatially-intelligent LLMs that can answer challenging 3D spatial reasoning questions about the 3D locations and poses of objects. To this end, we build the SpatialLLM framework with various training and data setups, as well as a SpatialVQA benchmark for thorough evaluation.
Evaluation data. To analyze the limitations of current multi-modal LLMs and to study the best recipe towards spatially-intelligent LLMs, we follow
SpatialLLM framework. We systematically study the impact of 3D-informed data, architecture, and training setups towards spatially-intelligent LLMs. Specifically, we consider the following:
Our final SpatialLLM model outperforms all previous open-source and proprietary models on 3D spatial reasoning questions in SpatialVQA.
The best recipe towards spatially-intelligent LLMs is a combination of 3D-informed multimodal alignment using probing data and 3D-informed visual instruction tuning on synthetic spatial VQAs. => 3D awareness of the visual encoder is important!
Finetuning CLIP with our 3D-informed data hurts final performance, revealing a trade-off between 3D awareness and generalization abilities.
The data and code are released alongside our latest follow-up work, SpatialReasoner.
License. Our data and code are released under the Creative Commons Attribution 4.0 license. By accessing and using our data and code, you agree to follow the terms of access specified here.
@inproceedings{ma2025spatialllm,
title={SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models},
author={Ma, Wufei and Ye, Luoxin and de Melo, Celso and Yuille, Alan L and Chen, Jieneng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
This website template is adapted from Image Sculpting.