Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools such as 3D reconstruction, camera-pose recovery, and novel-view rendering. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MINDCUBE and OMNI3D-BENCH benchmarks demonstrate that pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MINDCUBE. Furthermore, we conduct real-world indoor navigation experiments in which a robot successfully traverses complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
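To make the tool-calling interface concrete, the following is a minimal sketch of the kind of spatial toolkit an MLLM agent could program against; all class names, signatures, and placeholder bodies here are hypothetical illustrations rather than pySpatial's actual API.

```python
# Hypothetical sketch of a spatial-tool interface that an MLLM agent could
# compose Python calls against. Names, signatures, and placeholder bodies are
# illustrative only, not pySpatial's actual implementation.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Scene:
    points: np.ndarray        # (N, 3) reconstructed point cloud
    camera_poses: np.ndarray  # (K, 4, 4) camera-to-world transforms


class SpatialToolkit:
    """Placeholder wrappers around 3D reconstruction, pose recovery, and rendering."""

    def reconstruct(self, images: List[np.ndarray]) -> Scene:
        # A real implementation would call a multi-view reconstruction model;
        # here we return a dummy scene with one identity pose per input frame.
        poses = np.stack([np.eye(4) for _ in images])
        return Scene(points=np.zeros((0, 3)), camera_poses=poses)

    def render_novel_view(self, scene: Scene, pose: np.ndarray) -> np.ndarray:
        # A real implementation would render the reconstructed scene from `pose`;
        # here we return a blank image of fixed size.
        return np.zeros((480, 640, 3), dtype=np.uint8)


def rotate_pose(pose: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate a 4x4 camera-to-world pose about its local up axis by yaw_deg."""
    t = np.deg2rad(yaw_deg)
    rot = np.eye(4)
    rot[0, 0], rot[0, 2] = np.cos(t), np.sin(t)
    rot[2, 0], rot[2, 2] = -np.sin(t), np.cos(t)
    return pose @ rot


if __name__ == "__main__":
    toolkit = SpatialToolkit()
    frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
    scene = toolkit.reconstruct(frames)
    # "Look 90 degrees to the left of the first camera" as a composed tool call.
    new_view = toolkit.render_novel_view(scene, rotate_pose(scene.camera_poses[0], 90.0))
    print(new_view.shape)
```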
| Method | Reference | Overall | Rotation | Among | Around |
|---|---|---|---|---|---|
| *Baseline Models* | | | | | |
| Qwen2.5-VL-3B-Instruct | Bai et al. (2025) | 37.81 | 34.00 | 36.00 | 45.20 | 
| GPT-4o | OpenAI (2024) | 42.29 | 35.00 | 43.00 | 46.40 | 
| *Spatial Mental Models* | | | | | |
| Chain-of-Thought | Yin et al. (2025) | 40.48 | 32.00 | 36.00 | 58.00 | 
| View Interpolation | Yin et al. (2025) | 37.81 | 35.50 | 36.50 | 42.80 | 
| Cognitive Map | Yin et al. (2025) | 41.43 | 37.00 | 41.67 | 44.40 | 
| *Visual Programming Approaches* | | | | | |
| ViperGPT | Suris et al. (2023) | 36.95 | 20.50 | 41.00 | 40.40 | 
| VADAR | Marsili et al. (2025) | 40.76 | 33.50 | 40.67 | 46.80 | 
| VADAR w/ Recon. | - | 35.62 | 31.00 | 36.83 | 36.40 | 
| pySpatial (Ours) | - | 62.67 | 41.00 | 65.00 | 66.33 | 
To further illustrate the capabilities of our pySpatial framework, we conduct qualitative experiments on representative examples from the MINDCUBE benchmark. As shown in Figure 2, each query is paired with the generated 3D visual program, the reconstructed 3D scene, the program outputs, and the final response produced by pySpatial. These examples show how pySpatial enables MLLMs to reason explicitly within an explorable 3D scene reconstructed from sparse 2D inputs. By synthesizing executable visual programs that perform operations such as camera translation, rotation, and novel view synthesis, the framework grounds the reasoning process in geometric evidence. Across diverse spatial reasoning tasks, pySpatial produces responses that closely align with ground-truth annotations, highlighting the effectiveness of our approach. Notably, the generated 3D visual programs include well-structured comments that capture pySpatial's reasoning process, making them transparent and easy for researchers to verify, debug, or modify.
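For concreteness, a generated 3D visual program of the kind described above might read as follows. This is an illustrative mock-up, not a verbatim output of pySpatial: the function, object coordinates, and camera pose are invented for the sketch, with step-by-step comments standing in for the reasoning annotations the framework produces.

```python
# Illustrative mock-up of a generated 3D visual program that answers a
# relative-direction query from reconstructed geometry. All values and helper
# names are hypothetical, not outputs of pySpatial.
import numpy as np


def is_left_of(cam_from_world: np.ndarray,
               obj_a_world: np.ndarray,
               obj_b_world: np.ndarray) -> bool:
    """Answer 'is object A to the left of object B?' from the camera's viewpoint."""
    # Step 1: lift world-frame centroids into the camera frame, where +x points
    # to the camera's right, so a smaller x coordinate means further to the left.
    a_cam = cam_from_world @ np.append(obj_a_world, 1.0)
    b_cam = cam_from_world @ np.append(obj_b_world, 1.0)
    # Step 2: compare horizontal coordinates in the camera frame.
    return a_cam[0] < b_cam[0]


if __name__ == "__main__":
    # Step 0: an assumed world-to-camera pose and two object centroids that a
    # reconstruction tool might have produced (coordinates in meters).
    cam_from_world = np.eye(4)
    chair = np.array([-0.8, 0.0, 2.5])   # hypothetical chair centroid
    table = np.array([0.6, 0.0, 2.5])    # hypothetical table centroid
    print("chair left of table:", is_left_of(cam_from_world, chair, table))
```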
| Method | Reference | Overall | Rotation | Among | Around |
|---|---|---|---|---|---|
| *Baseline* | | | | | |
| Random (chance) | - | 32.35 | 36.36 | 32.29 | 30.66 | 
| Random (frequency) | - | 33.02 | 38.30 | 32.66 | 35.79 | 
| *Open-Weight Multi-Image Models* | | | | | |
| LLaVA-OneVision-7B | Li et al. (2025) | 47.43 | 36.45 | 48.42 | 44.09 | 
| LLaVA-Video-Qwen-7B | Zhang et al. (2025) | 41.96 | 35.71 | 43.55 | 30.12 | 
| mPLUG-Owl3-7B-241101 | Ye et al. (2025) | 44.85 | 37.84 | 47.11 | 26.91 | 
| InternVL2.5-8B | Chen et al. (2024) | 18.68 | 36.45 | 18.20 | 13.11 | 
| Qwen2.5-VL-7B-Instruct | Bai et al. (2025) | 29.26 | 38.76 | 29.50 | 21.35 | 
| Qwen2.5-VL-3B-Instruct | Bai et al. (2025) | 33.21 | 37.37 | 33.26 | 30.34 | 
| DeepSeek-VL2-Small | Lu et al. (2024) | 47.62 | 37.00 | 50.38 | 26.91 | 
| *Proprietary Models* | | | | | |
| GPT-4o | OpenAI (2024) | 38.81 | 32.65 | 40.17 | 29.16 | 
| GPT-4.1-mini | OpenAI (2025) | 45.62 | 37.84 | 47.22 | 34.56 | 
| Claude-4-Sonnet | Anthropic (2025) | 44.75 | 48.42 | 44.21 | 47.62 | 
| *Specialized Spatial Models* | | | | | |
| RoboBrain | Ji et al. (2025) | 37.38 | 35.80 | 38.28 | 29.53 | 
| SpaceMantis | Chen et al. (2024) | 22.81 | 37.65 | 21.26 | 29.32 | 
| Spatial-MLLM | Wu et al. (2025) | 32.06 | 38.39 | 20.92 | 32.82 | 
| Space-Qwen | Chen et al. (2024) | 33.28 | 38.02 | 33.71 | 26.32 | 
| VLM-3R | Fan et al. (2025) | 42.09 | 36.73 | 44.22 | 24.45 | 
| pySpatial (Ours) | - | 58.56 | 43.20 | 60.54 | 48.10 | 
To test the potential for real-world deployment driven purely by MLLMs, we employ a quadrupedal robot with a velocity-tracking controller in a 50-square-meter, two-room laboratory. In this setup, the MLLM generates high-level position commands, which are manually converted into temporal velocity targets for the controller to track, enabling the robot to navigate from an initial pose to a target object (a mushroom toy). From limited 2D views, pySpatial reconstructs an explorable 3D scene, infers camera poses via visual programming, and generates a structured motion plan for the robot to execute.
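As a sketch of the hand-off described above, a structured motion plan of turn/move steps could be expanded into timed velocity targets as shown below. The plan schema, action names, speeds, and timing rule are our assumptions for illustration, not the exact format used in the experiment.

```python
# Minimal sketch of converting a high-level motion plan into timed velocity
# targets for a velocity-tracking controller. The plan schema, speeds, and
# timing rule are assumptions for illustration only.
from typing import List, Tuple

LINEAR_SPEED = 0.3   # assumed forward speed (m/s)
ANGULAR_SPEED = 0.5  # assumed turn rate (rad/s)


def plan_to_velocity_targets(plan: List[Tuple[str, float]]
                             ) -> List[Tuple[float, float, float]]:
    """Expand (action, amount) steps into (duration_s, v_x, omega_z) targets."""
    targets = []
    for action, amount in plan:
        if action == "move_forward":      # amount in meters
            targets.append((amount / LINEAR_SPEED, LINEAR_SPEED, 0.0))
        elif action == "turn_left":       # amount in radians
            targets.append((amount / ANGULAR_SPEED, 0.0, ANGULAR_SPEED))
        elif action == "turn_right":      # amount in radians
            targets.append((amount / ANGULAR_SPEED, 0.0, -ANGULAR_SPEED))
        else:
            raise ValueError(f"unknown action: {action}")
    return targets


if __name__ == "__main__":
    # A hypothetical plan: pass the doorway, turn left, approach the toy.
    plan = [("move_forward", 2.0), ("turn_left", 1.57), ("move_forward", 1.2)]
    for duration, v_x, omega_z in plan_to_velocity_targets(plan):
        print(f"track v_x={v_x:.2f} m/s, omega_z={omega_z:+.2f} rad/s "
              f"for {duration:.1f} s")
```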
As shown in Figure 3, pySpatial successfully guides the robot through doorways, around the correct turns, and finally to the goal location. Notably, the MLLM baseline GPT-4.1 struggles to resolve relative directions such as left versus right and fails to provide absolute metric distance estimates, leading to navigation errors. In contrast, our agent outputs precise rotations and translations that align with real-world execution, resulting in reliable task completion. This experiment demonstrates that our approach not only produces coherent spatial reasoning on question-answering benchmarks but also transfers effectively to physical robotic platforms for complex indoor navigation tasks.
In this work, we present pySpatial, a visual programming framework that enhances the spatial reasoning capabilities of MLLMs through zero-shot Python code generation. By composing functions such as 3D reconstruction and novel-view synthesis, pySpatial converts 2D image sequences into explorable 3D scenes, enabling explicit reasoning in 3D space. Experiments on the MINDCUBE and OMNI3D-BENCH benchmarks demonstrate that pySpatial consistently outperforms strong MLLM baselines, with gains of up to 12.94% on MINDCUBE over GPT-4.1-mini. Beyond benchmarks, real-world indoor navigation experiments further validate its practicality, showing that robots can successfully traverse complex environments using route plans generated by pySpatial.