Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.
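To make the notion of a 6D assembly pose concrete, the sketch below treats a pose as a 3x3 rotation plus a 3D translation, applies it to a part's point cloud, and compares a predicted pose against a ground-truth pose. It is an illustration only, not AssemLM's implementation or AssemBench's evaluation protocol: the helper names (`rotation_z`, `apply_pose`, `rotation_error_deg`, `translation_error`), the example values, and the choice of geodesic rotation error and Euclidean translation error are all assumptions made for this example.

```python
import math

def rotation_z(angle_rad):
    """Return a 3x3 rotation matrix (nested lists) for a rotation about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(points, rotation, translation):
    """Map every point p of a point cloud to R @ p + t."""
    return [
        tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two rotations, in degrees.

    Uses angle = arccos((trace(R_pred^T R_gt) - 1) / 2).
    """
    trace = sum(r_pred[k][i] * r_gt[k][i] for i in range(3) for k in range(3))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.dist(t_pred, t_gt)

# Hypothetical part point cloud and poses (values chosen only for illustration).
part_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
gt_rotation, gt_translation = rotation_z(math.radians(90.0)), (0.10, 0.00, 0.05)
pred_rotation, pred_translation = rotation_z(math.radians(80.0)), (0.10, 0.01, 0.05)

if __name__ == "__main__":
    placed_gt = apply_pose(part_points, gt_rotation, gt_translation)
    placed_pred = apply_pose(part_points, pred_rotation, pred_translation)
    print("ground-truth placement:", placed_gt)
    print("predicted placement:  ", placed_pred)
    print("rotation error (deg): ", rotation_error_deg(pred_rotation, gt_rotation))
    print("translation error:    ", translation_error(pred_translation, gt_translation))
```

Rotation and translation errors are reported separately here because they carry different units; other metrics, such as point-wise distances between the two placed point clouds, would be equally reasonable choices.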
All predictions shown are from the same model (shared weights) on test-set samples that were unseen during training.
The visualization showcases real-world robot experiments, featuring paired assembly manuals, real-scene images, and step-wise point clouds (ground truth vs. prediction).
@article{jing2026assemlm,
title={AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly},
author={Jing, Zhi and Qiao, Jinbin and Lu, Ouyang and Ao, Jicong and Qiu, Shuang and Jiang, Yu-Gang and Bai, Chenjia},
journal={arXiv preprint arXiv:2604.08983},
year={2026}
}