AssemLM

A Spatial Reasoning Multimodal Large Language Model
for Robotic Assembly

1Fudan University, 2Institute of Artificial Intelligence (TeleAI), China Telecom,
3Tianjin University, 4Northwestern Polytechnical University, 5Tsinghua University, 6City University of Hong Kong
* Equal advising  |  Equally leading organizations

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.

AssemBench Visualization

AssemBench assets can be browsed in the following categories:

Furniture
Daily Objects
Fragments

Assembly Manuals (Step 1)

Colored · Before
Colored · After
Freestyle · Before
Freestyle · After
Lineart · Before
Lineart · After
Nonfreestyle · Before
Nonfreestyle · After

Step Instructions (Step 1)

Precise Instruction
Vague Instruction

Point Clouds (Step 1)

Blue: Assembled part | Gray: Input part | Red: Ground truth
Pose Tokens
9D Pose
Input Point Cloud · Assembled Part
Input Point Cloud · Assembly Part
Ground-Truth Point Cloud
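
The "9D Pose" label above can be read as a 3D translation plus a 6D rotation. The sketch below shows one way to turn such a vector into a rigid transform and apply it to a part's point cloud. It assumes the widely used continuous 6D rotation representation (two 3-vectors orthonormalized by Gram-Schmidt into the first two columns of the rotation matrix) and assumes the nine numbers are ordered translation first. Neither convention is confirmed by this page, and the function names are hypothetical.

```python
import math

def _normalize(v):
    # Scale a 3-vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _cross(a, b):
    # Right-handed cross product of two 3-vectors.
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rot6d_to_matrix(r6):
    """Map a 6D rotation [a1, a2] to a 3x3 rotation matrix (assumed convention).

    b1 = normalize(a1); b2 = normalize(a2 - (b1 . a2) b1); b3 = b1 x b2.
    The matrix is returned as a list of rows whose columns are b1, b2, b3.
    """
    a1, a2 = r6[:3], r6[3:6]
    b1 = _normalize(a1)
    d = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - d * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

def apply_pose(pose9, points):
    """Apply p' = R p + t to each point; pose9 = [tx, ty, tz, r6...] (assumed order)."""
    t, R = pose9[:3], rot6d_to_matrix(pose9[3:9])
    return [[sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            for p in points]
```

For example, `apply_pose([1, 2, 3, 1, 0, 0, 0, 1, 0], [[1, 0, 0]])` uses an identity rotation and simply shifts the point by the translation. Because the Gram-Schmidt step re-orthonormalizes its input, the six rotation values do not need to be exactly orthonormal for the result to be a valid rotation.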

Prediction Visualization

All predictions shown are from the same model (shared weights) on test-set samples that were unseen during training.

Daily Objects Prediction Sample

Blue: Assembled part | Gray: Input part | Red: Ground truth | Green: Prediction
Ground-Truth Pose
Translation
Rotation 6D
Predicted Pose
Translation
Rotation 6D
Input Point Cloud
Ground-Truth Point Cloud
Predicted Result
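
To compare the ground-truth and predicted poses listed above numerically, one common choice is a Euclidean translation error together with a geodesic rotation error (the angle of the relative rotation). The sketch below is illustrative only: the page does not state which metrics the authors report, the 6D-to-matrix convention is the same Gram-Schmidt assumption commonly used for this representation, and all function names are hypothetical.

```python
import math

def rot6d_to_matrix(r6):
    # Assumed convention: Gram-Schmidt on two 3-vectors gives columns b1, b2;
    # b3 = b1 x b2 completes a right-handed rotation matrix (list of rows).
    a1, a2 = r6[:3], r6[3:6]
    n1 = math.sqrt(sum(x * x for x in a1))
    b1 = [x / n1 for x in a1]
    d = sum(x * y for x, y in zip(b1, a2))
    u2 = [y - d * x for x, y in zip(b1, a2)]
    n2 = math.sqrt(sum(x * x for x in u2))
    b2 = [x / n2 for x in u2]
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

def translation_error(t_gt, t_pred):
    # Euclidean distance between the two translation vectors.
    return math.dist(t_gt, t_pred)

def rotation_error_deg(r6_gt, r6_pred):
    # Geodesic distance on SO(3): theta = arccos((trace(Rg^T Rp) - 1) / 2).
    Rg, Rp = rot6d_to_matrix(r6_gt), rot6d_to_matrix(r6_pred)
    # trace(Rg^T Rp) equals the element-wise product summed over all entries.
    tr = sum(Rg[i][j] * Rp[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))
```

Note that this rotation error ignores part symmetry: for a symmetric part, several distinct rotations can yield the same assembled geometry, so a point-cloud distance between the predicted and ground-truth results may be a useful complement.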

Real World Experiments Visualization

The visualization showcases real-world robot experiments, featuring paired assembly manuals, real-scene images, and step-wise point clouds (ground truth vs. prediction).

Blue: Assembled part | Red: Ground truth | Green: Prediction
Precise Instruction
Vague Instruction
Real2Sim Assembly Diagram (Before)
Real2Sim Assembly Diagram (After)
Real Scene (Before)
Real Scene (After)
Ground-Truth Point Cloud
Predicted Result
Robot Execution Video
Real2Sim Asset Preview

Citation

@article{jing2026assemlm,
  title={AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly},
  author={Jing, Zhi and Qiao, Jinbin and Lu, Ouyang and Ao, Jicong and Qiu, Shuang and Jiang, Yu-Gang and Bai, Chenjia},
  journal={arXiv preprint arXiv:2604.08983},
  year={2026}
}