AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

Zhi Jing^1,2, Jinbin Qiao^2,3, Ouyang Lu^2,4, Jicong Ao^2, Shuang Qiu^6, Huazhe Xu^5, Yu-Gang Jiang^1,*, Chenjia Bai^2,*

^1 Fudan University^†, ^2 Institute of Artificial Intelligence (TeleAI), China Telecom^†,
^3 Tianjin University, ^4 Northwestern Polytechnical University, ^5 Tsinghua University, ^6 City University of Hong Kong

^* Equal advising | ^† Equally leading organizations

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.

AssemBench Visualization

Asset categories in AssemBench: Furniture | Daily Objects | Fragments

Assembly Manuals (Step 1)

Nonfreestyle · Before

Nonfreestyle · After

Step Instructions (Step 1)

Precise Instruction

Vague Instruction

Point Clouds (Step 1)

Blue: Assembled part | Gray: Input part | Red: Ground truth

Pose Tokens

9D Pose
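The viewer labels the pose as 9D and the rotation as 6D, which suggests three translation components followed by a 6D rotation. The sketch below illustrates one common way to decode such a vector: the 6D rotation is read as two 3-vectors and orthonormalized with Gram-Schmidt to recover a full rotation matrix. The component order, the column convention, and the helper names (`split_pose_9d`, `rotation_6d_to_matrix`) are assumptions made for illustration and are not taken from the AssemLM code.

```python
import math


def _normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_6d_to_matrix(r6):
    """Gram-Schmidt on two 3-vectors; returns a 3x3 matrix as a list of rows.

    Assumption: the 6 numbers are the (possibly unnormalized) first two
    columns of the rotation matrix.
    """
    a1, a2 = list(r6[:3]), list(r6[3:6])
    b1 = _normalize(a1)
    d = _dot(b1, a2)
    b2 = _normalize([a2[i] - d * b1[i] for i in range(3)])
    b3 = _cross(b1, b2)  # third column completes a right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def split_pose_9d(pose9):
    """Split a 9D pose into (translation, rotation matrix).

    Assumption: layout is [tx, ty, tz, r1, ..., r6].
    """
    if len(pose9) != 9:
        raise ValueError("expected 9 values")
    translation = list(pose9[:3])
    rotation = rotation_6d_to_matrix(pose9[3:])
    return translation, rotation
```

Because the second vector is re-orthogonalized against the first, the input does not need to be exactly orthonormal, which is one reason this representation is often used for regressed rotations.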

Input Point Cloud · Assembled Part

Input Point Cloud · Assembly Part

Ground-Truth Point Cloud

Static Calibrated RGB-D Captures (Input)

Prediction Visualization

All predictions shown are from the same model (shared weights) on test-set samples that were unseen during training.

Daily Objects Prediction Sample

Blue: Assembled part | Gray: Input part | Red: Ground truth | Green: Prediction

Ground-Truth Pose: Translation | Rotation 6D

Predicted Pose: Translation | Rotation 6D
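A predicted pose can be compared with the ground-truth pose by measuring translation and rotation error separately. The sketch below uses two widely used measures, the Euclidean distance between translations and the geodesic angle between rotation matrices. The page does not state which metrics AssemLM reports, so these choices and the function names are illustrative assumptions only.

```python
import math


def translation_error(t_pred, t_gt):
    """Euclidean distance between two 3D translations."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_gt)))


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (row lists).

    Uses trace(R_gt^T R_pred) = sum_ij R_gt[i][j] * R_pred[i][j].
    """
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

Rotations given in the 6D form would first need to be converted to rotation matrices before calling `rotation_error_deg`.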

Input Point Cloud

Ground-Truth Point Cloud

Predicted Result
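A predicted-result point cloud of this kind can be produced by applying a rigid transform to the input part's points. The sketch below shows that step as p' = R p + t in plain Python. It assumes the pose maps points from the input frame to the assembled frame, which the page does not specify, and `transform_points` is a hypothetical helper name.

```python
def transform_points(points, R, t):
    """Apply p' = R @ p + t to every 3D point.

    points: iterable of [x, y, z]; R: 3x3 rotation as a list of rows;
    t: [tx, ty, tz]. The frame convention is an assumption.
    """
    return [
        [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
        for p in points
    ]


# Example: rotate 90 degrees about z, then shift by one unit along x.
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
moved = transform_points([[1.0, 0.0, 0.0]], Rz90, [1.0, 0.0, 0.0])
```

Applying the ground-truth pose and the predicted pose to the same input points gives the two clouds that the red and green overlays compare.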

Real World Experiments Visualization

The visualization showcases real-world robot experiments, featuring paired assembly manuals, real-scene images, and step-wise point clouds (ground truth vs. prediction).

Real-world task

Blue: Assembled part | Red: Ground truth | Green: Prediction

Precise Instruction

Vague Instruction

Real2Sim Assembly Diagram (Before)

Real2Sim Assembly Diagram (After)

Real Scene (Before)

Real Scene (After)

Ground-Truth Point Cloud

Predicted Result

Robot Execution Video

Real2Sim Asset Preview

Citation

@article{jing2026assemlm,
  title={AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly},
  author={Jing, Zhi and Qiao, Jinbin and Lu, Ouyang and Ao, Jicong and Qiu, Shuang and Jiang, Yu-Gang and Bai, Chenjia},
  journal={arXiv preprint arXiv:2604.08983},
  year={2026}
}
