AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

1Fudan University, 2Institute of Artificial Intelligence (TeleAI), China Telecom,
3Tianjin University, 4Northwestern Polytechnical University, 5City University of Hong Kong
* Equal advising  |  Equally leading organizations

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. In addition, we construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning, comprising over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments demonstrate that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Furthermore, real-robot evaluations show that our model effectively supports long-horizon, fine-grained assembly execution, validating its practical applicability in real-world robotic systems.
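The pipeline described above (point cloud encoder → multimodal fusion → 6D pose prediction) can be sketched in miniature. This is a purely illustrative stand-in, not the paper's implementation: the encoder is stubbed as a fixed random projection with mean pooling, the language model is replaced by simple token averaging, and all names, shapes, and sizes (`encode_points`, `predict_pose`, hidden size `D`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                # hypothetical LM hidden size

def encode_points(pts, n_tokens=8):
    """Stub point cloud encoder: project per-point coordinates into the
    LM feature space and pool them into a few geometric tokens.
    (The real encoder is a learned network capturing fine-grained
    geometric and rotational features.)"""
    W = rng.standard_normal((3, D))
    feats = pts @ W                                   # (P, D) per-point features
    groups = np.array_split(np.arange(len(pts)), n_tokens)
    return np.stack([feats[g].mean(0) for g in groups])  # (n_tokens, D)

def predict_pose(pts, text_tokens):
    """Fuse geometric and text tokens into one sequence, then decode a
    9D pose (3D translation + 6D rotation) -- here with a single linear
    head standing in for the language model."""
    seq = np.concatenate([encode_points(pts), text_tokens])
    W_pose = rng.standard_normal((D, 9))
    return seq.mean(0) @ W_pose                       # (9,) pose vector

pts = rng.standard_normal((128, 3))    # toy input point cloud
text = rng.standard_normal((5, D))     # toy instruction tokens
pose = predict_pose(pts, text)
```

The sketch only shows the data flow: 3D geometry is lifted into the language model's token space so that pose prediction can condition jointly on shape and instruction.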

AssemBench Visualization

Select one of the assets below to visualize its details in AssemBench.

Furniture
Daily Objects
Fragments

Assembly Manual (Step 1)

Colored · Before
Colored · After
Freestyle · Before
Freestyle · After
Lineart · Before
Lineart · After
Nonfreestyle · Before
Nonfreestyle · After

Step Instructions (Step 1)

Precise Instruction
Vague Instruction

Point Cloud (Step 1)

Blue: Assembled part · Gray: Input part · Red: Ground truth
Pose Tokens
9D Pose
Input Point Cloud · Assembled Part
Input Point Cloud · Assembly Part
Ground-Truth Point Cloud
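The "9D Pose" shown above pairs a 3D translation with a 6D rotation representation: the first two columns of the rotation matrix, a common continuous parameterization for learned pose regression. Assuming that convention (the page itself does not spell it out), the 6D part can be turned back into a full rotation matrix with Gram–Schmidt orthonormalization:

```python
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Convert a 6D rotation (first two matrix columns) into a full
    3x3 rotation matrix via Gram-Schmidt orthonormalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)        # normalize the first column
    a2 = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)        # normalize the second column
    b3 = np.cross(b1, b2)               # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)

# A 9D pose concatenates translation (3) with the 6D rotation (illustrative values).
pose9d = np.array([0.1, -0.2, 0.3,     # translation (x, y, z)
                   1.0, 0.0, 0.0,      # first rotation column
                   0.0, 1.0, 0.0])     # second rotation column
t, R = pose9d[:3], rot6d_to_matrix(pose9d[3:])
```

The result is always a valid rotation even when the predicted 6D vector is not exactly orthonormal, which is the main appeal of this representation over Euler angles or raw matrices.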

Prediction Visualization

All predictions shown are from the same model (shared weights) on test-set samples that were unseen during training.

Daily Objects Predict Sample

Category
Select a sample

Blue: Assembled part · Gray: Input part · Red: Ground truth · Green: Prediction
Ground-Truth Pose
Translation
Rotation 6D
Predicted Pose
Translation
Rotation 6D
Input Point Cloud
Ground-Truth Point Cloud
Predicted Result
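Comparing the ground-truth and predicted poses above comes down to two standard errors: Euclidean distance for translation and the geodesic angle on SO(3) for rotation. The paper's exact evaluation metrics are not stated on this page, so the following is a conventional sketch:

```python
import numpy as np

def pose_errors(R_gt, t_gt, R_pred, t_pred):
    """Translation error (Euclidean) and rotation error (geodesic angle,
    in degrees) between a ground-truth and a predicted rigid pose."""
    t_err = np.linalg.norm(t_gt - t_pred)
    # The angle of the relative rotation R_gt^T R_pred is the geodesic
    # distance on SO(3): cos(theta) = (trace(R_rel) - 1) / 2.
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return t_err, r_err

# Example: prediction rotated 90 degrees about z, offset 0.1 m along z.
R90 = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 1.0]])
t_err, r_err = pose_errors(np.eye(3), np.zeros(3), R90, np.array([0.0, 0.0, 0.1]))
```

The `np.clip` guards against floating-point trace values marginally outside [-1, 1], which would otherwise make `arccos` return NaN.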

Real World Experiments Visualization

The visualization showcases real-world robot experiments, featuring paired assembly manuals, real-scene images, and step-wise point clouds (ground truth vs. prediction).

Real-world task

Step

Blue: Assembled part · Red: Ground truth · Green: Prediction
Precise Instruction
Vague Instruction
Real2Sim Assembly Diagram (Before)
Real2Sim Assembly Diagram (After)
Real Scene (Before)
Real Scene (After)
Ground-Truth Point Cloud
Predicted Result
Robot Execution Video
Real2Sim Asset Preview