AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

¹Fudan University†, ²Institute of Artificial Intelligence (TeleAI), China Telecom†,
³Tianjin University, ⁴Northwestern Polytechnical University, ⁵City University of Hong Kong

* Equal advising | † Equally leading organizations

Abstract

Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. In addition, we construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning, comprising over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments demonstrate that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Furthermore, real-robot evaluations show that our model effectively supports long-horizon, fine-grained assembly execution, validating its practical applicability in real-world robotic systems.

AssemBench Visualization

Asset categories: Furniture, Daily Objects, Fragments.

Assembly Manual (Step 1)

Nonfreestyle · Before

Nonfreestyle · After

Step Instructions (Step 1)

Precise Instruction

Vague Instruction

Point Cloud (Step 1)

Blue: Assembled part | Gray: Input part | Red: Ground truth

Pose Tokens

9D Pose
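
The pose panels on this page label a translation and a 6D rotation, which suggests that the 9D pose is a 3D translation followed by a 6D rotation. The following is a minimal sketch of decoding such a vector under that assumption: the 6D part is treated as the first two columns of a rotation matrix and orthonormalized with Gram-Schmidt, a common continuous rotation representation. The element ordering, the column convention, and the function names `rotation_6d_to_matrix` and `apply_pose_9d` are illustrative assumptions, not AssemLM's confirmed layout.

```python
import math
from typing import List, Sequence

Vec3 = List[float]


def _normalize(v: Sequence[float]) -> Vec3:
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_6d_to_matrix(r6: Sequence[float]) -> List[Vec3]:
    """Return a row-major 3x3 rotation matrix whose columns are b1, b2, b3.

    Assumption: r6 = [a1 (3 values), a2 (3 values)], two unnormalized columns.
    """
    if len(r6) != 6:
        raise ValueError("expected 6 rotation values")
    a1, a2 = r6[0:3], r6[3:6]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    b2 = _normalize([x - proj * y for x, y in zip(a2, b1)])
    # The third column is fixed by right-handedness.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def apply_pose_9d(pose: Sequence[float],
                  points: Sequence[Sequence[float]]) -> List[Vec3]:
    """Apply p' = R p + t to every point, assuming pose = [t (3), r6 (6)]."""
    if len(pose) != 9:
        raise ValueError("expected 9 pose values")
    t = pose[0:3]
    R = rotation_6d_to_matrix(pose[3:9])
    return [[_dot(R[i], p) + t[i] for i in range(3)] for p in points]
```

Because Gram-Schmidt re-orthonormalizes the two input columns, a slightly noisy 6D prediction still decodes to a valid rotation, which is one reason this representation is often preferred over Euler angles or raw matrices for regression.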

Input Point Cloud · Assembled Part

Input Point Cloud · Assembly Part

Ground-Truth Point Cloud

Static Calibrated RGB-D Captures (Input)

Prediction Visualization

All predictions shown are from the same model (shared weights) on test-set samples that were unseen during training.

Daily Objects Prediction Sample

Blue: Assembled part | Gray: Input part | Red: Ground truth | Green: Prediction

Ground-Truth Pose and Predicted Pose are each shown as Translation and Rotation 6D.

Input Point Cloud

Ground-Truth Point Cloud

Predicted Result
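
The page does not state which metrics are used to compare a predicted pose with the ground truth. As a hedged illustration, one common choice is the Euclidean distance between translations together with the geodesic angle between rotations. The sketch below assumes both rotations are already available as 3×3 row-major matrices; the function names `translation_error` and `rotation_error_deg` are hypothetical.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def translation_error(t_gt: Sequence[float], t_pred: Sequence[float]) -> float:
    """Euclidean distance between ground-truth and predicted translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_gt, t_pred)))


def rotation_error_deg(R_gt: Matrix3, R_pred: Matrix3) -> float:
    """Geodesic angle (degrees) between two rotation matrices.

    Uses angle = arccos((trace(R_gt^T R_pred) - 1) / 2).
    """
    # trace(R_gt^T R_pred) equals the sum of elementwise products.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

For example, comparing the identity with a 90° rotation about the z-axis gives a rotation error of 90°, and translations `[0, 0, 0]` and `[3, 4, 0]` give a translation error of 5 in the input units.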

Real World Experiments Visualization

The visualization showcases real-world robot experiments, featuring paired assembly manuals, real-scene images, and step-wise point clouds (ground truth vs. prediction).

Real-world task

Blue: Assembled part | Red: Ground truth | Green: Prediction

Precise Instruction

Vague Instruction

Real2Sim Assembly Diagram (Before)

Real2Sim Assembly Diagram (After)

Real Scene (Before)

Real Scene (After)

Ground-Truth Point Cloud

Predicted Result

Robot Execution Video

Real2Sim Asset Preview
