X Square Robot Open-Sources WALL-WM, Shifting Robot World Modeling From Chunks to Events

Publish date: 29 May 2026

WALL-WM teaches robots to model meaningful physical events, not just fixed slices of time.

For a robot, the physical world does not change in neat, fixed-length clips. A task may look continuous on camera, but the important moments are discrete: approaching an object, making contact, grasping it, lifting it, moving it, and placing it down. These are the moments where the world actually changes.

SHENZHEN, China, May 29, 2026 /PRNewswire/ -- X Square Robot today announced the open-source release of WALL-WM, a World Action Model for general-purpose embodied AI. The model is designed around a simple idea: robot world models should learn from meaningful action events, rather than treating every fixed time window as equally important.

Many existing Vision-Language-Action and World Action Model systems start from multimodal or video foundation models, then train on fixed-length action chunks conditioned on the current observation and instruction. This is convenient for batching and deployment, but it creates a mismatch. Language describes goals and events, vision evolves through continuous scene dynamics, and robot actions operate at control-level timescales.

WALL-WM addresses this mismatch by organizing both supervision and data around action-grounded semantic events. These are temporally coherent executable behaviors such as reaching, grasping, lifting, moving, and placing. Each event can be described in language, observed in video, and realized through action, making it a natural bridge across modalities.

Robots learn better when tasks are divided by what physically happens

The central shift in WALL-WM is from chunk-centric optimization to event-grounded VLA pretraining. Instead of cutting robot behavior by an external clock, the model learns from segments that begin and end when the underlying executable behavior changes.

A fixed-length chunk may split one behavior in half or merge several behaviors into a single target. WALL-WM uses event captions paired with corresponding video and action segments, then trains a video-action denoiser over event-aligned intervals. In other words, events are not just labels added on top of training; they define the unit of learning itself.
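The difference between the two segmentation schemes can be illustrated with a minimal Python sketch. It assumes each control step already carries an event label; the function names and the toy labels are hypothetical and are not taken from the WALL-WM release.

```python
from itertools import groupby

def fixed_chunks(labels, chunk_len):
    """Cut a trajectory into fixed-length [start, end) windows (clock-based)."""
    return [(start, min(start + chunk_len, len(labels)))
            for start in range(0, len(labels), chunk_len)]

def event_segments(labels):
    """Cut a trajectory wherever the per-step event label changes."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments

def events_in(labels, span):
    """List the distinct events inside a [start, end) span, in order of appearance."""
    start, end = span
    return list(dict.fromkeys(labels[start:end]))

# Toy trajectory: 12 control steps with hypothetical event labels.
labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 4 + ["place"] * 3

for span in fixed_chunks(labels, chunk_len=4):
    print("chunk", span, "covers", events_in(labels, span))
for label, start, end in event_segments(labels):
    print("event", label, "spans steps", start, "to", end)
```

In this toy trajectory every 4-step chunk straddles two events, while each event segment contains exactly one behavior and has its own natural length.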

The architecture keeps video priors intact while adding action understanding

WALL-WM uses a prior-aligned video-action architecture. Its video tower is inherited from a Wan series text-to-video model, preserving pretrained visual dynamics. A randomly initialized action DiT is then coupled to the video tower layer by layer.

At each block, the action stream cross-attends to matched video features without modifying the video stream. This design helps the model acquire executable action dynamics while reducing the risk of overwriting the visual-semantic prior inherited from large-scale video pretraining.
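The one-way coupling described above can be sketched in plain Python. This is a simplified illustration, not the WALL-WM implementation: it assumes identity query, key, and value projections, a single attention head, and a residual update, and the names `cross_attend` and `coupled_forward` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(action_tokens, video_tokens):
    """Each action token reads from the video tokens; video tokens are never written."""
    dim = len(video_tokens[0])
    updated = []
    for query in action_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in video_tokens]
        weights = softmax(scores)
        context = [sum(w * value[d] for w, value in zip(weights, video_tokens))
                   for d in range(dim)]
        # Residual update of the action stream only.
        updated.append([q + c for q, c in zip(query, context)])
    return updated

def coupled_forward(video_layers, action_tokens, video_tokens):
    """Run both streams layer by layer; information flows video -> action only."""
    for video_layer in video_layers:
        video_tokens = video_layer(video_tokens)  # independent of the action stream
        action_tokens = cross_attend(action_tokens, video_tokens)
    return video_tokens, action_tokens
```

Because the video stream never reads from the action stream, its output is identical whatever action tokens are supplied, which is the property that protects the pretrained video prior in this kind of design.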

Multiple cameras can work together without becoming a generic feature mixer

For multi-view robot perception, WALL-WM extends the inherited single-view video tower to multi-view and multi-embodiment settings. It adds cross-view attention over per-frame multi-view tokens, Camera RoPE for learnable camera identity, and geometry-aware training masks.

A sight-cone attention mask restricts cross-view attention to physically plausible co-visible regions, while tube patch masking hides spatio-temporal regions in one view and forces the model to recover them from other views. Both mechanisms are used during training and removed at inference, allowing rollout to remain calibration-free.
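As a rough illustration of tube patch masking, the sketch below hides one rectangular block of patches in a single randomly chosen view and repeats that block across every frame. The tube shape, the choice to span all frames, and the function name `tube_mask` are assumptions made for illustration; the release does not specify these details.

```python
import random

def tube_mask(num_views, num_frames, grid_h, grid_w, tube_h, tube_w, rng):
    """Build mask[view][frame][row][col], where True marks a hidden patch.

    One view is chosen at random and loses the same spatial block of patches
    in every frame, forming a spatio-temporal "tube". Other views stay visible.
    """
    mask = [[[[False] * grid_w for _ in range(grid_h)]
             for _ in range(num_frames)]
            for _ in range(num_views)]
    view = rng.randrange(num_views)
    top = rng.randrange(grid_h - tube_h + 1)
    left = rng.randrange(grid_w - tube_w + 1)
    for frame in range(num_frames):
        for row in range(top, top + tube_h):
            for col in range(left, left + tube_w):
                mask[view][frame][row][col] = True
    return mask, view

# Training-time use only; at inference no mask would be applied.
mask, hidden_view = tube_mask(num_views=3, num_frames=4, grid_h=6, grid_w=6,
                              tube_h=2, tube_w=3, rng=random.Random(0))
```

Because the hidden region persists over time in one view, a model trained this way cannot fill it in from neighboring frames of the same camera and has to rely on the other views.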

One backbone supports both event-level planning and standard robot control

From the same event-pretrained backbone, WALL-WM supports two complementary inference modes.

In Event Mode, a VLM, human, or agent provides the next-event description, and WALL-WM executes the corresponding variable-length video-action segment before observing the next state. This lets execution follow the natural duration of the task rather than a rigid horizon.
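The Event Mode control flow can be sketched as a simple loop. The callables `planner`, `executor`, and `observe` are hypothetical stand-ins for the VLM (or human or agent), the WALL-WM rollout, and the camera read; none of these names come from the released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Segment:
    event: str          # next-event description supplied by the planner
    actions: List[str]  # variable-length action sequence produced for that event

def run_event_mode(
    planner: Callable[[object, List[Segment]], Optional[str]],
    executor: Callable[[str, object], Segment],
    observe: Callable[[], object],
    max_events: int = 10,
) -> List[Segment]:
    """Plan one event, execute it to completion, then observe again."""
    history: List[Segment] = []
    observation = observe()
    for _ in range(max_events):
        event = planner(observation, history)
        if event is None:  # the planner signals that the task is finished
            break
        segment = executor(event, observation)  # its length is set by the event
        history.append(segment)
        observation = observe()  # re-observe only after the event ends
    return history
```

The point of the loop is that re-observation happens at event boundaries rather than every fixed number of control steps, so a one-step grasp and a long carry are each executed as a single unit.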

In Unified Mode, WALL-WM retains conventional fixed-length chunk inference, but the chunk is conditioned on event-structured reasoning from a VLM with Staircase Decoding. This allows that reasoning to guide local robot control while preserving a gradient-continuous VLA path.

The data pipeline is organized around events, behaviors, and recovery cases

WALL-WM is trained with an event-grounded data ecosystem spanning internet video, egocentric human video, robot-free UMI-style recordings, heterogeneous teleoperation data, open robot datasets, self-collected video-action data, and targeted recovery augmentation.

The data is annotated across multiple temporal scales, including Task, Subtask, Action, and Segment. This allows the model to learn not only nominal behavior, but also corrections, re-grasps, and other non-nominal trajectories that are important for real-world deployment.

The training pipeline also combines vision-language clustering with action clustering to balance diverse behaviors, scenes, and task structures, rather than letting the largest or easiest data categories dominate learning.
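One simple way to balance data across two clusterings is inverse-frequency sampling over their joint cells. The sketch below assumes each sample already has a vision-language cluster ID and an action cluster ID; the weighting rule and the name `balanced_weights` are illustrative assumptions, not the procedure used by WALL-WM.

```python
from collections import Counter

def balanced_weights(vl_clusters, action_clusters):
    """Give every (vision-language, action) cluster pair equal total sampling mass.

    Returns one weight per sample; the weights sum to 1.
    """
    keys = list(zip(vl_clusters, action_clusters))
    counts = Counter(keys)
    mass_per_cluster = 1.0 / len(counts)
    return [mass_per_cluster / counts[key] for key in keys]

# Hypothetical cluster assignments for five samples.
vl = ["kitchen", "kitchen", "kitchen", "kitchen", "desk"]
act = ["pick", "pick", "pick", "pour", "pick"]
weights = balanced_weights(vl, act)
```

Here the three kitchen-pick samples share one third of the sampling mass, while the single kitchen-pour and desk-pick samples each receive a full third, so the largest category no longer dominates.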

The training stack is built for large-scale event-grounded pretraining

To support large-scale training, WALL-WM adopts Muon-optimizer-based infrastructure and implements DMuon, a distributed realization of Muon designed for hybrid parallelism. The paper notes that a naive optimizer step can become a major bottleneck; DMuon reduces that step to a secondary cost.
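For context, Muon orthogonalizes each weight matrix's update with a Newton-Schulz iteration, which requires matrix products over the full matrix at every optimizer step. The sketch below shows that core operation using the classic cubic iteration; public Muon implementations use a tuned higher-order polynomial with fewer steps, and DMuon's distribution strategy is not described in this release and is not shown here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(grad, steps=20):
    """Push all singular values of `grad` toward 1 with cubic Newton-Schulz.

    Iterates X <- 1.5 * X - 0.5 * (X X^T) X after scaling by the Frobenius norm,
    which keeps every singular value inside the convergence region.
    """
    norm = math.sqrt(sum(v * v for row in grad for v in row)) or 1.0
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_p)]
             for row_x, row_p in zip(x, xxt_x)]
    return x

# Toy 2x3 update matrix; the rows of the result are close to orthonormal.
update = orthogonalize([[3.0, 0.0, 1.0], [1.0, 2.0, 0.0]])
```

Each iteration needs the whole matrix, so under sharded parallelism a straightforward implementation would have to gather shards before it can run, which is one plausible reason the step becomes expensive at scale.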

WALL-WM also uses multi-event sequence packing to improve efficiency with variable-length event data. For deployment, the model applies distribution-matching distillation to reduce denoising steps and FP8 quantization to reduce per-step compute and memory cost.
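Sequence packing for variable-length events can be illustrated with a first-fit-decreasing heuristic. The token budget, the heuristic, and the name `pack_events` are assumptions chosen for illustration; the release does not say which packing strategy WALL-WM uses.

```python
def pack_events(lengths, capacity):
    """Pack event segments into sequences of at most `capacity` tokens.

    Uses first-fit decreasing: longest events first, each placed in the first
    sequence with enough room. Returns a list of sequences of event indices.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs, free = [], []
    for i in order:
        if lengths[i] > capacity:
            raise ValueError(f"event {i} is longer than the sequence capacity")
        for p, room in enumerate(free):
            if lengths[i] <= room:
                packs[p].append(i)
                free[p] -= lengths[i]
                break
        else:  # no existing sequence had room, so open a new one
            packs.append([i])
            free.append(capacity - lengths[i])
    return packs

# Six hypothetical events with lengths in tokens, packed into 8-token sequences.
lengths = [7, 3, 5, 2, 4, 1]
packs = pack_events(lengths, capacity=8)
```

In this example the six events fit into three sequences, using 22 of 24 token slots (about 92 percent), whereas padding each event to 8 tokens would fill only 22 of 48 slots (about 46 percent).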

Benchmarks point to stronger physical prediction and broader generalization

Experiments show that WALL-WM generalizes across language, scenes, and tasks, achieving state-of-the-art performance in a large-scale real-world generalization evaluation.

In embodied video generation, WALL-WM improves embodied-relevant metrics such as Motion Quality, Semantic Consistency, and Physical Plausibility compared with Wan2.1 and Wan2.2, while preserving competitive visual quality.

On the CO3Dv2 3D awareness benchmark, WALL-WM is competitive, with lower point error and depth error than baselines including DINOv2, V-JEPA, CogVideoX, Aether, Open-Sora2.0, and Wan2.1-14B.

Taken together, the results suggest that event-grounded pretraining can improve both executable control and future-observation modeling.

WALL-WM demonstrates that world action modeling should not be limited to predicting the next fixed-length action chunk. For general-purpose robots, the harder question is knowing which physical events matter, how the world changes during those events, and how actions should follow.

By open-sourcing WALL-WM, X Square Robot aims to provide a practical scale-up recipe for general-purpose World Action Models and further research in event-grounded embodied AI.

Code: https://github.com/X-Square-Robot/wall-x

Project: https://x2robot.com/pages/wm

Contact: contact@x2robot.com