MotionVerse: Toward Unified Human Motion and Interaction Modeling

Institute of Computing Technology, Chinese Academy of Sciences
MotionVerse task overview

Overview of the motion-related tasks unified under MotionVerse, spanning single-person and multi-person comprehension, generation, and editing.

Abstract

Existing motion-language frameworks primarily focus on isolated tasks such as single-person motion generation or understanding, while more complex scenarios, including human interaction, reaction, and motion editing, are typically addressed using task-specific architectures. Despite sharing common motion semantics and temporal structures, these heterogeneous behaviors are rarely unified within a single modeling paradigm, limiting transferable motion reasoning and general-purpose motion intelligence.

To address this limitation, we present MotionVerse, a unified motion-language framework for modeling both individual and interactive human behaviors. MotionVerse adopts a scalable multi-stream autoregressive formulation for residual motion tokens, enabling efficient long-sequence modeling while preserving fine-grained motion dynamics. To alleviate modality interference during joint optimization, we further introduce a modality-specialized Motion-Fusion mechanism with dedicated adaptation pathways for motion and language modeling.

Extensive experiments across diverse motion-language tasks demonstrate that MotionVerse achieves competitive performance in motion understanding, generation, interaction, reaction, and editing, validating its effectiveness as a scalable and general-purpose framework for unified motion modeling.

Video

Text-to-Motion (T2M)

Single-person motion generation from textual descriptions. Benchmark: MotionX.

"The person uses the doorknob to open the door."

"A worker is inspecting a pyramid."

"A man makes a Battle Motion Straight Punch."

"The person is doing squats."

"A boxer is jabbing, crossing, and throwing scissor-like punches in a rhythmic pattern."

"A person appears to be swimming."

Interactive Text-to-Motion (I-T2M)

Two-person interactive motion generation from text. Benchmarks: InterHuman, InterX.

"Both stand facing each other; one approaches the other and greets with a handshake."

"One person attacks while the other avoids; taekwondo practice with offensive and defensive movements."

"They engage in a fencing bout, where they continuously lunge and parry."

Motion-to-Text (M2T)

Single-person motion captioning. Benchmark: MotionX.

Generated: "Someone is performing a Swivel Jump To Knee Tucks F."

Generated: "A warrior executes a Battle Motion Straight Punch."

Generated: "A woman is serving the tennis ball."

Generated: "A person sits down for casual conversation."

Generated: "A person walks slightly to their left casually."

Interactive Motion-to-Text (I-M2T)

Captioning of two-person interactive motion. Benchmarks: InterHuman, InterX.

Generated: "The other person lashes out with a right leg kick at the first person's waist, forcing the first person to take a step back."

Generated: "The second one attempts to kick the first one, but the first one evades the kick by stepping back with the left leg."

Generated: "The other person rushes forward attempting to pierce one's chest with a sword. One deflects the attack by swinging their own sword."

Generated: "One person extends the left leg to launch an attack; the other pulls back their right leg to avoid."

Generated: "The two are assuming attacking positions facing each other."

Motion-to-Motion (M2M)

Single-person motion prediction and in-betweening. Benchmark: MotionX.

Timeline: GT (given frames)   Model output

Motion Prediction
Motion Prediction
Motion Prediction
Motion In-betweening
Motion In-betweening
Motion In-betweening

Interactive Motion-to-Motion (I-M2M)

Two-person motion prediction and in-betweening. Benchmarks: InterHuman, InterX.

Timeline: GT (given frames)   Model output

Motion Prediction
Motion Prediction
Motion In-betweening
Motion In-betweening
Motion In-betweening

Motion Reaction (React)

Given one person's motion, generate a plausible reaction from the other. Benchmarks: InterHuman, InterX.

■ Blue person = Input motion   ■ Red person = Generated reaction

GT: one pulls the sword back and takes a step away, while the other makes a swinging motion with the left hand.

GT: one shoves the other while the other raises both hands to express powerlessness.

GT: the two step apart and stretch their arms out wide.

GT: two humans embrace each other.

GT: one person strikes the other, who then rotates and moves forward with three steps.

Motion Editing (Edit)

Edit existing motion via natural-language instructions. Benchmark: MotionFix.

Edit Instruction: "MAKE A WIDER TURN"

Source Motion

Edited Result

Edit Instruction: "Do the same faster the second time."

Source Motion

Edited Result

Edit Instruction: "Do a handstand and keep legs open."

Source Motion

Edited Result

Edit Instruction: "Start the squat earlier."

Source Motion

Edited Result

Edit Instruction: "Take smaller careless steps instead of hasty."

Source Motion

Edited Result

Edit Instruction: "Wave a bit slower and sway less."

Source Motion

Edited Result