TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis
The goal of TEDi is to address the challenges of long-term motion generation. Traditional methods often suffer from inconsistencies when stitching diffusion-generated clips together, or they face significant latency issues.
TEDi introduces a motion buffer: a partially denoised motion sequence whose frames carry increasing noise levels, so the frames at the beginning are nearly clean (close to real motion) while the frames at the end are close to pure noise.
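The buffer idea can be sketched as an array whose frames are noised at increasing levels. This is a minimal illustration, not the paper's exact schedule: the linear timestep ramp, the zero-motion placeholder, and the simple noise mix are all assumptions.

```python
import numpy as np

def init_motion_buffer(num_frames, feat_dim, num_steps, seed=0):
    """Build a buffer whose frames carry increasing noise levels:
    frame 0 is nearly clean, the last frame is close to pure noise.
    Shapes and the blending rule are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # One diffusion timestep per frame, growing front-to-back.
    timesteps = np.linspace(1, num_steps, num_frames).round().astype(int)
    # Blend a clean placeholder motion with Gaussian noise; the mix
    # ratio grows with the timestep (stand-in for a real DDPM schedule).
    clean = np.zeros((num_frames, feat_dim))
    alpha = 1.0 - timesteps / num_steps          # 1 = clean, 0 = noise
    noise = rng.standard_normal((num_frames, feat_dim))
    buffer = alpha[:, None] * clean + np.sqrt(1.0 - alpha**2)[:, None] * noise
    return buffer, timesteps
```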
Motion Representation
The model represents motion using the following components for each frame:
- Root joint displacements: Movement in the xz-plane.
- Root joint height: Movement along the y-axis.
- Joint rotations: Represented using a 6D notation for J joints.
- Foot contact labels: A binary matrix L ∈ {0,1}^(K×C) marking, for each of K frames, whether each of the C contact joints (heels and toes) is in contact with the ground.
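A per-frame feature vector can be assembled by concatenating these components. The concrete layout below (xz displacement, then height, then flattened 6D rotations, then contact labels) is an assumption for illustration; the paper may order the channels differently.

```python
import numpy as np

def frame_features(root_xz, root_y, joint_rot6d, contacts):
    """Concatenate the per-frame motion components listed above.
    root_xz:     (K, 2)    xz-plane root displacements
    root_y:      (K, 1)    root height
    joint_rot6d: (K, J, 6) 6D joint rotations
    contacts:    (K, C)    binary foot-contact labels
    Returns a (K, 2 + 1 + 6J + C) feature matrix (layout assumed)."""
    K = root_xz.shape[0]
    return np.concatenate(
        [root_xz, root_y, joint_rot6d.reshape(K, -1), contacts], axis=1)
```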
Method
During training, TEDi adds temporally varied noise to clean motion sequences. Each frame is assigned a different noise level.
- Denoising Timestep Strategy: Since DDPM uses timesteps to determine noise levels, TEDi assigns different timesteps to different frames within a sequence.
- Mixed Strategy: Training mixes two timestep-allocation strategies:
  - Random allocation (probability 2/3): Every frame gets an independent random noise level. This forces the model to learn to denoise individual frames, treating a 500-frame sequence like a batch of 500 independent poses.
  - Monotonically increasing (probability 1/3): Noise levels increase gradually along the sequence, matching the buffer structure used at inference.
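The mixed strategy above can be sketched as a small sampling routine. Sorting random draws to obtain the monotone ramp is an assumed implementation detail, not necessarily the paper's exact schedule.

```python
import numpy as np

def sample_timesteps(num_frames, num_steps, seed=None):
    """Sketch of the mixed training strategy: with probability 2/3
    every frame draws an independent random timestep; otherwise the
    timesteps increase monotonically along the sequence."""
    rng = np.random.default_rng(seed)
    ts = rng.integers(1, num_steps + 1, size=num_frames)
    if rng.random() < 2 / 3:
        return ts              # random allocation, one level per frame
    return np.sort(ts)         # monotonically increasing ramp
```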
Direct Motion Prediction
Unlike standard diffusion models that predict noise, TEDi predicts the motion itself. This allows the use of regularization and physical losses directly in the motion space.
Loss Functions
- Position loss (L_pos): Although motion is represented by rotations, small errors in root rotations propagate non-linearly into large displacements at end effectors (such as the feet). TEDi applies differentiable forward kinematics (FK) to recover 3D joint coordinates and minimizes position error directly, which helps enforce physical constraints such as "feet should not pass through the floor."
- Contact loss: When a foot contact label is active, penalizes non-zero velocity of the corresponding foot joints to prevent foot sliding.
- Rotation loss: A standard L2 loss between predicted and ground-truth joint rotations.
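Illustrative forms of the three losses are sketched below. The weighting, the exact formulations, and the assumption that the first C joints in the position tensors are the contact joints are all hypothetical; in a real setup, the positions would come from a differentiable FK pass over the predicted rotations.

```python
import numpy as np

def tedi_losses(pred_rot, gt_rot, pred_pos, gt_pos, contacts):
    """Sketch of the rotation, position, and contact losses.
    pred_rot, gt_rot: (K, J, 6) 6D rotations
    pred_pos, gt_pos: (K, J, 3) joint positions (e.g. from FK)
    contacts:         (K, C) binary labels; the first C joints are
                      taken as the contact joints (an assumed layout)."""
    l_rot = np.mean((pred_rot - gt_rot) ** 2)        # rotation loss
    l_pos = np.mean((pred_pos - gt_pos) ** 2)        # position loss
    C = contacts.shape[1]
    foot_vel = pred_pos[1:, :C] - pred_pos[:-1, :C]  # per-frame velocity
    # Contact loss: penalize foot motion only where contact is active.
    l_contact = np.mean(contacts[1:, :, None] * foot_vel ** 2)
    return l_rot, l_pos, l_contact
```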
Inference Process
At the start of inference, the motion buffer is initialized with a sequence whose noise level increases over time. At each denoising step the whole buffer becomes one noise level cleaner; the fully denoised frame at the front is popped into the output stream, and a fresh pure-noise frame is appended at the back, so generation can continue indefinitely.
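A minimal sketch of this streaming loop, assuming a denoiser that makes every buffer frame one noise level cleaner per call (the `denoise_step` interface is a placeholder, not the paper's API):

```python
import numpy as np

def stream_motion(denoise_step, buffer, timesteps, total_frames, seed=0):
    """Streaming inference with a TEDi-style buffer (interface assumed).
    denoise_step(buffer, timesteps) returns the buffer with each frame
    one noise level cleaner. After each call, the front frame is fully
    clean: pop it to the output and push a fresh pure-noise frame onto
    the back. The per-position timesteps stay fixed."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < total_frames:
        buffer = denoise_step(buffer, timesteps)
        out.append(buffer[0])                    # clean frame leaves the buffer
        fresh = rng.standard_normal((1, buffer.shape[1]))
        buffer = np.vstack([buffer[1:], fresh])  # pure noise enters at the back
    return np.stack(out)
```

Because the buffer shifts by one frame per step, each buffer position always holds the same noise level, which is why the timestep assignment can stay constant throughout generation.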
Key Capabilities
Controllable Synthesis
- Future-Guided Motion: Users can specify future actions to influence the current motion. By replacing future frames with “noisy target actions,” the model adjusts the preceding frames to lead into that target. To ensure smoothness, frames very close to being “finished” (within 5 steps) are not forced to the target.
- Trajectory Control: The character’s path can be controlled by replacing the root joint displacements and height with a predefined trajectory.
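Trajectory control can be sketched as overwriting the root channels of the buffer with a user-specified path. The channel layout (0:2 = xz displacement, 2 = height) is an assumption; in practice, the injected trajectory would be noised to each frame's current noise level before replacement.

```python
import numpy as np

def control_trajectory(buffer, traj_xz, traj_y):
    """Replace the root channels of the motion buffer with a given
    path (channel layout assumed; non-root channels are untouched)."""
    out = buffer.copy()
    out[:, 0:2] = traj_xz     # root xz displacements
    out[:, 2] = traj_y        # root height
    return out
```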
TEDi supports various motion tasks, including:
- Motion retargeting and style transfer.
- Key-frame based generation and interactive generation.
- Music-driven synthesis and animation layering.
Results and Comparisons
The model is trained for 500k iterations on sequences of 500 frames at 30 fps. It excels at generating very long sequences compared to existing baselines:
- vs. ACRNN: While ACRNN can generate long sequences, it often suffers from “foot sliding” or floating artifacts and tends to collapse quickly after initialization.
- vs. Motion Diffusion Model (MDM): MDM does not natively support long sequences. When using “outpainting” techniques to extend motion, MDM produces significant stitching artifacts at the boundaries.
References
- Phase-Functioned Neural Networks for Character Control.
- Human Motion Diffusion Model.
- Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis.
- Character Controllers Using Motion VAEs.