TeleBoost organizes post-training for video generation as a "stability-constrained optimization stack": supervised signals first shape the policy, reward-driven reinforcement learning is then introduced, and preference data finally provides finer-grained alignment and correction. The emphasis is not on any single trick, but on treating the failure modes of long-horizon video (error accumulation, prompt deviation, control distortion) as diagnostic targets and converging on them stage by stage. From an engineering perspective, post-training behaves like a tightly coupled system: the structure of the input data, the expressiveness of the reward, advantage estimation and credit assignment, loss details, and the scale of the training system all change the final optimization behavior.
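To make the staged stack concrete, the sketch below wires the three stages into one driver loop. The Stage dataclass, the stage names, and run_stack are illustrative assumptions rather than TeleBoost's actual interfaces; the point is only that the stages share one model and optimizer while swapping the objective and data stream, so failures can be diagnosed per stage.

```python
# Minimal sketch of a staged post-training stack (hypothetical names,
# not the TeleBoost implementation). Each stage reuses the same model
# and optimizer but swaps in its own objective and data stream.
from dataclasses import dataclass
from typing import Any, Callable, Iterable

import torch

@dataclass
class Stage:
    name: str                                                 # "sft", "reward_rl", or "preference"
    steps: int                                                # optimization steps for this stage
    loss_fn: Callable[[torch.nn.Module, Any], torch.Tensor]   # (model, batch) -> scalar loss
    data: Iterable[Any]                                       # stage-specific data stream

def run_stack(model: torch.nn.Module,
              optimizer: torch.optim.Optimizer,
              stages: list[Stage]) -> None:
    """Run the stages in order, logging per-stage loss so failure modes
    can be attributed to a stage rather than blamed on the final objective."""
    for stage in stages:
        data_iter = iter(stage.data)
        for step in range(stage.steps):
            batch = next(data_iter)
            loss = stage.loss_fn(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"[{stage.name}] step {step}: loss={loss.item():.4f}")
```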
1. Why Post-Training Should Be Treated as Systems Engineering
In video generation, simply swapping in a different RL algorithm rarely delivers stable gains. The real bottlenecks usually lie elsewhere: unstable reward signals, error accumulation over long horizons, conflicts between control objectives and visual quality, and the engineering details of the training and sampling pipeline.
2. Key Intuition: Shape the Model First, Then Optimize Rewards, Then Align to Preferences
This is the more stable route: supervised signals provide low-level controllability and basic structure, rewards push explicit target metrics, and preference alignment corrects fine-grained subjective quality and human preferences.
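As a hedged illustration of the middle (reward) stage, the sketch below uses a simple REINFORCE-style update with a batch-mean baseline. Both sequence_log_prob and reward_model are assumed interfaces, and the estimator TeleBoost actually uses may differ; this only shows what "rewards push explicit target metrics" looks like as a loss.

```python
# Illustrative reward-driven update: REINFORCE with a batch-mean baseline.
# `sequence_log_prob` and `reward_model` are assumed interfaces.
import torch

def reward_step(sequence_log_prob, reward_model, samples) -> torch.Tensor:
    """samples: a batch of generated videos plus their prompts.

    Returns a scalar loss whose gradient raises the log-probability of
    samples scoring above the batch average and lowers the rest.
    """
    rewards = reward_model(samples)                            # shape [batch]
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    log_probs = sequence_log_prob(samples)                     # shape [batch]
    # Negative sign: minimizing this loss maximizes expected reward.
    return -(advantages.detach() * log_probs).mean()
```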
3. Key Insights: Post-Training Mostly Amplifies the Ceiling Set by Pre-Training
Under the same post-training setup, different pre-trained backbones respond very differently: some benefit consistently, some gain little, and some even degrade. This suggests that post-training amplifies existing capabilities rather than creating new ones from scratch; its ceiling is largely determined by pre-training.
English Summary
TeleBoost studies post-training for video generation models with a practical goal: improving controllability, prompt adherence, and long-horizon stability without sacrificing visual fidelity. The key message is that video post-training behaves like a coupled system. Stable gains depend on how data, objectives (rewards or preferences), and sequence-level credit assignment interact during optimization.
Problem
Long video generation amplifies small errors into visible failures: identity drift, spatial inconsistency, motion collapse, and gradual prompt deviation. Simple fine-tuning often trades one failure mode for another.
Core Idea
Organize post-training as a staged stack. Use supervised signals to stabilize the model and shape basic controllability. Then apply reward-driven optimization to push explicit target behaviors. Finally, use preference signals to refine subjective quality and alignment aspects that are difficult to encode as a scalar reward.
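For the preference stage, one common way to use pairwise preferences without a scalar reward is a DPO-style objective. The sketch below shows a generic form under the assumption that sequence-level log-probabilities are available from the current policy and a frozen reference; it is one option for this stage, not a claim about the exact loss used here.

```python
# Generic DPO-style pairwise preference loss (one option for the
# preference stage; the actual objective may differ).
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen: torch.Tensor,
                    logp_rejected: torch.Tensor,
                    ref_logp_chosen: torch.Tensor,
                    ref_logp_rejected: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """All inputs are sequence-level log-probabilities, shape [batch].

    The loss widens the margin between preferred and rejected samples,
    measured relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```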
Why This Helps
Staging reduces instability. Early supervision constrains the model to produce plausible trajectories. Reward optimization then operates in a safer region, reducing reward hacking and collapse. Preference optimization provides a flexible correction layer for subtle artifacts and alignment mismatches.
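A standard way to keep reward optimization in that safer region is to fold a per-sample KL penalty toward the supervised (stage-one) checkpoint into the reward before computing advantages. The sketch below is a generic RLHF-style shaping term, assumed for illustration rather than taken from this work.

```python
# Generic KL-shaped advantage: reward the target metric, but penalize
# per-sample drift away from the frozen supervised checkpoint.
import torch

def kl_shaped_advantage(policy_logp: torch.Tensor,
                        ref_logp: torch.Tensor,
                        rewards: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """All inputs have shape [batch]; ref_logp comes from the frozen
    supervised checkpoint. Large deviations from the anchor reduce the
    effective reward, which limits reward hacking."""
    kl = (policy_logp - ref_logp).detach()    # per-sample KL estimate
    shaped = rewards - kl_coef * kl
    return shaped - shaped.mean()             # centered advantages
```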
Practical Takeaways
Evaluate failures by category (coherence, controllability, fidelity) and select interventions that target the dominant failure mode. In long-horizon settings, reward design and sequence-level credit assignment can matter as much as the optimizer choice.
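One concrete form of sequence-level credit assignment, sketched below under the assumption that each video is scored per segment, is to convert per-segment rewards into discounted returns-to-go, so a quality drop late in the clip still assigns credit to earlier segments.

```python
# Sketch of sequence-level credit assignment over a long video: turn
# per-segment rewards into discounted returns-to-go.
import torch

def returns_to_go(segment_rewards: torch.Tensor, gamma: float = 0.98) -> torch.Tensor:
    """segment_rewards: shape [batch, num_segments], one score per video chunk.

    Returns discounted returns with the same shape: segment t is credited
    with its own reward plus the discounted rewards of all later segments,
    so late degradation propagates credit back to earlier segments.
    """
    returns = torch.zeros_like(segment_rewards)
    running = torch.zeros_like(segment_rewards[:, 0])
    for t in reversed(range(segment_rewards.shape[1])):
        running = segment_rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns
```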