TeleBoost organizes post-training for video generation as a "stability-constrained optimization stack": supervised signals first shape the policy, reward-driven reinforcement learning is then introduced, and preference data finally provides finer-grained alignment and correction. The emphasis is not on any single trick, but on treating the failure modes of long-horizon video (error accumulation, prompt deviation, control distortion) as diagnostic targets and converging on them stage by stage. From an engineering perspective, post-training behaves like a tightly coupled system: the structure of the input data, the expressiveness of the reward, advantage estimation and credit assignment, loss details, and the scale of the training system all change the final optimization behavior.
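To make the staged stack concrete, the sketch below wires the three stages into one driver loop. The Stage dataclass, the stage names, and run_stack are illustrative assumptions rather than TeleBoost's actual interfaces; the point is only that the stages share one model and optimizer while swapping the objective and data stream, so failures can be diagnosed per stage.

```python
# Minimal sketch of a staged post-training stack (hypothetical names,
# not the TeleBoost implementation). Each stage reuses the same model
# and optimizer but swaps in its own objective and data stream.
from dataclasses import dataclass
from typing import Any, Callable, Iterable

import torch

@dataclass
class Stage:
    name: str                                                 # "sft", "reward_rl", or "preference"
    steps: int                                                # optimization steps for this stage
    loss_fn: Callable[[torch.nn.Module, Any], torch.Tensor]   # (model, batch) -> scalar loss
    data: Iterable[Any]                                       # stage-specific data stream

def run_stack(model: torch.nn.Module,
              optimizer: torch.optim.Optimizer,
              stages: list[Stage]) -> None:
    """Run the stages in order, logging per-stage loss so failure modes
    can be attributed to a stage rather than blamed on the final objective."""
    for stage in stages:
        data_iter = iter(stage.data)
        for step in range(stage.steps):
            batch = next(data_iter)
            loss = stage.loss_fn(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"[{stage.name}] step {step}: loss={loss.item():.4f}")
```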
1. Why Post-Training Should Be Treated as Systems Engineering
In video generation, simply swapping in a different RL algorithm rarely delivers stable gains. The real bottlenecks usually lie elsewhere: unstable reward signals, error accumulation over long horizons, conflicts between control objectives and visual quality, and the engineering details of the training and sampling pipeline.
2. Key Intuition: Shape the Model First, Then Optimize Rewards, Then Align to Preferences
This is the more stable route: supervised signals provide low-level controllability and basic structure, rewards push explicit target metrics, and preference alignment corrects fine-grained subjective quality and human preferences.
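As a hedged illustration of the middle (reward) stage, the sketch below uses a simple REINFORCE-style update with a batch-mean baseline. Both sequence_log_prob and reward_model are assumed interfaces, and the estimator TeleBoost actually uses may differ; this only shows what "rewards push explicit target metrics" looks like as a loss.

```python
# Illustrative reward-driven update: REINFORCE with a batch-mean baseline.
# `sequence_log_prob` and `reward_model` are assumed interfaces.
import torch

def reward_step(sequence_log_prob, reward_model, samples) -> torch.Tensor:
    """samples: a batch of generated videos plus their prompts.

    Returns a scalar loss whose gradient raises the log-probability of
    samples scoring above the batch average and lowers the rest.
    """
    rewards = reward_model(samples)                            # shape [batch]
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    log_probs = sequence_log_prob(samples)                     # shape [batch]
    # Negative sign: minimizing this loss maximizes expected reward.
    return -(advantages.detach() * log_probs).mean()
```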
3. Key Insights: Post-Training Mostly Amplifies the Ceiling Set by Pre-Training
Under the same post-training setup, different pre-trained backbones respond very differently: some benefit consistently, some gain little, and some even degrade. This suggests that post-training amplifies existing capabilities rather than creating new ones from scratch; its ceiling is largely determined by pre-training.
English Summary
TeleBoost studies post-training for video generation models with a practical goal: improving controllability, prompt adherence, and long-horizon stability without sacrificing visual fidelity. The key message is that video post-training behaves like a coupled system. Stable gains depend on how data, objectives (rewards or preferences), and sequence-level credit assignment interact during optimization.
Problem
Long video generation amplifies small errors into visible failures: identity drift, spatial inconsistency, motion collapse, and gradual prompt deviation. Simple fine-tuning often trades one failure mode for another.
Core Idea
Organize post-training as a staged stack. Use supervised signals to stabilize the model and shape basic controllability. Then apply reward-driven optimization to push explicit target behaviors. Finally, use preference signals to refine subjective quality and alignment aspects that are difficult to encode as a scalar reward.
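For the preference stage, one common way to use pairwise preferences without a scalar reward is a DPO-style objective. The sketch below shows a generic form under the assumption that sequence-level log-probabilities are available from the current policy and a frozen reference; it is one option for this stage, not a claim about the exact loss used here.

```python
# Generic DPO-style pairwise preference loss (one option for the
# preference stage; the actual objective may differ).
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen: torch.Tensor,
                    logp_rejected: torch.Tensor,
                    ref_logp_chosen: torch.Tensor,
                    ref_logp_rejected: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """All inputs are sequence-level log-probabilities, shape [batch].

    The loss widens the margin between preferred and rejected samples,
    measured relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```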
Why This Helps
Staging reduces instability. Early supervision constrains the model to produce plausible trajectories. Reward optimization then operates in a safer region, reducing reward hacking and collapse. Preference optimization provides a flexible correction layer for subtle artifacts and alignment mismatches.
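A standard way to keep reward optimization in that safer region is to fold a per-sample KL penalty toward the supervised (stage-one) checkpoint into the reward before computing advantages. The sketch below is a generic RLHF-style shaping term, assumed for illustration rather than taken from this work.

```python
# Generic KL-shaped advantage: reward the target metric, but penalize
# per-sample drift away from the frozen supervised checkpoint.
import torch

def kl_shaped_advantage(policy_logp: torch.Tensor,
                        ref_logp: torch.Tensor,
                        rewards: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """All inputs have shape [batch]; ref_logp comes from the frozen
    supervised checkpoint. Large deviations from the anchor reduce the
    effective reward, which limits reward hacking."""
    kl = (policy_logp - ref_logp).detach()    # per-sample KL estimate
    shaped = rewards - kl_coef * kl
    return shaped - shaped.mean()             # centered advantages
```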
Practical Takeaways
Evaluate failures by category (coherence, controllability, fidelity) and select interventions that target the dominant failure mode. In long-horizon settings, reward design and sequence-level credit assignment can matter as much as the optimizer choice.
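One concrete form of sequence-level credit assignment, sketched below under the assumption that each video is scored per segment, is to convert per-segment rewards into discounted returns-to-go, so a quality drop late in the clip still assigns credit to earlier segments.

```python
# Sketch of sequence-level credit assignment over a long video: turn
# per-segment rewards into discounted returns-to-go.
import torch

def returns_to_go(segment_rewards: torch.Tensor, gamma: float = 0.98) -> torch.Tensor:
    """segment_rewards: shape [batch, num_segments], one score per video chunk.

    Returns discounted returns with the same shape: segment t is credited
    with its own reward plus the discounted rewards of all later segments,
    so late degradation propagates credit back to earlier segments.
    """
    returns = torch.zeros_like(segment_rewards)
    running = torch.zeros_like(segment_rewards[:, 0])
    for t in reversed(range(segment_rewards.shape[1])):
        running = segment_rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns
```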