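RL for visual generation is not as simple as "make the model chase a score." Image, video, and 3D outputs are richly structured, and a single reward may mix semantic alignment, visual quality, physical plausibility, human preference, and safety constraints. The value of this survey is that it systematizes these threads: it first explains the interface between RL and generative modeling, then organizes reward modeling, preference optimization, policy gradient, post-training for diffusion/flow matching, evaluation, and application scenarios. The core insight is that RL in visual generation works more like a "language for organizing objectives": it turns preferences that are hard to write as supervised labels into optimizable signals, but whether those signals are reliable, whether credit can be assigned, and whether they invite reward hacking together determine the ceiling of post-training.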
1. Problem: what visual generation lacks is not more losses, but a better way to express objectives
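Supervised training is very good at reproducing the data distribution, but it cannot always express "which result is better." For example, a video may be semantically correct but have unstable motion, and an image may look good but adhere poorly to the prompt. RL's entry point is to write these preferences as rewards or preference signals that can be fed back, so the model makes targeted improvements in the generation space.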
2. Key framework: the interface between reward, policy, and the generation process
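For diffusion and video generation, the policy is not a discrete-action policy in the traditional sense but a step-by-step generation trajectory. Reward design is therefore only the first step; the harder part is assigning the final feedback across space, time, and denoising steps. This is also why routes such as GRPO, DPO, reward-weighted regression, and preference fine-tuning show different stability in visual generation.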
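To make the sample-level part of this interface concrete, here is a minimal sketch of the group-relative normalization commonly associated with GRPO: several outputs are sampled for the same prompt, and each reward is standardized against the group. The function name, the `eps` constant, and the example rewards are assumptions for illustration; the survey does not prescribe this exact form.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    `rewards` is a list of scalar rewards, one per sampled output.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Samples above the group mean get a positive advantage and samples below it get a negative one, so the update depends on relative ranking inside the group rather than on the absolute scale of the reward model.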
3. Key Insights: the bottleneck in post-training is often signal reliability
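Visual rewards are easily unstable: a reward model may favor aesthetic templates, miss local artifacts, or mistake a short-term high score for long-term consistency. So the core of visual RL is not blindly maximizing reward, but judging when a reward is reliable, where it should be applied, and whether it agrees with real human preference. From this angle, a good post-training system looks more like signal engineering than a mere choice of optimizer.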
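One simple way to probe whether a reward agrees with human preference is to measure pairwise agreement on a held-out set of human-labeled comparisons. The sketch below assumes a hypothetical data layout of `(score_a, score_b, human_prefers_a)` tuples; it is an illustration of the idea, not a procedure taken from the survey.

```python
def pairwise_agreement(pairs):
    """Fraction of comparisons where the reward model ranks the pair as humans did.

    `pairs` is a list of (score_a, score_b, human_prefers_a) tuples, where the
    scores come from the reward model and `human_prefers_a` is a bool.
    Ties in the reward scores are skipped. Returns None if nothing is counted.
    """
    agree = 0
    counted = 0
    for score_a, score_b, human_prefers_a in pairs:
        if score_a == score_b:
            continue  # the reward model expresses no preference
        counted += 1
        if (score_a > score_b) == human_prefers_a:
            agree += 1
    if counted == 0:
        return None
    return agree / counted


# Hypothetical held-out comparisons.
rate = pairwise_agreement([(0.9, 0.1, True), (0.2, 0.8, True), (0.3, 0.7, False)])
```

A low agreement rate on a slice of the data (for example, prompts with fine local detail) is a hint that the reward should not be trusted, or should be down-weighted, on that slice.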
English Summary
This survey organizes the rapidly growing intersection between reinforcement learning and visual generative models. The central point is that RL provides a way to optimize goals that are difficult to encode as supervised labels, such as human preference, controllability, temporal consistency, and physical plausibility.
Problem
Visual outputs are structured and ambiguous. A single scalar reward often mixes multiple concerns, and optimizing it naively can produce reward hacking, weak localization of errors, or unstable training.
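As an illustration of how a scalar reward mixes concerns, the sketch below combines named components with weights and also returns the per-component contributions, so a rising total can be traced back to the component driving it. The component names, weights, and function name are hypothetical.

```python
def combine_rewards(components, weights):
    """Weighted sum of reward components, plus each component's contribution.

    `components` maps a name (e.g. "alignment") to its raw score.
    `weights` maps the same names to their mixing weights.
    Returns (total, contributions) so the scalar stays inspectable.
    """
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    contributions = {name: weights[name] * components[name] for name in weights}
    return sum(contributions.values()), contributions


# Hypothetical scores for one generated image.
total, parts = combine_rewards(
    {"alignment": 0.8, "quality": 0.5, "safety": 1.0},
    {"alignment": 0.5, "quality": 0.3, "safety": 0.2},
)
```

Logging `parts` alongside `total` is one inexpensive guard against reward hacking: if the total climbs while one component collapses, the scalar alone would hide it.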
Core Idea
Treat RL as an interface between generative trajectories and feedback signals. The practical challenge is not only designing rewards, but also routing feedback across pixels, frames, denoising steps, and samples.
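A toy sketch of the routing problem along the denoising axis: given one sample-level feedback value, decide how much of it each step receives. One simple baseline reuses the same value at every step; the hypothetical alternative below splits it in proportion to per-step weights. Both the function name and the weighting scheme are assumptions for illustration.

```python
def route_to_steps(advantage, step_weights):
    """Split one sample-level advantage across denoising steps.

    `step_weights` holds one non-negative weight per step; each step receives
    a share of `advantage` proportional to its weight, so the shares sum to
    the original value.
    """
    total = sum(step_weights)
    if total <= 0:
        raise ValueError("step_weights must sum to a positive value")
    return [advantage * w / total for w in step_weights]


# Hypothetical: emphasize the last of three denoising steps.
per_step = route_to_steps(2.0, [1, 1, 2])
```

The same pattern extends to frames or spatial regions by replacing the step weights with per-frame or per-patch weights; where those weights come from is the open design question.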
Practical Takeaways
Strong visual post-training systems need reliable reward modeling, careful credit assignment, and evaluation that goes beyond a single score. The survey is useful as a map of the design space and as a vocabulary for comparing different alignment methods.