FreeLong: An Explainer

TL;DR

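FreeLong addresses the loss of temporal consistency that pretrained video diffusion models commonly show on long videos. Rather than retraining the model, it modifies temporal attention at inference time: SpectralBlend handles low-frequency global structural stability and high-frequency local detail separately, then fuses them. Intuitively, FreeLong lets the model remember "what the whole video is about" when generating longer clips, without losing local texture and motion detail.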

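1. Problem: a long video is not a short video stretched out

Many video diffusion models perform well on short clips, but as duration grows they show subject drift, background jumps, and incoherent motion rhythm. This is because the model's temporal attention was often not designed to preserve structure over longer ranges.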

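2. Core idea: decomposing consistency from a frequency-domain view

Low-frequency information acts like a "global narrative skeleton": it determines whether identity, scene, and large-scale motion stay stable. High-frequency information acts like "local expression": it determines whether textures, edges, and details stay sharp. SpectralBlend Temporal Attention mixes the two in the frequency domain so that each kind of information does its own job instead of interfering with the other inside a single attention.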
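
The frequency-domain mixing can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration, not FreeLong's actual code: the functions `low_pass_mask` and `spectral_blend`, the `cutoff_ratio` parameter, the `(T, C)` feature layout, and the choice of a 1D FFT along the time axis only (the method itself may filter across other dimensions as well). The sketch also assumes two feature streams are already available, one carrying global structure and one carrying local detail.

```python
import numpy as np


def low_pass_mask(num_frames: int, cutoff_ratio: float) -> np.ndarray:
    """Return a 0/1 mask over temporal frequencies (1 = low frequency).

    cutoff_ratio is a hypothetical knob in [0, 1], expressed as a fraction
    of the Nyquist frequency (0.5 cycles per frame).
    """
    if not 0.0 <= cutoff_ratio <= 1.0:
        raise ValueError("cutoff_ratio must lie in [0, 1]")
    # Absolute frequency of each FFT bin, in cycles per frame.
    freqs = np.abs(np.fft.fftfreq(num_frames))
    return (freqs <= cutoff_ratio * 0.5).astype(np.float64)


def spectral_blend(global_feat: np.ndarray,
                   local_feat: np.ndarray,
                   cutoff_ratio: float = 0.25) -> np.ndarray:
    """Blend two (T, C) feature sequences along the time axis.

    Low temporal frequencies come from global_feat; the remaining
    (high) frequencies come from local_feat.
    """
    if global_feat.shape != local_feat.shape:
        raise ValueError("feature shapes must match")
    num_frames = global_feat.shape[0]
    mask = low_pass_mask(num_frames, cutoff_ratio)[:, None]  # shape (T, 1)

    global_spec = np.fft.fft(global_feat, axis=0)
    local_spec = np.fft.fft(local_feat, axis=0)
    blended_spec = mask * global_spec + (1.0 - mask) * local_spec

    # The mask is symmetric in +/- frequency, so for real inputs the inverse
    # transform is real up to floating-point error.
    return np.fft.ifft(blended_spec, axis=0).real


if __name__ == "__main__":
    T = 16
    slow = np.full((T, 1), 3.0)                          # constant: pure low frequency
    fast = np.tile(np.array([1.0, -1.0]), T // 2)[:, None]  # alternating: pure high frequency
    print(spectral_blend(slow, fast)[:4, 0])
```

In this toy setup the global stream contributes only its slow component and the local stream only its fast component, so the blend keeps the stable baseline of one and the frame-to-frame variation of the other. A hard 0/1 mask is the simplest choice; a smooth roll-off would be an equally plausible variant.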

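3. Key Insights: inference-time editing can also change generation behavior

FreeLong's value lies not only in being training-free, but also in showing that a pretrained model already has some long-range capability, which only needs more suitable attention routing to be released. Methods like this are useful for fast iteration: without expensive training, one can still probe how malleable the model's internal temporal structure is.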

English Summary

FreeLong is a training-free method for extending pretrained video diffusion models to longer videos. It focuses on temporal attention rather than retraining the base model.

Problem

Short-video models often degrade on longer sequences, showing identity drift, inconsistent backgrounds, and broken motion rhythm.

Core Idea

Use SpectralBlend Temporal Attention to separate low-frequency global structure from high-frequency local details, then blend them during inference.
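
One way to write this blend down, as a hedged formalization rather than the paper's exact notation: let $Z_{\mathrm{global}}$ and $Z_{\mathrm{local}}$ be features carrying global structure and local detail, $\mathcal{F}$ a Fourier transform over the video features, and $P$ a low-pass mask with entries in $[0, 1]$. All of these symbols are assumptions introduced here for illustration.

```latex
\hat{Z} \;=\; \mathcal{F}^{-1}\!\Big(
    P \odot \mathcal{F}\big(Z_{\mathrm{global}}\big)
    \;+\; (1 - P) \odot \mathcal{F}\big(Z_{\mathrm{local}}\big)
\Big)
```

Because $P$ and $1 - P$ sum to one at every frequency, setting $Z_{\mathrm{global}} = Z_{\mathrm{local}}$ returns the input unchanged; the blend only matters where the two feature sets disagree.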

Practical Takeaways

Some long-horizon capability can be unlocked by routing attention differently at inference time. This makes FreeLong useful when retraining is expensive or unavailable.

Links

arXiv PDF Project

FreeLong: Training-Free Long Video Generation