Few-shot font generation (FFG) aims to create new font images by imitating the style of a limited set of reference images while preserving the content of the source images. Although significant progress has been made on this task, most existing methods still struggle to render complicated character structures correctly and to reproduce fine-grained font styles. To address these issues, we regard font generation as a font transfer process from the source font to the target font and construct a video generation framework to model this process. In addition, we develop a test-time condition alignment mechanism to enhance the consistency between the generated samples and the provided condition samples. Specifically, we first construct a diffusion-based image-to-image framework for few-shot font generation. We then extend this framework to image-to-video font generation by integrating temporal components and frame-index information, enabling the production of high-quality font videos that transition from the source font to the target font. On top of this framework, we develop a noise inversion mechanism in the generative process that aligns the content and style of the generated samples with those of the provided condition samples, improving style consistency and structural accuracy. Experimental results show that our model achieves superior performance on FFG tasks, demonstrating the effectiveness of our method. We will release our code after publication.
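The noise inversion mechanism is only characterized at a high level here. For orientation, the sketch below shows standard DDIM inversion, a common way to deterministically map a sample back into the diffusion noise space so that subsequent generation can be re-anchored to condition samples; the paper's actual alignment procedure may differ. All names in the sketch (`ddim_invert`, `eps_model`, `alpha_bar`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ddim_invert(x0, eps_model, alpha_bar):
    """Minimal DDIM inversion sketch (an assumed stand-in for the paper's
    noise inversion step). Runs the deterministic DDIM update in reverse,
    stepping from the clean sample toward higher noise levels; re-running
    the forward sampler from the returned latent reconstructs x0."""
    x = x0
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]  # cumulative alphas, decreasing in t
        eps = eps_model(x, t)                         # model's noise prediction at step t
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x  # approximate x_T whose denoising trajectory passes through x0

# Toy usage: a zero noise predictor stands in for a trained denoiser.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
eps_model = lambda x, t: np.zeros_like(x)
x_T = ddim_invert(np.random.randn(64, 64), eps_model, alpha_bar)
```

In a test-time alignment setting, inverting a generated frame this way yields a noise latent that can be adjusted toward the latents of the reference samples before re-sampling, which is one plausible reading of the content and style alignment described above.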