E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation
Yifan Gong1,2     Zheng Zhan2     Qing Jin1     Yanyu Li1,2     Yerlan Idelbayev1    Xian Liu1    Andrey Zharkov1    Kfir Aberman1    Sergey Tulyakov1    Yanzhi Wang2    Jian Ren1
1Snap Inc.    2Northeastern University   
Overview of E2GAN. Left: Training Comparison. Conventional GAN training, such as pix2pix [1] and pix2pix-zero-distilled, which distills Co-Mod-GAN [2] using data from pix2pix-zero [3], requires training all weights from scratch, while our efficient training significantly reduces the training cost by fine-tuning only 1% of the weights with only a portion of the training data. Right: Mobile Inference Comparison. Our efficient on-device model achieves real-time runtime (30 FPS on iPhone 14) and is faster than pix2pix and the diffusion model, while the pix2pix-zero-distilled model (Co-Mod-GAN) is not supported on device.
Demo Video
We present a short demo video that gives a quick overview of our motivations, framework design, and visualization results.
Abstract
One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models, such as Stable Diffusion, to generate paired datasets for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with diffusion models. However, unlike text-to-image diffusion models, each distilled GAN is specialized for a specific image editing task, necessitating costly training efforts to obtain models for various concepts. In this work, we introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient? To achieve this goal, we propose a series of innovative techniques. First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch. Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model. Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time. Extensive experiments show that we can efficiently empower GANs with the ability to perform real-time, high-quality image editing on mobile devices with remarkably reduced training cost and storage for each concept.
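As a rough illustration of the second technique, the PyTorch sketch below shows one way LoRA adapters could be attached to selected convolutional layers of a frozen base generator. The wrapper class, target-layer selection, rank, and scaling here are illustrative assumptions, not the exact configuration used in E2GAN.

# Minimal sketch (PyTorch): attach trainable LoRA adapters to chosen Conv2d layers
# of a frozen base generator. Layer names, rank, and scaling are illustrative
# assumptions, not E2GAN's exact configuration.
import torch.nn as nn

class LoRAConv2d(nn.Module):
    """A frozen Conv2d plus a trainable low-rank residual branch."""
    def __init__(self, conv: nn.Conv2d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False                      # base weights stay frozen
        self.down = nn.Conv2d(conv.in_channels, rank, conv.kernel_size,
                              stride=conv.stride, padding=conv.padding, bias=False)
        self.up = nn.Conv2d(rank, conv.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)                   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.conv(x) + self.scale * self.up(self.down(x))

def add_lora(generator: nn.Module, target_names, rank: int = 4):
    """Wrap the named Conv2d sub-modules of `generator` with LoRA adapters."""
    for name, module in list(generator.named_modules()):
        if name in target_names and isinstance(module, nn.Conv2d):
            parent = generator.get_submodule(name.rsplit(".", 1)[0]) if "." in name else generator
            setattr(parent, name.rsplit(".", 1)[-1], LoRAConv2d(module, rank=rank))
    return generator

Only the down/up adapter weights require gradients, which is what keeps the per-concept training and storage cost small.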
Model Architecture Overview
Overview of the E2GAN model architecture. The generator is composed of down/up-sampling layers, 3 ResNet blocks, and 1 Transformer block. The base generator is trained on multiple representative concepts; new concepts are obtained by fine-tuning LoRA parameters on crucial layers. A code sketch of this layout is given below.
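For readers who prefer code, here is a minimal PyTorch sketch of a generator with this overall layout (down-sampling convolutions, 3 ResNet blocks, 1 Transformer block, up-sampling). Channel widths, normalization, and attention settings are placeholder assumptions, not the exact E2GAN design.

# Rough sketch (PyTorch) of a generator in the style described above. All
# hyper-parameters below are illustrative assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )
    def forward(self, x):
        return x + self.body(x)

class TransformerBlock(nn.Module):
    """Self-attention over spatial tokens followed by an MLP."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(ch), nn.LayerNorm(ch)
        self.mlp = nn.Sequential(nn.Linear(ch, 4 * ch), nn.GELU(), nn.Linear(4 * ch, ch))
    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, H*W, C) tokens
        t = t + self.attn(self.norm1(t), self.norm1(t), self.norm1(t))[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.down = nn.Sequential(                # down-sampling layers
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base, 2 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * base, 4 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(              # 3 ResNet blocks + 1 Transformer block
            ResBlock(4 * base), ResBlock(4 * base), ResBlock(4 * base),
            TransformerBlock(4 * base),
        )
        self.up = nn.Sequential(                  # up-sampling layers
            nn.ConvTranspose2d(4 * base, 2 * base, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(2 * base, base, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, in_ch, 7, padding=3), nn.Tanh(),
        )
    def forward(self, x):
        return self.up(self.blocks(self.down(x)))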
Qualitative Comparisons on Various Tasks
Quantitative Results
FID comparison. FID is calculated between the images generated by GAN-based approaches and those generated by diffusion models. The reported FID is averaged across different concepts (30 for FFHQ and 10 for Flicker Scenery).
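As a hedged sketch of how such a concept-averaged FID could be reproduced, the snippet below assumes the clean-fid package and a directory layout with one sub-folder per concept under both the GAN and diffusion-model output roots; the folder layout and names are illustrative assumptions.

# Sketch: average FID across concepts, assuming the `clean-fid` package
# (pip install clean-fid) and one sub-folder per concept under each root.
import os
from cleanfid import fid

def average_fid(gan_root: str, diffusion_root: str) -> float:
    concepts = sorted(os.listdir(gan_root))
    scores = [fid.compute_fid(os.path.join(gan_root, c),
                              os.path.join(diffusion_root, c))
              for c in concepts]
    return sum(scores) / len(scores)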

Analysis (FID) of various base models on FFHQ. Left: Training FLOPs. Middle: Training time. Right: Number of parameters requiring gradient updates, which also equals the number of weights that need to be saved for a concept.
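The right-hand quantity is straightforward to measure for any PyTorch model; a one-line sketch, assuming a variable `model` that holds the LoRA-augmented generator, is:

# Count parameters that require gradient updates, i.e. the weights that must be
# stored per concept (`model` is assumed to be the LoRA-augmented generator).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)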

Left: Analysis (FID) of various base models on FFHQ. Right: Analysis of the LoRA rank search on the Flicker Scenery dataset. The reported FID values are averaged over 10 different target concepts. A generic sketch of such a rank search is shown below.
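A rank search of this kind can be sketched generically: fine-tune at a few candidate ranks and keep the smallest rank whose FID stays close to the best observed. The candidate set, tolerance, and the `finetune_and_eval` callable below are hypothetical placeholders, not the paper's exact search procedure.

# Hedged sketch of a simple LoRA rank search. `finetune_and_eval` is a
# hypothetical callable that fine-tunes LoRA at a given rank and returns the FID;
# the candidate ranks and tolerance are illustrative assumptions.
from typing import Callable, Iterable

def search_rank(finetune_and_eval: Callable[[int], float],
                candidates: Iterable[int] = (1, 2, 4, 8, 16),
                tolerance: float = 1.0) -> int:
    results = {r: finetune_and_eval(r) for r in candidates}
    best = min(results.values())
    # smallest rank whose FID is within `tolerance` of the best FID observed
    return min(r for r, score in results.items() if score <= best + tolerance)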

References

[1] Image-to-Image Translation with Conditional Adversarial Networks

[2] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

[3] Zero-shot Image-to-Image Translation

BibTeX
@article{gong20242,
  title={E$^2$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation},
  author={Gong, Yifan and Zhan, Zheng and Jin, Qing and Li, Yanyu and Idelbayev, Yerlan and Liu, Xian and Zharkov, Andrey and Aberman, Kfir and Tulyakov, Sergey and Wang, Yanzhi and others},
  journal={arXiv preprint arXiv:2401.06127},
  year={2024}
}