E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation
Yifan Gong1,2     Zheng Zhan2     Qing Jin1     Yanyu Li1,2     Yerlan Idelbayev1    Xian Liu1    Andrey Zharkov1    Kfir Aberman1    Sergey Tulyakov1    Yanzhi Wang2    Jian Ren1
1Snap Inc.    2Northeastern University   
Overview of E2GAN. Left: Training Comparison. Conventional GAN training, such as pix2pix [1] and pix2pix-zero-distilled, which distills Co-Mod-GAN [2] using data from pix2pix-zero [3], requires training all weights from scratch, while our efficient training significantly reduces the training cost by fine-tuning only 1% of the weights with only a portion of the training data. Right: Mobile Inference Comparison. Our efficient on-device model achieves real-time runtime (30 FPS on iPhone 14) and is faster than pix2pix and the diffusion model, while the pix2pix-zero-distilled model (Co-Mod-GAN) is not supported on device.
Demo Video
We present a short demo video that gives a quick overview of our motivations, framework design, and visualization results.
Abstract
One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models, such as Stable Diffusion, to generate paired datasets for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with diffusion models. However, unlike text-to-image diffusion models, each distilled GAN is specialized for a specific image editing task, necessitating costly training efforts to obtain models for various concepts. In this work, we introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient? To achieve this goal, we propose a series of innovative techniques. First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch. Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model. Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time. Extensive experiments show that we can efficiently empower GANs with the ability to perform real-time, high-quality image editing on mobile devices with remarkably reduced training cost and storage for each concept.
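As a rough illustration of the second technique, the PyTorch sketch below shows one way LoRA adapters could be attached to selected convolutional layers of a frozen base generator. The wrapper class, target-layer selection, rank, and scaling here are illustrative assumptions, not the exact configuration used in E2GAN.

# Minimal sketch (PyTorch): attach trainable LoRA adapters to chosen Conv2d layers
# of a frozen base generator. Layer names, rank, and scaling are illustrative
# assumptions, not E2GAN's exact configuration.
import torch.nn as nn

class LoRAConv2d(nn.Module):
    """A frozen Conv2d plus a trainable low-rank residual branch."""
    def __init__(self, conv: nn.Conv2d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False                      # base weights stay frozen
        self.down = nn.Conv2d(conv.in_channels, rank, conv.kernel_size,
                              stride=conv.stride, padding=conv.padding, bias=False)
        self.up = nn.Conv2d(rank, conv.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)                   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.conv(x) + self.scale * self.up(self.down(x))

def add_lora(generator: nn.Module, target_names, rank: int = 4):
    """Wrap the named Conv2d sub-modules of `generator` with LoRA adapters."""
    for name, module in list(generator.named_modules()):
        if name in target_names and isinstance(module, nn.Conv2d):
            parent = generator.get_submodule(name.rsplit(".", 1)[0]) if "." in name else generator
            setattr(parent, name.rsplit(".", 1)[-1], LoRAConv2d(module, rank=rank))
    return generator

Only the down/up adapter weights require gradients, which is what keeps the per-concept training and storage cost small.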
Model Architecture Overview
Overview of the E2GAN model architecture. The generator is composed of down/up-sampling layers, 3 ResNet blocks, and 1 Transformer block. The base generator is trained on multiple representative concepts; new concepts are obtained by fine-tuning LoRA parameters on crucial layers. A code sketch of this layout is given below.
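For readers who prefer code, here is a minimal PyTorch sketch of a generator with this overall layout (down-sampling convolutions, 3 ResNet blocks, 1 Transformer block, up-sampling). Channel widths, normalization, and attention settings are placeholder assumptions, not the exact E2GAN design.

# Rough sketch (PyTorch) of a generator in the style described above. All
# hyper-parameters below are illustrative assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )
    def forward(self, x):
        return x + self.body(x)

class TransformerBlock(nn.Module):
    """Self-attention over spatial tokens followed by an MLP."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(ch), nn.LayerNorm(ch)
        self.mlp = nn.Sequential(nn.Linear(ch, 4 * ch), nn.GELU(), nn.Linear(4 * ch, ch))
    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, H*W, C) tokens
        t = t + self.attn(self.norm1(t), self.norm1(t), self.norm1(t))[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.down = nn.Sequential(                # down-sampling layers
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base, 2 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * base, 4 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(              # 3 ResNet blocks + 1 Transformer block
            ResBlock(4 * base), ResBlock(4 * base), ResBlock(4 * base),
            TransformerBlock(4 * base),
        )
        self.up = nn.Sequential(                  # up-sampling layers
            nn.ConvTranspose2d(4 * base, 2 * base, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(2 * base, base, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, in_ch, 7, padding=3), nn.Tanh(),
        )
    def forward(self, x):
        return self.up(self.blocks(self.down(x)))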
Qualitative Comparisons on Various Tasks
Quantitative Results
FID comparison. FID is calculated between the images generated by GAN-based approaches and those generated by diffusion models. The reported FID is averaged across different concepts (30 for FFHQ and 10 for Flicker Scenery).
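As a hedged sketch of how such a concept-averaged FID could be reproduced, the snippet below assumes the clean-fid package and a directory layout with one sub-folder per concept under both the GAN and diffusion-model output roots; the folder layout and names are illustrative assumptions.

# Sketch: average FID across concepts, assuming the `clean-fid` package
# (pip install clean-fid) and one sub-folder per concept under each root.
import os
from cleanfid import fid

def average_fid(gan_root: str, diffusion_root: str) -> float:
    concepts = sorted(os.listdir(gan_root))
    scores = [fid.compute_fid(os.path.join(gan_root, c),
                              os.path.join(diffusion_root, c))
              for c in concepts]
    return sum(scores) / len(scores)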

Analysis (FID) of various base models on FFHQ. Left: Training FLOPs. Middle: Training time. Right: Number of parameters requiring gradient updates, which also equals the number of weights that need to be saved for a concept.
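The right-hand quantity is straightforward to measure for any PyTorch model; a one-line sketch, assuming a variable `model` that holds the LoRA-augmented generator, is:

# Count parameters that require gradient updates, i.e. the weights that must be
# stored per concept (`model` is assumed to be the LoRA-augmented generator).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)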

Left: Analysis (FID) of various base models on FFHQ. Right: Analysis of the LoRA rank search on the Flicker Scenery dataset. The reported FID values are averaged over 10 different target concepts. A generic sketch of such a rank search is shown below.
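A rank search of this kind can be sketched generically: fine-tune at a few candidate ranks and keep the smallest rank whose FID stays close to the best observed. The candidate set, tolerance, and the `finetune_and_eval` callable below are hypothetical placeholders, not the paper's exact search procedure.

# Hedged sketch of a simple LoRA rank search. `finetune_and_eval` is a
# hypothetical callable that fine-tunes LoRA at a given rank and returns the FID;
# the candidate ranks and tolerance are illustrative assumptions.
from typing import Callable, Iterable

def search_rank(finetune_and_eval: Callable[[int], float],
                candidates: Iterable[int] = (1, 2, 4, 8, 16),
                tolerance: float = 1.0) -> int:
    results = {r: finetune_and_eval(r) for r in candidates}
    best = min(results.values())
    # smallest rank whose FID is within `tolerance` of the best FID observed
    return min(r for r, score in results.items() if score <= best + tolerance)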

References

[1] Image-to-Image Translation with Conditional Adversarial Networks

[2] Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

[3] Zero-shot Image-to-Image Translation

BibTeX
@article{gong20242,
  title={E$^2$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation},
  author={Gong, Yifan and Zhan, Zheng and Jin, Qing and Li, Yanyu and Idelbayev, Yerlan and Liu, Xian and Zharkov, Andrey and Aberman, Kfir and Tulyakov, Sergey and Wang, Yanzhi and others},
  journal={arXiv preprint arXiv:2401.06127},
  year={2024}
}