Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Carnegie Mellon University, Google DeepMind

AlignProp is a direct backpropagation-based approach to finetune text-to-image diffusion models for a desired reward function. Above, we show finetuning results for various reward functions.

Abstract

Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to the unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, which is notorious for the high variance of its gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While a naive implementation of such backpropagation would require prohibitive memory for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing to render its memory usage viable. We test AlignProp on finetuning diffusion models for various objectives, such as image-text semantic alignment, aesthetics, compressibility, and controllability of the number of objects present, as well as their combinations. We show that AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest.

AlignProp

AlignProp method.

Given a batch of prompts, AlignProp generates images from noise through DDIM sampling. The generated images are then scored by a reward model, and the diffusion model's weights are updated by minimizing the negative of the obtained reward via gradient descent. To mitigate overfitting, we randomize the number of time-steps through which we backpropagate gradients.
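Below is a minimal sketch of one such training step, assuming the Hugging Face diffusers library (StableDiffusionPipeline, DDIMScheduler) and a differentiable reward_fn that maps decoded images to scalar rewards. reward_fn, trainable_params (the already-injected LoRA adapter parameters), and the hyperparameters are placeholders rather than the exact released implementation; classifier-free guidance is omitted for brevity.

import random
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.unet.enable_gradient_checkpointing()        # trade compute for memory
pipe.vae.requires_grad_(False)                   # frozen; gradients still flow through activations
pipe.text_encoder.requires_grad_(False)
# Only the low-rank (LoRA) adapter weights are trained; `trainable_params` is
# assumed to already hold those parameters.
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

def alignprop_step(prompts, reward_fn, num_steps=50):
    tokens = pipe.tokenizer(prompts, padding="max_length", truncation=True,
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt")
    text_emb = pipe.text_encoder(tokens.input_ids)[0]
    latents = torch.randn(len(prompts), pipe.unet.config.in_channels, 64, 64)

    pipe.scheduler.set_timesteps(num_steps)
    # Randomized truncated backpropagation: gradients flow only through the
    # last K denoising steps, which mitigates overfitting to the reward.
    K = random.randint(1, num_steps)
    for i, t in enumerate(pipe.scheduler.timesteps):
        with torch.set_grad_enabled(i >= num_steps - K):
            noise_pred = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
            latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    loss = -reward_fn(images).mean()             # minimize the negative reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

A full run simply repeats this step over batches of sampled prompts.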

Sample and Data Efficiency in Reward Finetuning

DDPO is the current state-of-the-art method for finetuning text-to-image models with reinforcement learning. However, it uses REINFORCE to finetune the diffusion process. RL methods such as REINFORCE are notorious for the high variance of their gradient estimates and thus often suffer from poor sample efficiency. We, on the other hand, backpropagate gradients from the reward model directly into the diffusion process. This results in a significant boost in convergence speed and sample efficiency. In the top half of the figure, we compare the data efficiency of AlignProp and DDPO across various reward models; in the bottom half, we compare their convergence speed. As can be seen, AlignProp trains about 25 times faster than DDPO. For instance, on the HPS v2 reward, AlignProp achieves a score of 2.8 in just 48 minutes, whereas DDPO requires approximately 23 hours.
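To make the variance gap concrete, the toy comparison below (not from the paper) estimates the same gradient of an expected reward over a Gaussian sample in two ways: the score-function (REINFORCE-style) estimator that RL finetuning relies on, and the pathwise estimator that direct reward backpropagation corresponds to. Both are unbiased, but the score-function samples are far noisier.

import torch

# Toy example: estimate d/dmu E[r(x)] for x ~ N(mu, 1) with r(x) = -(x - 3)^2.
# The true gradient at mu = 0 is 2 * (3 - mu) = 6.
torch.manual_seed(0)
mu, n = 0.0, 10_000
x = mu + torch.randn(n)

pathwise = -2.0 * (x - 3.0)              # dr/dx * dx/dmu (reparameterized / backprop)
score_fn = -((x - 3.0) ** 2) * (x - mu)  # r(x) * d log p(x; mu) / dmu (REINFORCE)

for name, est in [("pathwise", pathwise), ("score function", score_fn)]:
    print(f"{name:>14}: mean {est.mean():.2f}, per-sample std {est.std():.1f}")
# Both means are close to 6, but the score-function samples have a much larger
# standard deviation, which is why REINFORCE-style finetuning needs many more samples.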

Sample efficiency (top) and convergence speed (bottom) of AlignProp versus DDPO across reward functions.

Interactive Mixing Results

Here we demonstrate the ability of AlignProp to interpolate between different reward functions at inference time. We draw inspiration from ModelSoup, which shows that averaging the weights of multiple fine-tuned models can enhance image classification accuracy. Expanding upon this idea, we extend it to the domain of image editing and find that averaging the LoRA weights of diffusion models trained with distinct reward functions yields images that satisfy multiple reward criteria. AlignProp interpolates smoothly between distinct reward functions, achieving the highest overall reward when the mixing coefficient is set to 0.5. Please move the slider to visualize results for different mixing coefficients.
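A minimal sketch of this weight mixing, assuming two AlignProp-finetuned LoRA state dicts saved to disk (the file names below are placeholders):

import torch

aesthetic_lora = torch.load("lora_aesthetic.pt")      # hypothetical checkpoint paths
compression_lora = torch.load("lora_compression.pt")

def mix_loras(lora_a, lora_b, alpha=0.5):
    # Convex combination alpha * lora_a + (1 - alpha) * lora_b, key by key.
    return {k: alpha * lora_a[k] + (1.0 - alpha) * lora_b[k] for k in lora_a}

mixed = mix_loras(aesthetic_lora, compression_lora, alpha=0.5)
# The mixed weights are then loaded into the diffusion model's LoRA modules
# before sampling; sweeping alpha from 0 to 1 trades off the two rewards.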

Qualitative panels: Aesthetic Model, Hybrid Model, Compression Model.

Hybrid Model - Mixing Coefficient (α)

BibTeX

@misc{prabhudesai2023aligning,
      title={Aligning Text-to-Image Diffusion Models with Reward Backpropagation},
      author={Mihir Prabhudesai and Anirudh Goyal and Deepak Pathak and Katerina Fragkiadaki},
      year={2023},
      eprint={2310.03739},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}