As part of the development process for our NovelAI Diffusion image generation models, we modified the model architecture of Stable Diffusion and its training process.
These changes improved the overall quality of generations and user experience and better suited our use case of enhancing storytelling through image generation.
In this blog post, we’d like to give a technical overview of some of the modifications and additions we performed.
Stable Diffusion uses the final hidden states of CLIP's transformer-based text encoder to guide generations using classifier-free guidance.
In Imagen (Saharia et al., 2022), the penultimate layer's hidden states are used for guidance instead of the final layer's.
Discussions on the EleutherAI Discord also indicated that the penultimate layer might give superior results for guidance: the hidden state values change abruptly in the last layer, which prepares them to be condensed into the smaller vector usually used for CLIP-based similarity search.
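As a rough sketch of what this looks like in practice, the snippet below pulls the penultimate hidden states out of a CLIP text encoder via HuggingFace transformers. The tiny randomly initialized config is purely illustrative so the example runs without downloading weights; a real setup would load a pretrained checkpoint such as `openai/clip-vit-large-patch14`, and whether to still apply the final LayerNorm after skipping the last block is a design choice, not something prescribed here.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny illustrative config (NOT the real CLIP sizes) so this runs
# without downloading pretrained weights.
config = CLIPTextConfig(
    vocab_size=1000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=77,
)
model = CLIPTextModel(config).eval()

# Dummy token ids standing in for a tokenized prompt.
input_ids = torch.randint(0, config.vocab_size, (1, 77))
with torch.no_grad():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# hidden_states holds the embedding output plus one entry per layer:
# [-1] is the final layer, [-2] the penultimate one.
penultimate = outputs.hidden_states[-2]

# CLIP normally applies a final LayerNorm after the last block; a common
# choice when skipping that block is to still apply the norm here.
conditioning = model.text_model.final_layer_norm(penultimate)
print(conditioning.shape)  # (batch, sequence length, hidden size)
```

The `conditioning` tensor then takes the place of the final-layer hidden states as the cross-attention input during the diffusion model's denoising steps.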