EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling


1. Archimedes/Athena RC | 2. valeo.ai | 3. National Technical University of Athens | 4. University of Crete | 5. IACM-Forth
[Teaser figure]

Latent space structure (left): Top three principal components of SD-VAE and SDXL-VAE, with and without our method, demonstrating visually that our regularization produces smoother latent representations without compromising reconstruction (see table). Accelerated training (right): Training curves (without classifier-free guidance) for DiT-XL/2 and REPA (with SiT-XL/2).
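For readers who want to reproduce this kind of visualization, below is a minimal NumPy sketch of one common way to render a latent code's top three principal components as an RGB map. The page does not spell out the exact procedure used for the figure, so the function name `pca_rgb` and all details are illustrative assumptions.

```python
import numpy as np

def pca_rgb(latent):
    """Render a (C, H, W) latent code as an RGB image via PCA.

    Each spatial position is treated as a C-dimensional sample; the
    samples are projected onto their top three principal components,
    which are then normalized to [0, 1] and used as RGB channels.
    """
    c, h, w = latent.shape
    flat = latent.reshape(c, -1).T                  # (H*W, C) samples
    flat = flat - flat.mean(axis=0, keepdims=True)  # center over positions
    # Right-singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                          # project onto top-3 PCs
    lo, span = proj.min(axis=0), np.ptp(proj, axis=0)
    return ((proj - lo) / (span + 1e-8)).reshape(h, w, 3)
```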

Abstract

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By fine-tuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA, and MaskGIT, achieving a ×7 speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models.


Latent representations with EQ-VAE

We demonstrate the effect of our regularization by illustrating the latent representations of consecutive video frames.

Method

Motivation: Our work is motivated by a key observation: state-of-the-art autoencoders, such as SD-VAE, produce latent representations that are not equivariant under basic spatial transformations like scaling and rotation. To test this, we apply scaling and rotation directly to the latent code and evaluate the corresponding reconstructions. While autoencoders reconstruct images accurately when transformations are applied to the input (i.e., \( \mathcal{D}(\mathcal{E}(\tau \circ \mathbf{x})) \)), applying transformations directly to the latent representation (i.e., \( \mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x})) \)) leads to significant degradation in reconstruction quality:

[Figure: reconstructions from transformed inputs vs. transformed latents]
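The probe itself is easy to replicate. Below is a minimal sketch using the `diffusers` `AutoencoderKL` (the checkpoint and helper names are illustrative, not the paper's code): it compares decoding after transforming the input, \(\mathcal{D}(\mathcal{E}(\tau \circ \mathbf{x}))\), against decoding after transforming the latent, \(\mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x}))\), for a 90° rotation.

```python
import torch
from diffusers import AutoencoderKL

# Load a pretrained SD-VAE checkpoint (any AutoencoderKL works the same way).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

def rot90(t):
    # 90-degree rotation over the spatial dims (H, W).
    return torch.rot90(t, k=1, dims=(-2, -1))

@torch.no_grad()
def probe_equivariance(x):
    # (a) Transform the input, then encode and decode: D(E(tau o x)).
    recon_input = vae.decode(vae.encode(rot90(x)).latent_dist.mode()).sample
    # (b) Encode, transform the latent, then decode: D(tau o E(x)).
    recon_latent = vae.decode(rot90(vae.encode(x).latent_dist.mode())).sample
    return recon_input, recon_latent

x = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in image batch in [-1, 1]
a, b = probe_equivariance(x)
# Without EQ-VAE, (b) is visibly degraded relative to (a).
```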

Equivariance Regularization: To overcome this limitation, we propose a regularization objective that aligns the reconstructions of transformed latent representations (\(\mathcal{D}\big( \tau \circ \mathcal{E}(\mathbf{x}) \big)\)) with the corresponding transformed inputs (\(\tau \circ \mathbf{x}\)). Specifically, we modify the original VAE training objective \(\mathcal{L}_{\text{VAE}}\) as follows:

\[ \mathcal{L}_{\text{EQ-VAE}} (\mathbf{x}, \textcolor[RGB]{225,0,100}{\tau}) = \mathcal{L}_{rec}\Big( \textcolor[RGB]{225,0,100}{\tau \circ}\, \mathbf{x},\; \mathcal{D}\big( \textcolor[RGB]{225,0,100}{\tau \circ}\, \mathcal{E}(\mathbf{x}) \big) \Big) + \lambda_{gan}\,\mathcal{L}_{gan}\Big( \mathcal{D}\big( \textcolor[RGB]{225,0,100}{\tau \circ}\, \mathcal{E}(\mathbf{x}) \big) \Big) + \lambda_{reg}\,\mathcal{L}_{reg} \]

Notice that when \(\textcolor[RGB]{225,0,100}{\tau}\) is the identity transformation, this formulation reduces to the original SD-VAE objective. We focus on two types of spatial transformations: anisotropic scaling and rotation, parameterized as

\[ \mathbf{S}(s_x, s_y) = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}, \quad \mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \]

The final transformation is the composition of scaling and rotation: \(\tau = \mathbf{S}(s_x, s_y) \cdot \mathbf{R}(\theta)\). We sample \(s_x, s_y\) uniformly from \((0.25, 1)\) and \(\theta\) from \(\{\tfrac{\pi}{2}, \pi, \tfrac{3\pi}{2}\}\).
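As a concrete illustration, here is a minimal PyTorch sketch of one EQ-VAE training step. It omits the GAN and perceptual terms, uses an L1 reconstruction loss and a small KL weight as placeholder choices, and restricts rotations to quarter-turns so they can be applied exactly with `torch.rot90`; the helper names and the downsampling factor `f` are assumptions, not the paper's exact implementation.

```python
import random
import torch
import torch.nn.functional as F

def eq_vae_step(vae, x, lam_kl=1e-6, f=8):
    """One EQ-VAE training step (sketch; GAN/perceptual terms omitted).

    vae: a diffusers-style AutoencoderKL with encode()/decode().
    x:   image batch (B, 3, H, W) in [-1, 1].
    f:   the autoencoder's spatial downsampling factor (8 for SD-VAE).
    """
    # Sample tau: anisotropic scales s_x, s_y ~ U(0.25, 1) and a rotation
    # theta in {pi/2, pi, 3*pi/2}, i.e. k in {1, 2, 3} quarter-turns.
    sx, sy = random.uniform(0.25, 1.0), random.uniform(0.25, 1.0)
    k = random.choice([1, 2, 3])

    posterior = vae.encode(x).latent_dist
    z = posterior.sample()

    # Apply the same tau to the latent and to the image. The target size is
    # chosen on the latent grid and scaled by f for the image, so that the
    # decoder output and tau(x) land at matching resolutions.
    z_rot = torch.rot90(z, k=k, dims=(-2, -1))
    x_rot = torch.rot90(x, k=k, dims=(-2, -1))
    h = max(int(z_rot.shape[-2] * sy), 1)
    w = max(int(z_rot.shape[-1] * sx), 1)
    z_t = F.interpolate(z_rot, size=(h, w), mode="bilinear", align_corners=False)
    x_t = F.interpolate(x_rot, size=(f * h, f * w), mode="bilinear", align_corners=False)

    recon = vae.decode(z_t).sample      # D(tau o E(x))
    rec_loss = F.l1_loss(recon, x_t)    # L_rec(tau o x, D(tau o E(x)))
    kl_loss = posterior.kl().mean()     # L_reg: the KL term of SD-VAE
    return rec_loss + lam_kl * kl_loss
```

In practice one would also draw the identity transformation with some probability, so the plain reconstruction objective is still optimized alongside the equivariance term.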

Results

We demonstrate images generated by two DiT-XL/2 models at 50K, 100K, and 400K training iterations: one trained on the latent distribution of the standard SD-VAE and one on our regularized EQ-VAE. Both models share the same noise and number of sampling steps. The model trained on EQ-VAE converges significantly faster.

[Figure: DiT-XL/2 samples at 50K, 100K, and 400K iterations, SD-VAE vs. EQ-VAE]

(Table, left) Our regularization seamlessly adapts to both continuous and discrete autoencoders. Fine-tuning pretrained autoencoders with EQ-VAE reduces the equivariance error under spatial transformations, enhancing generative performance (gFID) while maintaining the prior reconstruction capability (rFID). (Table, right) State-of-the-art diffusion transformer models enjoy a significant performance boost when trained on the latent representations of EQ-VAE.

[Table: equivariance error, rFID, and gFID for continuous and discrete autoencoders]
[Table: gFID of state-of-the-art generative models trained on EQ-VAE latents]

Analysis

[Figure: DiT-B/2 gFID vs. epochs of EQ-VAE fine-tuning]
How fast is EQ-VAE regularization? We fine-tune SD-VAE with EQ-VAE and, after each epoch, train a DiT-B/2 model on the resulting latent distribution; the results are shown in the figure above. Even with a few epochs of fine-tuning with EQ-VAE, the gFID drops significantly, highlighting the rapid refinement our objective achieves in the latent manifold.



[Figure: intrinsic dimension of the latent manifold vs. generative performance]
Latent space complexity and generative performance. We observe a correlation between the intrinsic dimension (ID) of the latent manifold and the resulting generative performance. This suggests that the regularized latent space becomes simpler to model.
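Intrinsic dimension can be estimated from nearest-neighbor statistics of the latent codes. The page does not state which estimator is used, so the sketch below uses TwoNN (Facco et al., 2017) as one standard choice; the function name and preprocessing are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(points):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017).

    points: (N, D) array, e.g. flattened latent codes E(x).
    For each point, mu = r2 / r1 is the ratio of the distances to its
    two nearest neighbors; under the TwoNN model mu is Pareto(d), so
    the maximum-likelihood estimate is d = N / sum(log mu).
    """
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=3)   # self, 1st and 2nd neighbors
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / np.maximum(r1, 1e-12)
    return len(points) / np.sum(np.log(np.maximum(mu, 1.0 + 1e-12)))

# Example (hypothetical): flatten latents from SD-VAE and from EQ-VAE on the
# same images, e.g. z = vae.encode(x).latent_dist.mode().flatten(1), move
# them to NumPy, and compare twonn_id(z) for the two autoencoders.
```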

Cite Us

EQ-VAE

@inproceedings{eqvae,
  title = {EQ-VAE: Equivariance Regularized Latent Space for
    Improved Generative Image Modeling},
  author = {Kouzelis, Theodoros and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
  booktitle = {arXiv},
  year = {2025},
}
        