Generative Modeling of Weights:
Generalization or Memorization?

1Princeton University, 2University of Pennsylvania

CVPR 2026


Abstract

Generative models have recently been explored for synthesizing neural network weights. These approaches take neural network checkpoints as training data and aim to generate high-performing weights during inference. In this work, we examine four representative, well-known methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or, at best, simple interpolations of the training checkpoints. Moreover, they fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further analysis suggests that this memorization might result from limited data, overparameterized models, and the underuse of structural priors specific to weight data. These findings highlight the need for more careful design and rigorous evaluation of generative models when applied to new domains.

Background: Generative Modeling of Weights

Building on the success of generative models in image and video synthesis, recent studies have applied them to synthesize weights for neural networks. These methods collect network checkpoints trained with standard gradient-based optimization and apply generative models to learn the weight distribution, producing new checkpoints that often perform comparably to conventionally trained weights.


To understand the fundamental mechanisms and the practicality of these methods, we ask:
have the generative models learned to produce distinct weights that generalize beyond the training ones,
or do they merely memorize and reproduce the training data?

We analyze four representative methods, covering different types of generative models and downstream tasks:


Hyper-Representations

Hyper-Representations trains an autoencoder on classification model weights from multiple runs with identical architectures, fits a kernel density estimate (KDE) to their latents, and samples new latents from the fitted distribution, which are then decoded into weights.
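The KDE sampling step can be sketched as follows. This is a minimal illustration under our own assumptions (toy latents, isotropic Gaussian kernel), not the paper's implementation: sampling from a Gaussian KDE amounts to picking a training latent at random and adding kernel-bandwidth noise, which is why a small bandwidth keeps samples close to the training latents.

```python
import numpy as np

def sample_kde(latents, bandwidth, n_samples, rng):
    """Draw samples from a Gaussian KDE fitted to encoder latents:
    pick a training latent uniformly at random, then add isotropic
    Gaussian noise scaled by the kernel bandwidth."""
    idx = rng.integers(0, len(latents), size=n_samples)
    noise = bandwidth * rng.standard_normal((n_samples, latents.shape[1]))
    return latents[idx] + noise

rng = np.random.default_rng(0)
latents = rng.standard_normal((100, 8))  # toy stand-in for encoded checkpoints
samples = sample_kde(latents, bandwidth=0.1, n_samples=5, rng=rng)
```

With a small bandwidth, each sample lies within a tight ball around some training latent, so the decoded weights can be expected to stay near training checkpoints.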

G.pt

G.pt is a conditional diffusion model trained on checkpoints from tens of thousands of runs. It generates weights for a small predefined model, given initial weights and a target loss.

HyperDiffusion

HyperDiffusion is an unconditional diffusion model trained on neural field MLPs representing 3D shapes. It generates new weights from which meshes can be reconstructed.

P-diff

P-diff trains an unconditional latent diffusion model on 300 checkpoints saved at consecutive steps during an extra training epoch of a base classification model, after it has converged.


Memorization in Weight Space

A natural first step in evaluating the novelty of generated weights is to find, for each generated checkpoint, its nearest training checkpoints and check whether the weight values are replicated.

Weight heatmap

We use heatmaps to visualize the model weights at randomly selected parameter indices. In each heatmap, the top row (outlined in red) is a random generated checkpoint, and the three rows below (separated by white lines) are the three nearest training checkpoints. We observe that for every generated checkpoint, at least one training checkpoint is nearly identical to it.

Distance to training weights

We visualize the distribution of the distance from each training and generated checkpoint to its nearest training checkpoint. For all methods except p-diff, the generated checkpoints are significantly closer to the training checkpoints than the training checkpoints are to one another. This indicates that these methods produce models with lower novelty than training a new model from scratch.
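The two distance distributions above can be computed as follows. This is an illustrative sketch, assuming checkpoints are flattened into weight vectors and compared by L2 distance; for the training checkpoints themselves, the trivial self-match is excluded (leave-one-out).

```python
import numpy as np

def nearest_distances(queries, train, exclude_self=False):
    """L2 distance from each query checkpoint (flattened weights) to its
    nearest training checkpoint. Set exclude_self=True when the queries
    are the training checkpoints themselves (in the same order), so that
    each checkpoint is compared to the *other* training checkpoints."""
    d = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    if exclude_self:
        np.fill_diagonal(d, np.inf)  # drop the zero-distance self-match
    return d.min(axis=1)

# Toy 2-D "weight vectors": 3 training and 2 generated checkpoints.
train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
gen = np.array([[0.1, 0.0], [4.9, 0.0]])
gen_d = nearest_distances(gen, train)                         # generated-to-train
train_d = nearest_distances(train, train, exclude_self=True)  # train-to-train
```

If the generated-to-train distances are much smaller than the leave-one-out train-to-train distances, the generated checkpoints are less novel than a freshly trained model would be.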


Memorization in Model Behaviors

Beyond similarity in weight space, we also compare the behaviors of generated models to the behaviors of their nearest training models, and assess whether generative modeling methods differ from a simple noise-addition baseline for creating new weights.

Model outputs

We show the decision boundaries or reconstructed 3D shapes of randomly selected generated checkpoints and their nearest training checkpoints. The generated and nearest training models produce highly similar predictions in image classification, or reconstruct nearly identical 3D shapes. This suggests that generated weights also closely resemble training weights in model behaviors.

Accuracy-novelty trade-off

We evaluate generated checkpoints by test accuracy (higher is better) for classification models and point cloud distance to test shapes (lower is better) for neural fields. Novelty is measured by maximum prediction error similarity to training checkpoints (lower is better) or point cloud distance to training shapes (higher is better). We compare them to a simple baseline that adds noise to training weights. All methods except p-diff fail to outperform this baseline in obtaining novel and simultaneously high-performing models.
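The noise-addition baseline is simple to state in code. The sketch below is our own minimal formulation: perturb each flattened training checkpoint with isotropic Gaussian noise, where the noise scale sigma sweeps out the accuracy-novelty trade-off (larger sigma means more novelty but typically lower accuracy).

```python
import numpy as np

def noise_baseline(train_weights, sigma, rng):
    """Baseline for creating 'new' weights: add isotropic Gaussian noise
    to each flattened training checkpoint. sigma controls the trade-off
    between novelty (distance from training weights) and accuracy."""
    return train_weights + sigma * rng.standard_normal(train_weights.shape)

rng = np.random.default_rng(0)
checkpoints = np.zeros((4, 10))  # toy stand-in for 4 flattened checkpoints
noisy = noise_baseline(checkpoints, sigma=0.05, rng=rng)
```

Sweeping sigma traces a curve in the accuracy-novelty plane; a generative model is only interesting if it lands above this curve.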


P-diff Generates by Interpolation, Not Generalization

Different from the other methods, p-diff's training checkpoints are saved at consecutive training steps rather than from different training runs. We seek to understand why p-diff can outperform the noise-addition baseline in the accuracy-novelty trade-off.

P-diff's generated weight values concentrate around the average of the training values. Averaging the weights of models fine-tuned from the same base model is known to improve accuracy, so the generated models may achieve higher accuracy by interpolating the training weights.

Figures: accuracy-novelty trade-off; t-SNE of weight values.

We generate new models using two baselines ("averaged" and "gaussian") that approximate interpolations of training weights. We find that both the weight values and behaviors of the generated models closely match those of models from the interpolation baselines.
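The two interpolation baselines can be sketched as below. The names "averaged" and "gaussian" come from the text, but these implementations are our assumptions of how such interpolations of training weights could be approximated, on toy flattened checkpoints.

```python
import numpy as np

def averaged_baseline(train, k, rng):
    """'averaged' baseline (our assumption): mean of k randomly
    chosen training checkpoints, a direct weight interpolation."""
    idx = rng.choice(len(train), size=k, replace=False)
    return train[idx].mean(axis=0)

def gaussian_baseline(train, rng):
    """'gaussian' baseline (our assumption): sample each parameter
    independently from a Gaussian fitted to the per-parameter mean
    and standard deviation of the training checkpoints."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return mu + sd * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
train = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
avg = averaged_baseline(train, k=3, rng=rng)  # k = all -> exactly the mean
g = gaussian_baseline(train, rng)
```

Both baselines stay inside (or near) the convex hull of the training weights, which is exactly the behavior we observe for p-diff's generated checkpoints.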

Understanding Memorization in Weight Generation

We further analyze these methods and find that limited data, overparameterized models, and the underuse of structural priors in weight data likely contribute to this memorization.

Limited Data and Overparameterized Models

Scaling up data is a potential solution for memorization. Expanding the training dataset of G.pt from 2.1M to 20.4M checkpoints can effectively reduce memorization without degrading the performance of the generated weights.

Generative models of weights are overparameterized, enabling memorization. Even when the training checkpoints' weights are random initializations, HyperDiffusion can still memorize these weights without learning meaningful patterns.

Underused Structural Priors in Weight Data

Neural networks have symmetries: certain transformations (e.g., permutation and scaling) can be applied to the weights without changing the model's behavior. Among the four methods, G.pt and Hyper-Representations leverage permutation symmetry, but only as a form of data augmentation. We evaluate whether such augmentations provide meaningful benefits for generative modeling.
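Permutation symmetry is easy to verify concretely. The sketch below, assuming a toy two-layer ReLU MLP, permutes the hidden units: the rows of the first layer (and its bias) and the columns of the second layer are permuted consistently, so the network's function is unchanged.

```python
import numpy as np

def permute_hidden(W1, b1, W2, perm):
    """Apply a function-preserving permutation to a two-layer MLP:
    reorder hidden units by permuting W1's rows, b1's entries, and
    W2's columns with the same permutation."""
    return W1[perm], b1[perm], W2[:, perm]

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((16, 4)), rng.standard_normal(16)
W2 = rng.standard_normal((3, 16))
perm = rng.permutation(16)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

x = rng.standard_normal(4)
y_orig = W2 @ np.maximum(W1 @ x + b1, 0.0)    # ReLU MLP forward pass
y_perm = W2p @ np.maximum(W1p @ x + b1p, 0.0)  # identical function
```

The two forward passes agree exactly, even though the weight vectors are far apart in weight space, which is why symmetry-aware modeling (rather than mere augmentation) matters.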

We apply function-preserving transformations to the training weights and reconstruct both the original and transformed weights using the Hyper-Representations autoencoder. The resulting reconstructions differ substantially in accuracy and predictions. For reference, we report the average accuracy difference and prediction similarity between different untransformed training models ("original"). These results suggest that symmetry-based data augmentation alone is insufficient to train the autoencoder to fully capture weight-space symmetries.

We add 1, 3, and 7 random weight permutations as data augmentation for training HyperDiffusion, effectively enlarging the dataset by factors of ×2, ×4, and ×8, respectively. Even when we add only a single permutation, HyperDiffusion fails to produce meaningful shapes.

BibTeX

@inproceedings{zeng2026generative,
  title={Generative Modeling of Weights: Generalization or Memorization?},
  author={Boya Zeng and Yida Yin and Zhiqiu Xu and Zhuang Liu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026},
}