Today we consider one of the most beautiful recent results I have encountered in deep learning theory—and I mean “beautiful” in the most literal sense. This is a paper by Ildus Sadrtdinov and collaborators from Dmitry Vetrov's group at Constructor University in Bremen.
The pitch sounds almost too good to be true: the stationary behavior of stochastic gradient descent (SGD) in scale-invariant neural networks is described by the thermodynamics of an ideal gas! Not as a loose metaphor, not as a “well, this kind of looks like entropy” hand-wavy intuition, but as a quantitative correspondence with specific, experimentally verifiable predictions. Maxwell's relations—yes, Maxwell's relations from your first-year physics course—describe how learning rate and weight decay affect the entropy of the stationary weight distribution, and the results check out on real neural networks.
Two warnings before we begin: first, this post is heavy on equations and light on pretty pictures; but in my opinion, the equations themselves are prettier than any picture. Second, yes, it actually is a little bit too good to be true, and we’ll get there. Let’s dig in!
A Quick Thermodynamics Refresher

Before we can appreciate the analogy, we need to recall some physics. I will try to explain everything from scratch, assuming you are comfortable with SGD and normalization layers but may have forgotten your thermodynamics. (I certainly had to dust off mine.)
What Thermodynamics Is About
Thermodynamics studies the macroscopic behavior of systems with a huge number of particles. Each individual molecule bounces around chaotically, but collectively they are described by just a handful of macroscopic variables: temperature T, pressure p, volume V, internal energy U, and entropy S.
If you have never taken a physics course, or if the last time you thought about thermodynamics was in high school, here is the one-sentence version. Thermodynamics is the art of ignoring the details. A box of gas contains something like 10²³ molecules, each with its own position and velocity, but we do not need to track any of them individually. Instead, we describe the whole system with five or six numbers defined above—temperature, pressure, volume, energy, entropy—and then we can derive remarkably precise predictions only from those macroscopic parameters. The miracle of thermodynamics is that this works at all: that a system with an incomprehensible number of moving parts admits such a compact description.
If that reminds you of something closer to home, say a neural network with millions of parameters whose behavior we try to summarize with a loss curve and a few hyperparameters, you are already thinking in the right direction. The analogy with neural networks is already suggesting itself: instead of gas molecules, we have network parameters; instead of chaotic thermal motion, we have the noise of stochastic gradients. The question is whether this analogy can be made precise.
The First and Second Laws
The first law of thermodynamics is just energy conservation:

$$dU = \delta Q - p\,dV.$$
Here dU is the change in internal energy, δQ is the heat added to the system, and pdV is the work done by the system as it expands. If you heat a gas, that energy goes into either raising the gas's internal energy or pushing a piston.
The second law says that in a closed system, entropy (a measure of disorder, or more precisely, of the number of microstates compatible with the observed macrostate) never decreases:

$$dS \ge 0.$$
Equality holds only in a reversible (quasistatic) process, a process so slow that the system is always approximately in equilibrium. In practice, real processes are irreversible: they generate extra entropy. The universe trends toward disorder, and if you want to create order in one place, you inevitably create more disorder somewhere else.
It is worth pausing on what entropy actually means here, because the word is used a lot in machine learning contexts and its meaning is not always clear. In thermodynamics, entropy measures how many different microscopic arrangements (microstates) are compatible with the same macroscopic observation. A gas that could be in any of a trillion configurations has high entropy; a crystal where every atom is locked in place has low entropy. The second law says that natural processes tend to move from fewer possible arrangements to more possible arrangements, that is, towards higher entropy. This is actually just statistics: there are overwhelmingly more disordered configurations than ordered ones, so a system left to itself will almost certainly drift to disorder. There is an exact parallel in SGD that you may know from your optimization background: when left to itself with a finite learning rate, the optimizer does not sit still at the minimum but wanders around it, exploring many nearby configurations; entropy in this context is precisely a measure of how many such configurations it tends to visit.
The Gibbs Distribution: The Loss Function of Physics
Now we arrive at the single most important object for our analogy. In thermodynamic equilibrium at temperature T, the probability of finding the system in microstate i with energy Ei is given by the Gibbs distribution (also known as the Boltzmann distribution):

$$p_i = \frac{e^{-E_i/T}}{Z}, \qquad Z = \sum_j e^{-E_j/T}$$

(I set the Boltzmann constant to 1 throughout).
If you work with neural networks, you will immediately recognize this as a softmax over energies. In thermodynamics, the temperature T here plays exactly the role you would expect: as T tends to 0, all the probability mass concentrates on the lowest-energy state, that is, the system freezes into its ground state; this is a direct counterpart of gradient descent converging to the loss minimum. At high T, the distribution becomes more uniform, which is similar to SGD with a large learning rate wandering broadly across parameter space.
If you have worked with language models, you have almost certainly encountered temperature scaling at inference time: dividing the logits by a temperature parameter before applying softmax. The Gibbs distribution is exactly the same operation, just applied to energies instead of logits. At low temperature, the model (or the gas) becomes “greedy” and overwhelmingly picks the single best option. At high temperature, it becomes more exploratory, spreading probability mass across many options. The reason temperature scaling in language models is called “temperature” is precisely because of this connection to statistical physics.
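To make the connection concrete, here is a minimal numpy sketch (the energy values are made up for illustration): the Gibbs distribution really is a softmax over negative energies, greedy at low temperature and nearly uniform at high temperature.

```python
import numpy as np

def gibbs(energies, T):
    """Gibbs/Boltzmann distribution: a softmax over -energies / T."""
    z = np.exp(-(energies - energies.min()) / T)  # subtract min for stability
    return z / z.sum()

energies = np.array([1.0, 1.5, 3.0, 3.2])

cold = gibbs(energies, T=0.05)   # nearly all mass on the ground state
hot = gibbs(energies, T=100.0)   # nearly uniform

assert cold[0] > 0.99
assert np.allclose(hot, 0.25, atol=0.01)
```

This is exactly the operation behind sampling from a language model at different decoding temperatures, just with logits in place of negative energies.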
From this distribution, we can derive two key quantities: the average (internal) energy and the entropy,

$$U = \sum_i p_i E_i, \qquad S = -\sum_i p_i \log p_i.$$
Thermodynamic Potentials: What Gets Minimized
So far the analogies have been very straightforward, and this was almost high school physics; now the time has come for subtler and more interesting things. Depending on which variables you hold fixed, the system minimizes different quantities, known as thermodynamic potentials:
- for fixed T and V, the system minimizes the Helmholtz free energy F = U – TS;
- for fixed T and p, the system minimizes the Gibbs free energy G = U – TS + pV.
For example, the Helmholtz free energy tells us that the system wants to have low energy U (i.e., to settle into a deep minimum), but it also wants high entropy S (to stay spread out). Temperature T controls the balance between these two competing desires. This should already sound very familiar to you: indeed, this is exactly parallel to regularization, which controls the tradeoff between fitting the training data and keeping the model's weights “simple”.
To make this analogy even more specific: imagine you are training a neural network and you have two knobs. One knob controls how hard the optimizer tries to minimize the training loss (driving toward low energy U). The other controls how much you penalize the model for being too “concentrated” in parameter space, that is, for having all its weight in one narrow region rather than being spread across many roughly-equivalent configurations (favoring high entropy S). The balance between these two is exactly what the Helmholtz free energy F = U − TS captures. Temperature T is the exchange rate: it tells you how many units of energy you are willing to pay for one unit of entropy. At low temperature, you want the sharpest possible minimum; at high temperature, you prefer a broad basin.
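The variational principle behind F can be checked numerically: among all probability distributions over a toy set of energy levels, the Gibbs distribution is exactly the one with the lowest Helmholtz free energy. A minimal sketch (the energies and the temperature are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
E = np.array([0.0, 1.0, 2.0, 5.0])   # toy energy levels
T = 0.7

def free_energy(q):
    """F[q] = U - T*S with U = <E>_q and S = -sum q log q."""
    return float(q @ E + T * (q @ np.log(q)))

gibbs = np.exp(-E / T)
gibbs /= gibbs.sum()
f_gibbs = free_energy(gibbs)         # equals -T * log Z

# No randomly sampled distribution beats the Gibbs distribution.
for _ in range(10000):
    q = rng.dirichlet(np.ones(len(E)))
    assert free_energy(q) >= f_gibbs - 1e-9
```

The temperature T is literally the exchange rate between the two terms: raise it and the minimizer spreads out; lower it and the minimizer concentrates on the ground state.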
Maxwell Relations
From the conditions that thermodynamic potentials are minimized, we can derive Maxwell relations, equalities between partial derivatives of state variables. For example, from the Gibbs free energy we get that

$$\left(\frac{\partial S}{\partial p}\right)_T = -\left(\frac{\partial V}{\partial T}\right)_p.$$
In plain English, this means that the way entropy changes when you vary pressure (at constant temperature) is tied to the way volume changes when you vary temperature (at constant pressure). The left-hand side deals with entropy and pressure; the right-hand side, with volume and temperature.
These are apparently different things, yet thermodynamics says they have a direct connection. Maxwell relations are powerful because they let you measure quantities that are hard to observe directly, like entropy, through quantities that are easy to measure: volume, pressure, and temperature.
Why do Maxwell relations matter for us? Because in the neural network setting, entropy—which measures how spread out the weight distribution is—is extremely hard to estimate directly, especially when dimensions reach tens of thousands. But if we have a Maxwell relation that connects entropy derivatives to something we can measure (like how the weight norm changes when we tweak the learning rate), we suddenly have an indirect window into a quantity that would otherwise be inaccessible. This is exactly the trick the authors will pull off later in the paper.
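The trick is easy to see numerically even in the pure physics setting. For an ideal gas, entropy is known up to an additive constant, so a finite-difference sketch can confirm that the hard-to-measure side of the Maxwell relation matches the easy-to-measure side (constants below are for a monatomic gas; the entropy formula is the standard ideal-gas shape):

```python
import numpy as np

R = 8.314
Cp = 5 * R / 2                 # monatomic ideal gas

def S(T, p):
    """Entropy of an ideal gas, up to an additive constant."""
    return Cp * np.log(T) - R * np.log(p)

def V(T, p):
    """Ideal gas law."""
    return R * T / p

T0, p0, h = 300.0, 1e5, 1e-4
dS_dp = (S(T0, p0 * (1 + h)) - S(T0, p0 * (1 - h))) / (2 * h * p0)
dV_dT = (V(T0 * (1 + h), p0) - V(T0 * (1 - h), p0)) / (2 * h * T0)

# Maxwell relation from G: (dS/dp)_T = -(dV/dT)_p
assert np.isclose(dS_dp, -dV_dT, rtol=1e-6)
```

The entropy derivative on the left is something you could never observe directly in a lab; the volume derivative on the right is a ruler-and-thermometer measurement.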
The Ideal Gas: The Simplest Model
The ideal gas is the simplest possible thermodynamic model: particles that do not interact with each other. Its behavior is captured by the equation of state

$$pV = RT,$$

where R is the gas constant. This is the Clapeyron-Mendeleev equation, or ideal gas law, arguably the most famous formula in thermodynamics. For an ideal gas, the heat capacities CV (at constant volume) and Cp (at constant pressure) are constants, with Cp – CV = R. In an adiabatic process (with no heat exchange, δQ = 0), the system satisfies pV^γ = const, where γ = Cp / CV.
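These pieces fit together, and we can sanity-check them numerically: walking in small steps along the adiabat pV^γ = const and applying the first law at every step, the heat exchanged comes out as zero to numerical precision. A small sketch (monatomic gas, made-up initial conditions):

```python
import numpy as np

R = 8.314
gamma = 5 / 3                      # monatomic ideal gas
Cv = R / (gamma - 1)               # so that Cp - Cv = R and Cp / Cv = gamma

p1, V1 = 1e5, 1.0
V = np.linspace(V1, 0.5, 100001)   # slow (quasistatic) compression
p = p1 * (V1 / V) ** gamma         # stay on the adiabat p V^gamma = const
T = p * V / R                      # ideal gas law at every point

# First law per step: dQ = dU + p dV, with dU = Cv dT for an ideal gas.
dQ = Cv * np.diff(T) + 0.5 * (p[1:] + p[:-1]) * np.diff(V)
assert np.all(np.abs(dQ) < 1e-6 * Cv * T[0])   # no heat exchanged
```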
That is all the physics we need. Now for the main event: what does any of this have to do with neural networks?
Scale-Invariant Networks

Scale Invariance
Nearly all modern neural network architectures include normalization layers: BatchNorm, LayerNorm, RMSNorm, and their variants. The key property of normalization is that the output of a normalized layer does not depend on the scale of its inputs. If you multiply all the weights feeding into a normalization layer by a positive constant α > 0, the layer's output stays the same.
For a fully scale-invariant network (one where normalization is placed wherever needed), the gradient and the loss satisfy

$$\nabla L(\alpha w) = \frac{1}{\alpha}\,\nabla L(w), \qquad L(\alpha w) = L(w) \quad \text{for all } \alpha > 0.$$
The second property is the crucial one. It means the loss depends only on the direction of the weight vector w/||w|| on the unit sphere, not on its norm r = ||w||. The parameter space naturally decomposes into a radius r (the norm) and a direction w/||w||, a point on the unit sphere. The loss lives entirely on the sphere, while the radius is a free variable.
This decomposition into radius and direction is the single most important structural insight in the paper, so it is worth discussing in more detail. Think of it this way: if the loss only depends on the direction of the weight vector, then the entire loss landscape lives on the surface of a high-dimensional sphere. The radius, which shows how far you are from the origin, is irrelevant to the loss itself, but it is relevant to the optimization dynamics because it controls the effective step size. Two networks with the same weight direction but different norms will compute identical outputs, yet SGD will behave very differently for them. This is where thermodynamics enters the picture: the radius becomes a macroscopic degree of freedom, analogous to the volume of a gas, that interacts with the “microscopic” dynamics on the sphere.
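All three consequences of scale invariance (invariant loss, gradients shrinking as 1/α, gradients orthogonal to the weights) are easy to verify on a toy example. The sketch below uses a made-up scale-invariant loss, a linear model that normalizes its weight vector, and numerical gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), 0.3

def loss(w):
    """A toy scale-invariant loss: only the direction w / ||w|| matters."""
    return (w / np.linalg.norm(w) @ x - y) ** 2

def num_grad(f, w, h=1e-6):
    """Central-difference numerical gradient."""
    return np.array([(f(w + h * e) - f(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

w = rng.normal(size=8)
alpha = 3.0

assert np.isclose(loss(alpha * w), loss(w))          # L(aw) = L(w)
g, g_scaled = num_grad(loss, w), num_grad(loss, alpha * w)
assert np.allclose(g_scaled, g / alpha, atol=1e-5)   # grad(aw) = grad(w) / a
assert abs(w @ g) < 1e-5                             # gradient orthogonal to w
```

The orthogonality in the last line is why the radius and the direction decouple so cleanly: the gradient never pushes along the radial direction, only around the sphere.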
Effective Learning Rate
For scale-invariant networks, the quantity that actually controls the dynamics of the loss is not the nominal learning rate η but the effective learning rate (ELR):

$$\eta_{\mathrm{eff}} = \frac{\eta}{\|w\|^2}.$$
This makes intuitive sense: if the weights are large, the gradients are small (by the inverse scaling property), and the effective step on the unit sphere is tiny. Weight decay, in this context, does not act as a traditional regularizer. Instead, it controls the learning rate indirectly: by shrinking the weight norm, weight decay increases the effective learning rate.
This reinterpretation of weight decay is surprising and important: in scale-invariant networks, weight decay is not a regularizer in the traditional sense. It does not directly penalize large weights to prevent overfitting. Instead, it acts as an indirect learning rate controller. Shrinking the weight norm makes the effective learning rate larger, which means noisier, more exploratory updates on the unit sphere. This idea answers why weight decay seems to help even in settings where L2 regularization should not matter at all (because the network is scale-invariant and the norm does not affect the output): weight decay helps not by regularizing, but by controlling the effective temperature of the optimization process.
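The mechanism fits into a one-line recursion for the squared norm. This is a sketch under the isotropic-noise assumption, using the two facts above (gradients orthogonal to w, scaling as 1/‖w‖); all constants are made up. More weight decay means a smaller stationary norm and hence a larger effective learning rate:

```python
import numpy as np

def stationary_norm(eta, lam, sigma2=1.0, steps=200000, r2=1.0):
    """Iterate the squared-norm recursion of SGD on a scale-invariant loss:
    the weight-decay step shrinks w, while the (orthogonal) gradient step,
    whose magnitude scales as 1/||w||, inflates it."""
    for _ in range(steps):
        r2 = (1 - eta * lam) ** 2 * r2 + eta ** 2 * sigma2 / r2
    return r2

eta = 0.1
r2_small, r2_large = stationary_norm(eta, 1e-4), stationary_norm(eta, 1e-2)

# More weight decay -> smaller stationary norm -> larger effective LR.
assert r2_large < r2_small
assert eta / r2_large > eta / r2_small

# The fixed point matches r*^4 = eta * sigma^2 / (2 * lambda) to leading order.
assert np.isclose(r2_large ** 2, eta * 1.0 / (2 * 1e-2), rtol=0.05)
```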
The Stationary Distribution of SGD
At this point, we get to a fact that is essential for everything that follows. SGD with a finite learning rate does not converge to a single point. Instead, it converges to a stationary distribution around the minimum. The noise from mini-batch sampling prevents the algorithm from ever settling down: it wanders forever around the minimum, and after long enough training, you can meaningfully talk about the distribution of weights.
This is the primary conceptual bridge to thermodynamics. The noise from stochastic gradients is the counterpart of thermal fluctuations. The stationary distribution of SGD is the counterpart of thermodynamic equilibrium.
This point deserves special emphasis because it goes against a common misconception. Many deep learning practitioners think of training as a process that “converges”, meaning that at some point the weights stop changing and you are done. In reality, with SGD and a finite learning rate, the weights never stop changing. They keep bouncing around stochastically, and what “convergence” really means is that the statistics of this bouncing have stabilized. The distribution of weights has reached a steady state, but any individual run is still wandering. This is exactly what happens to gas molecules in a room at constant temperature: each molecule is zooming around chaotically, but the macroscopic properties are perfectly stable. It is the distribution that is at equilibrium, not any individual particle.
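A one-dimensional toy makes this concrete: SGD on a quadratic with additive gradient noise never stops moving, yet its statistics settle. For this linear toy chain the stationary variance is known in closed form, η σ²/(2 − η), which the simulation reproduces:

```python
import numpy as np

rng = np.random.default_rng(1)
eta, sigma = 0.1, 1.0
w, ws = 5.0, []

# SGD on L(w) = w^2 / 2 with additive gradient noise: the iterate never
# stops moving, but the *distribution* of w reaches a steady state.
for t in range(100000):
    g = w + sigma * rng.normal()
    w -= eta * g
    ws.append(w)

tail = np.array(ws[50000:])
assert np.std(np.diff(tail)) > 0.05                          # still bouncing
# Stationary variance of this AR(1) chain: eta * sigma^2 / (2 - eta).
assert np.isclose(tail.var(), eta * sigma**2 / (2 - eta), rtol=0.1)
```

“Convergence” here means the variance line has flattened, not that the trajectory has stopped.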
The Translation Dictionary
With all the pieces in place, we can now write down the full correspondence between optimization and thermodynamics:

| Optimization | Thermodynamics |
| --- | --- |
| training loss (implicit potential Φ) | internal energy U |
| stochastic gradient noise | thermal fluctuations |
| ELR × gradient noise variance | temperature T |
| weight decay coefficient λ | pressure p |
| half the squared weight norm, ‖w‖²/2 | volume V |
| entropy of the stationary weight distribution | entropy S |
| (d − 1)/2, half the parameter dimension | gas constant R |
The first few rows are not entirely new. Several works (Jastrzębski et al., 2017; Chaudhari & Soatto, 2018, and others) had already established the energy–entropy–temperature part of the analogy. But the pressure and volume mappings—weight decay as pressure, half the squared norm as volume—are a novel contribution of Sadrtdinov et al., 2025.
Why these identifications? Consider the Gibbs free energy G = U – TS + pV. Translating into optimization, we get

$$G = \mathbb{E}\left[L(w)\right] - T\,S + \lambda\,\mathbb{E}\left[\frac{\|w\|^2}{2}\right].$$
The last term is precisely L2 regularization. So minimizing G at fixed T and p = λ is exactly equivalent to minimizing the regularized loss with an entropy bonus.
This is where the magic happens. We started with a table of loose analogies: “loss is kind of like energy, noise is kind of like temperature”. But by now, we have arrived at an exact identity: the thing that SGD minimizes at stationarity is the Gibbs free energy, with no approximations and no fudge factors, at least for scale-invariant networks with isotropic noise. We get exactly the same mathematical structure: the same equations, the same minimization principles, and the same relations between partial derivatives. The only question is whether this beautiful structure actually holds up when we move from theory to experiment.
Three Training Protocols and Their Thermodynamic Processes

The authors examine three training setups, each mapping to a different thermodynamic process.
Protocol 1: Training on a Fixed Sphere
The simplest case is to fix the weight norm r and optimize only the direction on the unit sphere. In practice, this means projecting the weights back onto the sphere after each gradient step; this is essentially the approach used in the nGPT paper by Loshchilov et al. (2024), where the entire Transformer lives on a hypersphere.
Here, there is no weight decay (the norm is fixed by construction), volume V = r²/2 is given, and the system minimizes Helmholtz free energy F = U – TS. Temperature is determined by the ELR and the noise variance:

$$T = \frac{\eta_{\mathrm{eff}}\,\sigma^2}{2(d-1)},$$

where σ² is the total variance of the gradient noise on the unit sphere and d is the number of parameters.
Protocol 2: Fixed ELR with Weight Decay
Now we allow the norm to evolve freely and add weight decay. The SDE approximation shows that the radius r evolves deterministically (!) and converges to a stationary value r*. Under an isotropic noise model:

$$(r^*)^2 = \frac{\eta_{\mathrm{eff}}\,\sigma^2}{2\lambda}.$$

If we now substitute V = (r*)2 / 2 and p = λ, we get that

$$pV = \lambda\,\frac{(r^*)^2}{2} = \frac{\eta_{\mathrm{eff}}\,\sigma^2}{4} = RT.$$
This is exactly the ideal gas law. The "gas constant" R = (d – 1)/2 is determined by the dimensionality of the parameter space, and the system minimizes the Gibbs free energy.
Again, this is the same kind of “magic moment” as before. We took a neural network, wrote down the dynamics of SGD with weight decay, computed the stationary norm of the weight vector, and we got the ideal gas law—the same equation that describes the air in this room. The “gas constant” is half the number of parameters, which for a typical network is a very large number. Again, this is not an approximation or a curve fit, it is an algebraic identity that follows from the structure of the problem. Of course, it depends on assumptions (scale invariance, isotropic noise) that real networks do not perfectly satisfy. But the fact that it works at all, and that it holds up in practice, is remarkable.
Protocol 3: Fixed Learning Rate (How We Actually Train)
This is the standard practice: fixed nominal learning rate η and weight decay λ. Here, the ELR is not fixed directly because it depends on the weight norm, which in turn is determined by the balance between a “centrifugal” force (gradient noise inflates the norm) and a “centripetal” force (weight decay pulls toward zero).
The temperature, as it turns out, depends on both hyperparameters:

$$T \;\propto\; \sigma\sqrt{\eta\lambda}.$$
Note that T is proportional to the square root of ηλ, not simply to η, as in the standard analogy for non-scale-invariant networks. But the ideal gas equation pV = RT still holds.
Experimental Validation

The authors do not stop at theory. They propose four empirical tests of the analogy and verify them first on an analytically solvable isotropic noise model, then on a real neural network (ResNet-18 on CIFAR-10).
V1: Stationary Radius
The first simple check is to verify that r* is proportional to (η/λ)^{1/4} for fixed LR. This is a direct consequence of the SDE, and it is not specific to the thermodynamic analogy, but it works as a sanity check. It passes cleanly, with deviations appearing only at large η and λ where the continuous-time SDE approximation breaks down (the authors attribute this to discretization error).
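This scaling is easy to reproduce in a toy simulation, not just in the scalar recursion. The sketch below runs SGD with weight decay on a made-up “pure noise” scale-invariant objective (isotropic noise in the tangent space, gradients orthogonal to w and scaling as 1/‖w‖; all constants are for illustration). Multiplying η/λ by 16 should double the stationary radius:

```python
import numpy as np

def simulate_radius(eta, lam, d=50, steps=20000, seed=0):
    """SGD with weight decay on a toy scale-invariant objective whose
    gradient is pure isotropic noise, orthogonal to w, scaling as 1/||w||."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    radii = []
    for t in range(steps):
        xi = rng.normal(size=d)
        xi -= (xi @ w) / (w @ w) * w        # project noise onto tangent space
        g = xi / np.linalg.norm(w)          # inverse scaling of gradients
        w = (1 - eta * lam) * w - eta * g
        radii.append(np.linalg.norm(w))
    return np.mean(radii[steps // 2:])

r1 = simulate_radius(eta=0.01, lam=0.01)
r2 = simulate_radius(eta=0.16, lam=0.01)    # eta / lam larger by 16x

# r* ~ (eta/lam)^(1/4): a 16x ratio change should double the radius.
assert np.isclose(r2 / r1, 16 ** 0.25, rtol=0.1)
```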
V2: Minimization of Thermodynamic Potentials
If the analogy is correct, the stationary distribution of SGD should minimize the appropriate potential (F or G) not just for the given hyperparameters, but among all stationary distributions induced by other hyperparameters.
The authors test this by computing G, for each pair (η, λ), over the stationary distributions induced by all the other pairs, and confirming that the minimum falls at the “correct” pair. On the isotropic model this works perfectly. On real neural networks, direct verification is harder because the potential Φ(w) is not known in closed form.
V3: Maxwell Relations
This is, in my opinion, the most impressive part of the paper. For fixed LR, the Maxwell relation takes an elegant form:

$$\frac{\partial S}{\partial \ln \eta} - \frac{\partial S}{\partial \ln \lambda} = \frac{d-1}{2}.$$
This is a specific verifiable prediction: the difference of entropy derivatives with respect to the logarithms of learning rate and weight decay equals half the parameter space dimension.
The authors estimate entropy using a nearest-neighbor estimator, approximate the dependence with a quadratic function, and find that the Maxwell relation holds experimentally to better than 2.5% accuracy for ResNet-18 on CIFAR-10.
Let me just pause and emphasize how remarkable this is. We take a real neural network, train it with different hyperparameters, estimate the entropy of the stationary weight distribution (which is itself nontrivial in a space of dimension about 44,000), and this entropy obeys a relation derived from an analogy with the ideal gas. This is the point where the metaphor turns into a quantitative check, and the check passes.
To give you a sense of why this is hard: estimating entropy in 44,000 dimensions from a finite number of samples is a notoriously difficult statistical problem. The curse of dimensionality means that nearest-neighbor distances behave very differently from what our low-dimensional intuition suggests. The fact that the authors manage to estimate entropy derivatives accurately enough to verify a precise quantitative prediction is a technical feat in itself. Getting within 2.5% on this number is indeed impressive.
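For the curious, here is what such an estimator looks like in a low-dimensional setting where we can check it against a known answer. This is the classic Kozachenko–Leonenko nearest-neighbor estimator with k = 1 (not necessarily the exact variant used in the paper), tested on a 2-D Gaussian whose entropy is ln(2πe) ≈ 2.84 nats:

```python
import numpy as np
from math import lgamma, pi, log

def kl_entropy(x):
    """Kozachenko-Leonenko nearest-neighbor entropy estimator (k=1, nats)."""
    n, d = x.shape
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    r = dists.min(axis=1)                          # nearest-neighbor distances
    log_cd = (d / 2) * log(pi) - lgamma(d / 2 + 1) # log volume of unit d-ball
    psi_n = log(n) - 1 / (2 * n)                   # digamma(n), large-n approx
    euler_gamma = 0.5772156649                     # -digamma(1)
    return psi_n + euler_gamma + log_cd + d * float(np.mean(np.log(r)))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 2))                     # standard 2-D Gaussian

true_h = log(2 * pi * np.e)                        # analytic entropy, ~2.84 nats
assert abs(kl_entropy(x) - true_h) < 0.15
```

In 2 dimensions with 2000 samples this lands within a few hundredths of a nat; in 44,000 dimensions the absolute value is hopelessly biased, which is exactly why the paper only trusts entropy *differences*.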
V4: The Adiabatic Process
An adiabatic process has no heat exchange (δQ = 0) and preserves entropy. For the isotropic model with a von Mises–Fisher distribution on the sphere, the authors show that γ = Cp / CV = 2, and the adiabatic invariant pV^γ = const at γ = 2 reduces to a constant η.
In other words, they confirm experimentally that if you fix the learning rate and vary only weight decay, the entropy of the stationary distribution stays constant. This experiment also provides a recipe for adjusting hyperparameters while preserving the entropy level: an adiabatic schedule.
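Using the stationary-radius formula from the isotropic model, the invariance is a two-line algebraic check: pV² depends on η and the noise but not on λ. A sketch with made-up constants (the stationary-radius expression is the isotropic-model one, up to the paper's exact normalization):

```python
import numpy as np

d, sigma2, eta = 1000, 1.0, 0.05        # made-up constants

def adiabatic_invariant(lam, gamma=2.0):
    """p * V^gamma for the SGD 'ideal gas': p = lambda, V = (r*)^2 / 2,
    with stationary radius (r*)^4 = eta * sigma^2 * (d - 1) / (2 * lam)."""
    r4 = eta * sigma2 * (d - 1) / (2 * lam)
    V = np.sqrt(r4) / 2
    return lam * V ** gamma

vals = [adiabatic_invariant(lam) for lam in (1e-5, 1e-4, 1e-3, 1e-2)]
# Fixed eta, varying weight decay: p V^2 (hence entropy) stays constant.
assert np.allclose(vals, vals[0])
```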
From Ideal Gas to Real Networks

Everything I have described so far assumes isotropic noise, that is, a noise covariance matrix of the form

$$\Sigma_w = \sigma^2 I.$$
Real neural networks, of course, have anisotropic noise.
What changes when we drop this assumption? The authors show that the overall structure survives, but with several important caveats.
1. The “energy” is no longer the training loss L(w) but an implicit potential Φ(w) that depends on both L(w) and the covariance matrix Σw. A closed-form expression for Φ is unknown in general (except for linear regression, where Kunin et al., 2021 derived it).

This is both a strength and a weakness of the work: a strength because the analogy is robust even when its assumptions are violated, but a weakness because we cannot yet explain why it is so robust. The authors suggest exploring the transition from an ideal gas to a real gas with a compressibility factor Z(p, T), writing V = Z(p, T) · RT / p. This is an attractive idea, but it remains to be developed.
This situation is actually quite common in physics: a simple model (ideal gas) captures the qualitative behavior and even many quantitative features, while a more refined model (real gas with interactions) is needed for full accuracy. The Van der Waals equation, for instance, modifies the ideal gas law to account for molecular attraction and finite molecular volume, and it does a much better job at high pressures and low temperatures. It would be fascinating to see a “Van der Waals equation for neural networks”: a corrected state equation that accounts for the anisotropy of gradient noise and possibly for correlations between parameters. But that is future work, and the ideal gas has already taken us surprisingly far.
Why Should You Care: Practical Implications

Alright, the analogy is beautiful and the experiments work out. But what does this mean for practitioners?
The thermodynamic picture does not immediately hand you a new optimizer or a magic learning rate schedule. What it does is something perhaps even more valuable: it gives you a new language and a set of constraints for reasoning about hyperparameters. Instead of thinking of learning rate and weight decay as two independent knobs that you tune by grid search, you can think of them as controlling temperature and pressure in a physical system, with all the relationships that entails. This does not eliminate the need for experiments, but it can significantly narrow the search space and tell you which combinations of changes are “thermodynamically consistent” and which are not.
Learning Rate Scheduling
Maxwell's relations give us a quantitative link between hyperparameters and entropy. If we want to control the rate at which entropy decreases (that is, how fast the weight distribution "collapses" to a narrow region around the minimum), the relation

$$\frac{\partial S}{\partial \ln \eta} - \frac{\partial S}{\partial \ln \lambda} = \frac{d-1}{2}$$
tells us precisely how to adjust η and λ. If the entropy collapse comes too soon, it means premature convergence to a sharp minimum, which leads to poor generalization. But an overly slow collapse wastes compute.
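In the isotropic ideal-gas model the bookkeeping is simple enough to put into a tiny helper. This is a sketch: the formula ΔS = (d − 1)/2 · Δln η follows from the Maxwell relation above combined with the experimental fact that varying λ alone is the adiabat (entropy-preserving), so weight decay drops out:

```python
import numpy as np

d = 44000   # rough parameter count of the ResNet-18 setup from the paper

def entropy_change(eta_start, eta_end):
    """Entropy change (nats) of the stationary distribution when the LR
    moves from eta_start to eta_end, in the isotropic ideal-gas model:
    dS = (d - 1)/2 * dln(eta). Weight decay does not enter, because
    varying it alone is the adiabat."""
    return (d - 1) / 2 * np.log(eta_end / eta_start)

# A 10x LR cooldown at the end of training removes this much entropy:
dS = entropy_change(0.1, 0.01)
assert dS < 0
```

Read as a schedule-design tool: pick how many nats of entropy you want to remove per epoch, and the helper tells you the corresponding LR decay factor.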
Weight Averaging
Stochastic weight averaging (SWA) requires a balance: each individual model should have low loss (small U), but the models should be diverse enough to benefit from averaging (large S). This is exactly the tradeoff controlled by temperature T.
In a prior paper, Sadrtdinov et al. (2024) showed that the optimal learning rate for weight averaging is often above the convergence threshold, which is fully consistent with the need to maintain high entropy.
Intuition for Hyperparameter Tuning
I especially liked how the thermodynamic picture makes the intuition behind hyperparameter choices vivid and almost obvious: it is essentially high school physics, or even just common sense. Here is the intuition:
- increase learning rate → raise temperature → broader exploration of the loss landscape;
- increase weight decay → raise pressure → reduce volume (weight norm) → at fixed temperature, this also increases the effective learning rate;
- adiabatic process → a recipe for changing hyperparameters while preserving entropy;
- cooldown (reducing LR at the end of training) → cooling a gas → condensation around the minimum.
Limitations and Future Directions

While beautiful, this is still a work highly constrained by its assumptions. Let me offer a few thoughts on what is not yet perfect and where the work might go next.
The scale invariance requirement is the most problematic part: full scale invariance is a very strong assumption. In real networks, even in standard CNNs, affine parameters in BatchNorm, skip connections, and the final linear layer all break it. In their experiments, the authors construct specially prepared networks by using BatchNorm without affine parameters and freezing the last layer. Extending the theory to non-scale-invariant architectures will require rethinking the notion of “volume” since the current definition relies on the deterministic evolution of the norm.
Other optimizers have not been considered yet: all of this applies only to SGD with a constant or scheduled learning rate. Adam and AdamW introduce a preconditioner, an additional matrix that rescales the gradient and depends on the weights and training history. This changes the balance between the “centrifugal” and “centripetal” forces, and the ideal gas equation will presumably have to be modified. Understanding how adaptive methods fit into the thermodynamic picture is an obvious and important open question.
Entropy estimation in high dimensions: the authors use a nearest-neighbor entropy estimator with bias of order O(N^{−2/d}). With d ≈ 44,000 and N = 1000 samples, the absolute bias is enormous. The authors assume that the bias is approximately the same across different stationary distributions (so that entropy derivatives are estimated correctly even if absolute entropy values are not), and this works empirically, but a rigorous justification is still missing.
Overparameterization: the authors show that for heavily overparameterized models (width multiplier k = 32 instead of k = 4), the analogy breaks down at small ELR since the network enters an “interpolation regime” where the noise vanishes and there is no stationary distribution. Thermodynamically, this corresponds to the degenerate case where T goes to zero. Interestingly, this is exactly the regime where phenomena like grokking are known to appear, so there may be more physics to uncover here.
The connection to grokking is particularly tantalizing. Grokking is the phenomenon where a heavily overparameterized network first memorizes the training data (achieving zero training loss with no generalization) and then, after continued training far beyond apparent convergence, suddenly learns to generalize (Power et al., 2022).

In the thermodynamic picture, the memorization phase corresponds to T → 0: the noise vanishes, the distribution collapses to a point, and there is no stationary distribution in the usual sense. But grokking shows that something interesting happens after this collapse: the network somehow escapes and finds a generalizing solution. Understanding this phase transition in thermodynamic terms could be a breakthrough, and the framework in this paper seems like exactly the right starting point.
Conclusion

The work of Ildus Sadrtdinov and colleagues from Vetrov's group demonstrates that the analogy between neural network training and thermodynamics is not just a pretty metaphor but a quantitatively precise correspondence. Well, at least for scale-invariant networks trained with SGD. The ideal gas equation pV = RT links weight decay, weight norm, learning rate, and gradient noise variance in a single formula. Maxwell's relations yield specific, testable predictions about entropy. Adiabatic invariants describe how to change hyperparameters while preserving entropy.
But perhaps the most important takeaway is not the specific formulas. It is the fact itself: statistical physics, developed to describe the behavior of gases and to engineer steam engines, turns out to be the right language for describing neural network training in our time.
Thermodynamics is the science of systems with a huge number of degrees of freedom and complex interactions that nevertheless admit simple macroscopic descriptions. Neural networks are also systems with a huge number of degrees of freedom and complex interactions. Maybe it is not so surprising after all that the same mathematics applies.
I feel like there is a deeper philosophical point here as well. For decades, deep learning theory has struggled with a fundamental question: why do neural networks generalize at all? Classical learning theory, with its VC dimensions and Rademacher complexities, consistently predicts that overparameterized networks should overfit catastrophically; but they do not. The thermodynamic perspective offers a fresh angle: maybe the right way to think about generalization is not through counting parameters or measuring capacity, but through entropy and free energy. A network driven by SGD into a high-entropy region of parameter space is, by definition, one that sits in a broad, flat basin, and broad basins are exactly the ones that generalize well. The thermodynamic framework makes this intuition precise and quantitative.
And I should confess that I have left out what is arguably the most technically interesting part of the paper: under the hood, both the physical system and the neural network weights evolve according to the same stochastic differential equations. But that is a story for another time. I very much hope there will be an occasion.

