munchler 3 days ago

In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.

And that's not even mentioning more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.

  • brandonpelfrey 3 days ago

    Are you already employing Bayesian optimization techniques? These are commonly used to explore spaces where evaluation is expensive.
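
    For example, scikit-optimize's gp_minimize (just one option, assumed here since no library is named above) wraps an expensive objective with a Gaussian-process surrogate. A minimal sketch, where the objective is a cheap stand-in for a real training run:

      import numpy as np
      from skopt import gp_minimize
      from skopt.space import Real, Integer

      # Search space: learning rate on a log scale, batch size as an integer.
      space = [
          Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
          Integer(8, 256, name="batch_size"),
      ]

      def objective(params):
          lr, batch_size = params
          # Stand-in for an expensive training run returning a validation
          # loss; replace this with the real training call.
          return (np.log10(lr) + 3.0) ** 2 + 0.01 * abs(batch_size - 64)

      # The Gaussian-process surrogate picks each next point via an
      # acquisition function, so only a small budget of real runs is spent.
      result = gp_minimize(objective, space, n_calls=25, random_state=0)
      print("best params:", result.x, "best loss:", result.fun)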

    • riedel 3 days ago

      They also depend on the design space being somewhat friendly in nature, so that it can be modelled by a surrogate and the explore/exploit trade-off can be encoded in an acquisition function.

      Successive halving, for example, also builds on assumptions about how the learning curve develops.

      Bottom line is that there are hyperparams for the hyperparam search again, so one starts building hyperparam heuristics on top of the hyperparam search.

      In the end there is no free lunch. But if a hyperparam search strategy somewhat works in a domain, it is a great tool. The good thing is that one can typically encode the design space in black-box optimization algorithms more easily.
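
      To make the successive-halving point concrete, a toy sketch (not any particular library's API; note that min_budget, eta and rounds are exactly the kind of extra knobs meant above):

        import math
        import random

        def train(config, budget):
            # Stand-in for partially training for `budget` steps and returning
            # a validation score (higher is better); replace with a real run.
            return -abs(math.log10(config["lr"]) + 2.5) + 0.01 * budget + random.gauss(0, 0.05)

        def successive_halving(configs, min_budget=1, eta=2, rounds=4):
            # Assumes early scores predict final ranking (the learning-curve
            # assumption): train everyone briefly, keep the top 1/eta, repeat
            # with a larger budget for the survivors.
            budget = min_budget
            for _ in range(rounds):
                ranked = sorted(configs, key=lambda c: train(c, budget), reverse=True)
                configs = ranked[: max(1, len(ranked) // eta)]
                budget *= eta
                if len(configs) == 1:
                    break
            return configs[0]

        random.seed(0)
        candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
        print(successive_halving(candidates))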

  • jampekka 3 days ago

    I've been wondering how the training process of the huge models works in practice. If an optimization run costs millions, they probably don't just run a grid of hyperparameters.

    • yorwba 3 days ago

      Run a grid of hyperparameters for small models of different sizes to find out how the optimal values change as you scale up (the "scaling laws"), then extrapolate to predict performance at even larger scales, then do a single large run and hope that your predictions aren't too far off.
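
      Roughly like this, as a sketch with made-up numbers (real scaling-law fits are more careful than a straight line in log-log space, but the idea is the same):

        import numpy as np

        # Made-up (model size, validation loss) pairs from cheap small runs.
        sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
        losses = np.array([4.10, 3.62, 3.15, 2.78, 2.45])

        # Scaling laws are roughly linear in log-log: loss ~ a * size**slope.
        slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
        predict = lambda n: np.exp(intercept) * n ** slope

        # Extrapolate to the one big run you can afford, and hope.
        print("predicted loss at 1e10 params:", predict(1e10))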

  • logicchains 3 days ago

    That's the advantage of deep learning over traditional ML: if you've got enough data, you don't need domain knowledge or hyperparameter tuning, just throw a large enough universal approximator at it. The challenge lies in generating good enough artificial data for domains without enough data, and getting deep models to perform competitively with simpler models.

MITSardine 3 days ago

Great article, very clear. I was particularly curious because I've observed functions of the form

f : x -> argmin_y g(x,y)

are generally not smooth, like not even continuous, even though, as the article points out, every step of the program to approximate argmin is differentiable.

There are several issues with this kind of function. First, it's not even necessarily well defined: there could be two or more argmin values for a given x (take g(x,y) = (y^2 - x)^2, for which both y = sqrt(x) and y = -sqrt(x) are argmins whenever x > 0). Your algorithm will find just one of them, but the underlying mathematical object is still ill-defined.

A second issue is that optimization algorithms don't converge to the argmin to begin with, but to local minima, and the value can differ dramatically in a completely unpredictable way. "Converge" is also an important word: they stop before reaching the exact value, and if g has minima that differ by less than your floating-point representation can resolve (an extreme case), the problem is effectively numerically ill-posed even though it may be mathematically well-posed.
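
To see the initialization dependence concretely, a small sketch with scipy (my own toy g, not anything from the article): g(x,y) = (y^2 - 1)^2 + x*y has two basins, near y = -1 and y = +1, and which one "argmin" returns depends entirely on where the optimizer starts.

  from scipy.optimize import minimize

  def g(y, x):
      # y arrives as a length-1 array from scipy.
      return (y[0] ** 2 - 1) ** 2 + x * y[0]

  def f(x, y0):
      # "argmin_y g(x, y)" as a local optimizer actually computes it.
      return minimize(g, x0=[y0], args=(x,)).x[0]

  # For x = 0.1 the global argmin is near y = -1, but started at y0 = +1
  # the optimizer settles into the other basin, near y = +1.
  print(f(0.1, y0=-1.0), f(0.1, y0=+1.0))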

Lastly, even when everything is well behaved (a single global optimum, an ideal optimizer that finds it exactly, an infinitely smooth g), consider:

- g(x,y) = sin(x)sin(y) defined on [-pi/2,pi/2]^2

- for x in ]0,pi/2], sin(x) > 0, thus sin(x)sin(y) is minimized for y = -pi/2

- for x in [-pi/2,0[, sin(x) < 0, thus sin(x)sin(y) is minimized for y = pi/2

It follows that f is a step function, equal to pi/2 for negative x and -pi/2 for positive x (a shifted and scaled Heaviside), which is not even continuous. In conclusion, even when everything is very simple and perfectly behaved, any approach that relies on differentiating the result of a minimization procedure is standing on some very fragile scaffolding.
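
A quick numerical check of this example (a sketch; brute-force argmin over a y-grid, so no optimizer error gets in the way):

  import numpy as np

  ygrid = np.linspace(-np.pi / 2, np.pi / 2, 2001)

  def f(x):
      # f(x) = argmin_y sin(x)*sin(y) over y in [-pi/2, pi/2]
      return ygrid[np.argmin(np.sin(x) * np.sin(ygrid))]

  for x in (-0.01, -1e-6, 1e-6, 0.01):
      print(x, f(x))
  # Prints +pi/2 for the negative x values and -pi/2 for the positive ones:
  # f jumps at x = 0 even though g is smooth everywhere.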

For these reasons, I find the article stops just short of the most important part: it assesses the smoothness of such a function f empirically, but it does not explain how you might go about regularizing one found in the wild. This is the most crucial step in my opinion, because otherwise we're still somewhat stuck, effectively tinkering blindly in most cases. I don't mean any disrespect to what's otherwise very interesting and clearly thorough work, but since this is posted on a "layman forum", I just meant to point out what I think is an important limitation to how far-reaching the work here is.

---

Typo in Definition 2: 2hv? Or is this related to h(\Delta_f(z+h;h) - \Delta_f(z;h))? (unsure if it's the authors posting this)

I personally also find the \Delta_f notation confusing; I think a more elegant notation, coherent with the paper's other notation, would have been

\frac{\delta f}{\delta v}(z;h)

Note that df/dx is a directional derivative wrt the basis vector x, so it's no trouble putting any v there. Lowercase delta is often used for a discrete difference, so I think this fits the finite-difference gradient well. As for the (z;h), this mirrors how the differential is often written: at z and applied to h. (The h is lost in the article's \Delta_f notation.)

Question on Definition 2: why take the sign of the vectors and then compute a weighted scalar product of those, rather than just computing the scalar product of the raw vectors and normalizing by their norms? That would be in [-1,1] as well, so I'm curious why this particular measure was chosen.