https://www.manning.com/books/build-a-large-language-model-f...
OP here -- that's the one! Highly recommended.
I never thought about or tested dropout on transformers as much as the author did, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.
Same here -- I was never able to debug why dropout above 5% really hurt convergence speed for my toy LLMs. I chalked it up to the models not having enough parameters to fit FineWeb, and just stopped using it.
My intuition is very undeveloped on this, but it makes some kind of sense to me that dropout would make convergence slower, because you're ignoring a bunch of parameters in every batch. The goal seems to be to get a better, more general model by trading off some training time.
The Llama thing is interesting, though!
> So you'd call the dropout function on the activations from each layer, zeroing out some at random so that they don't contribute to the "downstream" calculations. (As I understand it, this means that they are also not adjusted during back-propagation -- if nothing else, it would be terribly unfair to the poor ignored neurons to have their weights changed when they didn't contribute to the error.)
If the weights are effectively set to zero by the dropout, shouldn't the propagated error in the backward pass be zero too, automatically?
(I.e., as I understand it, OP's intuitive notion of "fairness" is literally how the error propagation works: Neurons are adjusted by the degree by which they contributed to the output)
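For concreteness, here's a minimal numpy sketch of the kind of dropout function the quoted passage describes -- the function name, shapes, and rates are my own, and I've included the usual "inverted dropout" rescaling of the survivors that real frameworks apply:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p=0.2, training=True):
        # Zero each activation with probability p, then scale the
        # survivors by 1/(1 - p) so the layer's expected output is
        # unchanged (the usual "inverted dropout" convention).
        if not training or p == 0.0:
            return activations
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)

    h = np.array([0.5, -1.2, 0.8, 0.3])
    print(dropout(h, p=0.5))  # roughly half the entries zeroed, the rest doubled

At inference time you pass training=False and the function is a no-op, which is why the rescaling happens during training rather than at test time.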
OP here -- I'm new at this, but I don't think so. A zero output from a neuron still contributes to the output. Taking a silly toy case, imagine a network with one input, one neuron, one output. You're training it to match data where, whatever the input is, it outputs 1 -- that is, your target state would have the weight set to zero and the bias set to 1. If it was initialised with the weight zero but the bias also zero, then when you pushed your test set through you'd get zero outputs, but the error would be non-zero and there would be adjustments to propagate back.
I could well be misunderstanding you, though!
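(A back-of-the-envelope version of that toy case, with an arbitrary input of x = 3, in case it helps:)

    # One input, one neuron: y = w*x + b, target output is always 1.
    # Initialise w = 0 and b = 0, as in the example above.
    x, target = 3.0, 1.0
    w, b = 0.0, 0.0

    y = w * x + b                # forward pass: 0.0
    loss = (y - target) ** 2     # 1.0 -- non-zero even though the output is zero

    dloss_dy = 2 * (y - target)  # -2.0
    dloss_db = dloss_dy          # -2.0 -> the bias gets nudged toward 1
    dloss_dw = dloss_dy * x      # -6.0 -> the zero weight still gets an update

    print(loss, dloss_dw, dloss_db)

So a zero parameter (or a zero output) doesn't make the gradient zero by itself; what matters is whether the error depends on it.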
Hey, thanks for the reply, and sorry for the belated answer.
That's a good example. I read up on backpropagation on Wikipedia again, and I think you're right there; I had some misunderstandings.
Eq. 4 of [1] says:
> However, if [neuron] j is in an arbitrary inner layer of the network, finding the derivative of [loss for one specific target value] E with respect to [output of j] o_j is less obvious. [...] Considering E as a function with the inputs being all neurons L = {u, v, …, w} receiving input from neuron j, [...] and taking the total derivative with respect to o_j, a recursive expression for the derivative is obtained:

> ∂E/∂o_j = Σ_{ℓ ∈ L} ( ∂E/∂o_ℓ · ∂o_ℓ/∂net_ℓ · w_jℓ )

> [where w_jℓ is the weight from neuron j to neuron ℓ.] Therefore, the derivative with respect to o_j can be calculated if all the derivatives with respect to the outputs o_ℓ of the next layer – the ones closer to the output neuron – are known.
So if a neuron is disabled through dropout, this would affect all neurons in the layer "before" it (i.e. closer to the input layer).
I think you could also argue that a dropped-out neuron has its set L artificially set to empty, so the sum in the formula reduces to zero. But that would indeed be something different from setting the weight to zero.
[1] https://en.wikipedia.org/wiki/Backpropagation#Derivation
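To make that concrete, here's a small numpy sketch of Eq. 4 for one hidden layer feeding two sigmoid output units with a squared-error loss (the shapes and numbers are made up for illustration), first as-is and then with a dropout mask applied to the hidden outputs:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    o = np.array([0.5, 0.2, 0.9])          # hidden outputs o_j
    W = np.array([[ 0.1, -0.3],
                  [ 0.4,  0.2],
                  [-0.5,  0.6]])            # W[j, l] = w_jl, weight from j to l
    target = np.array([1.0, 0.0])

    def grad_wrt_hidden(o_j):
        net = o_j @ W                       # net_l
        out = sigmoid(net)                  # o_l
        dE_dout = out - target              # dE/do_l for E = 0.5 * sum of squares
        dout_dnet = out * (1 - out)         # do_l/dnet_l for the sigmoid
        # Eq. 4: dE/do_j = sum over l of dE/do_l * do_l/dnet_l * w_jl
        return (dE_dout * dout_dnet) @ W.T

    print(grad_wrt_hidden(o))               # all three components non-zero

    mask = np.array([1.0, 0.0, 1.0])        # drop hidden unit 1
    # The mask multiplies o_1 in the forward pass, so the chain rule
    # multiplies every term of unit 1's Eq. 4 sum by zero going back:
    print(grad_wrt_hidden(o * mask) * mask)  # component 1 is exactly 0.0

Either way you read it -- L emptied out, or every term in the sum picking up a zero factor -- the dropped unit's gradient comes out as zero, while the surviving units' gradients are simply computed on the perturbed forward pass.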