I never did as much thinking or testing of dropout on transformers as the author, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.
Same here -- I was never able to figure out why dropout > 5% hurt convergence speed so badly for my toy LLMs. I chalked it up to the models not having enough parameters to fit FineWeb and just stopped using it.
> So you'd call the dropout function on the activations from each layer, zeroing out some at random so that they don't contribute to the "downstream" calculations. (As I understand it, this means that they are also not adjusted during back-propagation -- if nothing else, it would be terribly unfair to the poor ignored neurons to have their weights changed when they didn't contribute to the error.)
If the activations are effectively set to zero by the dropout, shouldn't the error propagated back through those neurons in the backward pass be zero too, automatically?
(I.e., as I understand it, OP's intuitive notion of "fairness" is literally how error propagation works: neurons are adjusted in proportion to how much they contributed to the output.)
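You can check this directly with a minimal PyTorch sketch (mine, not from the article, assuming standard nn.Dropout semantics): the gradient reaching a dropped activation is exactly zero, so nothing feeding into that unit gets adjusted on that sample.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Pretend these are the activations coming out of one layer.
    x = torch.randn(1, 8, requires_grad=True)

    drop = nn.Dropout(p=0.5)
    drop.train()  # dropout only zeroes activations in training mode

    y = drop(x)          # some entries zeroed, survivors scaled by 1/(1 - p)
    y.sum().backward()   # push a dummy "error" of 1 back through every output

    print(y)       # zeros where units were dropped
    print(x.grad)  # gradient is 0 at exactly those positions, 1/(1 - p) elsewhere

So the backward pass handles the "fairness" automatically: no gradient flows through a dropped unit for that forward pass, and its incoming weights are left untouched for that batch. The weights themselves aren't zeroed, though -- the same neuron participates normally the next time it isn't dropped.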
https://www.manning.com/books/build-a-large-language-model-f...
OP here -- that's the one! Highly recommended.