Transformer: Concept and code from scratch
Transformers are neural networks mainly used for sequence transduction tasks, i.e., tasks in which an input sequence is transformed into an output sequence. Most competitive neural sequence transduction models have an encoder-decoder structure: the encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next....
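To make the auto-regressive encoder-decoder loop concrete, here is a minimal sketch in Python. The "model" below (the `EMB` embedding, `W_OUT` projection, and the `encode`/`decode_step` helpers) is a hypothetical stand-in, not the post's actual Transformer; the point is only the shape of the loop: encode once, then repeatedly feed the symbols generated so far back into the decoder.

```python
import numpy as np

# Toy sizes and random "weights" -- illustrative assumptions only.
rng = np.random.default_rng(0)
VOCAB, D_MODEL, BOS, EOS = 16, 8, 1, 0
EMB = rng.standard_normal((VOCAB, D_MODEL))    # shared toy embedding
W_OUT = rng.standard_normal((D_MODEL, VOCAB))  # toy output projection

def encode(src_ids):
    """Stand-in encoder: input symbols -> continuous representations."""
    return EMB[src_ids]                         # (src_len, D_MODEL)

def decode_step(memory, generated_ids):
    """Stand-in decoder: score the next symbol given the encoder
    output (memory) and every symbol generated so far."""
    context = memory.mean(axis=0) + EMB[generated_ids].mean(axis=0)
    return context @ W_OUT                      # (VOCAB,) logits

def greedy_decode(src_ids, max_len=10):
    memory = encode(src_ids)                    # encoder runs once
    out = [BOS]
    for _ in range(max_len):                    # auto-regressive loop:
        next_id = int(decode_step(memory, np.array(out)).argmax())
        out.append(next_id)                     # feed the output back in
        if next_id == EOS:
            break
    return out

print(greedy_decode(np.array([3, 5, 7])))
```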
Convergence of gradient descent in over-parameterized networks
Neural networks typically have a very large number of parameters. A network is over-parameterized if it has more parameters than training instances, and under-parameterized otherwise. In either case, its loss function is a high-dimensional, often non-convex function of the parameters. In this post, we study over-parameterized neural networks and their loss landscape; we answer the question of why gradient descent (GD) and its variants converge to global minima in over-parameterized neural networks, even though the loss function is non-convex....
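The phenomenon is easy to observe numerically. Below is a minimal sketch, assuming a one-hidden-layer tanh network with far more parameters than training points (the sizes, learning rate, and step count are arbitrary illustrative choices, not values from the post): plain GD on the non-convex squared loss typically drives the training loss to near zero, i.e., to a global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 5, 200              # 10 samples, width 200: m*d + m = 1200 params >> n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) / np.sqrt(d)   # hidden layer weights
a = rng.standard_normal(m) / np.sqrt(m)        # output layer weights
lr = 0.05

for step in range(2000):
    H = np.tanh(X @ W.T)          # (n, m) hidden activations
    err = H @ a - y               # residuals of the predictions
    loss = 0.5 * np.mean(err ** 2)
    # Full-batch gradient descent on the (non-convex in W) squared loss.
    grad_a = H.T @ err / n
    grad_H = np.outer(err, a) / n
    grad_W = ((1 - H ** 2) * grad_H).T @ X     # tanh' = 1 - tanh^2
    a -= lr * grad_a
    W -= lr * grad_W

print(f"final loss: {loss:.2e}")  # typically ~0: GD reaches a global minimum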