Inputs are first passed through a fully connected layer and then into the two-layer residual multi-head attention block shown in Fig. 7. Residual networks (He et al., 2016) include shortcut connections that help keep neurons from suffering exploding or vanishing gradients during training. The fully connected layers in the residual
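A minimal sketch of this arrangement, assuming a PyTorch-style implementation; the layer sizes, class names, and use of layer normalization are illustrative choices, not specified in the text:

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Multi-head self-attention and a feed-forward sub-layer,
    each wrapped with a residual (skip) connection."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Skip connection around the self-attention sub-layer.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Skip connection around the feed-forward sub-layer.
        x = self.norm2(x + self.ff(x))
        return x

class Model(nn.Module):
    """Fully connected input projection followed by two residual
    multi-head attention blocks (dimensions are hypothetical)."""
    def __init__(self, in_dim: int = 16, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.input_fc = nn.Linear(in_dim, d_model)
        self.blocks = nn.Sequential(
            ResidualAttentionBlock(d_model, n_heads),
            ResidualAttentionBlock(d_model, n_heads),
        )

    def forward(self, x):
        return self.blocks(self.input_fc(x))

x = torch.randn(8, 10, 16)      # (batch, sequence, features)
print(Model()(x).shape)         # torch.Size([8, 10, 64])
```

The skip connections give gradients a direct path through each block, which is the property the residual design relies on to avoid vanishing or exploding gradients during training.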