10.1 Self-attention and multi-head attention mechanisms


Self-attention mechanisms revolutionize sequence modeling by allowing elements to interact dynamically. This powerful technique computes the importance of other elements for each element, enabling adaptive focus and modeling of long-range dependencies in various applications.

Scaled dot-product attention, the core of self-attention, efficiently computes similarities between query and key vectors. Multi-head attention in transformers further enhances model capacity by employing parallel attention mechanisms, each focusing on different aspects of the input.

Self-Attention Mechanisms

Concept of self-attention

  • Self-attention mechanism allows elements in a sequence to interact dynamically, capturing contextual relationships
  • Computes the importance of other elements for each element in the sequence, enabling adaptive focus
  • Enables modeling of long-range dependencies, overcoming limitations of recurrent neural networks (RNNs)
  • Key components include Query, Key, and Value vectors derived from input representations (see the sketch after this list)
  • Process involves computing similarity between query and key vectors, then weighting value vectors accordingly
  • Widely applied in natural language processing (machine translation), computer vision (image captioning), and speech recognition (audio transcription)
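
As a concrete illustration of how Query, Key, and Value vectors can be derived from input representations, here is a minimal PyTorch sketch; the dimensions and projection setup are assumptions chosen for illustration, not tied to any particular model.

```python
# Minimal sketch (PyTorch): deriving Query, Key, and Value vectors from input
# token representations via learned linear projections. Dimensions are assumed.
import torch
import torch.nn as nn

d_model = 64                       # embedding dimension (illustrative assumption)
seq_len = 5                        # number of tokens in the sequence
x = torch.randn(seq_len, d_model)  # input representations, one row per token

# Independent learned projections produce Q, K, and V from the same input.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)   # each has shape (seq_len, d_model)
print(Q.shape, K.shape, V.shape)
```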

Scaled dot-product attention mechanism

  • Formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  • Components: $Q$ (Query matrix), $K$ (Key matrix), $V$ (Value matrix), $d_k$ (dimension of key vectors)
  • Implementation steps:
    1. Compute dot product of query and key matrices
    2. Scale the result by $\frac{1}{\sqrt{d_k}}$ to stabilize gradients
    3. Apply softmax to obtain attention weights
    4. Multiply result with value matrix to get final output
  • Computational complexity: $O(n^2 d)$ time for sequence length $n$ and embedding dimension $d$, $O(n^2)$ space for the attention weights
  • Advantages include efficient matrix multiplication and stable gradients due to the scaling factor (a minimal implementation sketch follows this list)
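
A minimal PyTorch sketch of the four implementation steps above; the toy shapes in the usage example are assumptions for illustration.

```python
# Minimal sketch (PyTorch) of scaled dot-product attention, following the four
# steps listed above. Shapes in the usage example are illustrative assumptions.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # steps 1-2: dot product, then scale
    weights = torch.softmax(scores, dim=-1)            # step 3: softmax over keys -> attention weights
    return weights @ V, weights                        # step 4: weight the value vectors

# Usage with toy tensors: 5 tokens, key/value dimension 16.
Q = torch.randn(5, 16)
K = torch.randn(5, 16)
V = torch.randn(5, 16)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # torch.Size([5, 16]) torch.Size([5, 5])
```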

Multi-head attention in transformers

  • Parallel attention mechanisms use different learned linear projections for queries, keys, and values
  • Typically employs 8 to 16 heads, each focusing on different aspects of input (syntactic, semantic)
  • Process:
    1. Create linear projections of input for each head
    2. Apply scaled dot-product attention to each head independently
    3. Concatenate outputs from all heads
    4. Apply final linear transformation to produce output
  • Allows the model to jointly attend to information from different representation subspaces, enhancing model capacity
  • Dimension of each head: $d_{model} / h$, where $h$ is the number of heads
  • Computational cost remains similar to single-head attention due to the reduced dimensionality per head (see the code sketch after this list)
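
A minimal PyTorch sketch of this process, building on the scaled dot-product attention above; the choice of $d_{model} = 64$ and $h = 8$ heads is an illustrative assumption, and masking and dropout are omitted.

```python
# Minimal sketch (PyTorch) of multi-head attention: project, split into heads,
# attend per head, concatenate, then apply the output projection.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h   # dimension per head: d_model / h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # final linear transformation

    def forward(self, x):
        n, _ = x.shape
        # Step 1: linear projections, split into h heads of size d_head.
        def split(t):
            return t.view(n, self.h, self.d_head).transpose(0, 1)  # (h, n, d_head)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # Step 2: scaled dot-product attention applied to each head in parallel.
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ V                   # (h, n, d_head)
        # Steps 3-4: concatenate heads and apply the output projection.
        concat = heads.transpose(0, 1).reshape(n, self.h * self.d_head)
        return self.W_o(concat)

mha = MultiHeadAttention()
print(mha(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```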

Interpretation of attention weights

  • Higher weights indicate stronger relationships between elements, reflecting the importance of context for each token
  • Visualization techniques include heatmaps of attention weights and attention flow graphs (a toy heatmap sketch follows this list)
  • Analysis of attention patterns helps identify linguistic phenomena (coreference resolution) and understand model behavior
  • Visualization tools: BertViz for Transformer-based models, Tensor2tensor library for Transformer visualizations
  • Applications include model debugging, improving interpretability, and identifying biases in the model
  • Limitations: attention ≠ explanation; visualizations require careful interpretation to avoid misleading conclusions
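
A toy version of the heatmap visualization mentioned above, using randomly generated weights rather than a trained model's output; the token list is a made-up example.

```python
# Toy sketch (PyTorch + matplotlib) of an attention-weight heatmap; the tokens
# and weights here are illustrative, not the output of a real model.
import torch
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
scores = torch.randn(len(tokens), len(tokens))
weights = torch.softmax(scores, dim=-1)   # each row (one query token) sums to 1

plt.imshow(weights.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=45)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("Key (attended-to) token")
plt.ylabel("Query token")
plt.colorbar(label="Attention weight")
plt.title("Toy attention heatmap")
plt.tight_layout()
plt.show()
```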

Key Terms to Review (18)

Attention Heads: Attention heads are individual components within a multi-head attention mechanism that allow models to focus on different parts of the input data simultaneously. Each attention head learns to capture unique patterns and relationships by applying self-attention, which enables the model to gather diverse information from various positions in the input sequence, enhancing its overall performance in tasks such as natural language processing.
BERT: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a state-of-the-art model developed by Google for natural language processing tasks. It leverages the transformer architecture to understand the context of words in a sentence by considering their bidirectional relationships, making it highly effective in various language understanding tasks such as sentiment analysis and named entity recognition.
Contextual embeddings: Contextual embeddings are representations of words or phrases that capture their meanings based on the surrounding context in which they appear. This approach differs from traditional word embeddings, as it generates unique embeddings for the same word depending on its context in a sentence, allowing for a better understanding of nuances and relationships between words.
Cross-attention: Cross-attention is a mechanism in deep learning models that allows one set of inputs to focus on another set of inputs, enhancing the model's ability to integrate information across different sources. This process is crucial in tasks where context from multiple modalities or sequences is needed, allowing models to better capture dependencies and relationships between diverse data elements. It plays a significant role in improving performance in various applications such as natural language processing and computer vision.
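A minimal sketch of cross-attention under an assumed translation-style setup, where queries come from decoder states and keys/values come from encoder outputs; all shapes and names are illustrative.

```python
# Minimal sketch (PyTorch) of cross-attention: queries come from one sequence
# while keys and values come from another. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 32
decoder_states = torch.randn(4, d_model)   # 4 target-side positions
encoder_outputs = torch.randn(7, d_model)  # 7 source-side positions

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(decoder_states)                             # queries from the target sequence
K, V = W_k(encoder_outputs), W_v(encoder_outputs)   # keys/values from the source sequence

weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)  # (4, 7): each target position attends over sources
context = weights @ V                                      # (4, 32): source information gathered per target position
print(weights.shape, context.shape)
```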
Dynamic weighting: Dynamic weighting is a technique used in machine learning, particularly in attention mechanisms, where the importance or weight assigned to different input features can change based on the context or the specific data being processed. This approach allows models to focus on the most relevant parts of the input, improving the performance of tasks such as translation, summarization, and image captioning. By adapting weights dynamically, the model can more effectively capture dependencies and relationships within the data.
Encoder-decoder architecture: The encoder-decoder architecture is a framework commonly used in deep learning models, particularly for tasks that involve sequence-to-sequence prediction. This structure consists of two main components: the encoder, which processes the input data and compresses it into a context representation, and the decoder, which takes this representation to generate the output sequence. This setup is essential in applications like translation and speech recognition, where understanding the input context and generating a coherent output is crucial.
Information Bottleneck: The information bottleneck is a concept that describes the trade-off between the amount of relevant information retained from a source and the compression of that information into a more compact representation. This idea is crucial in understanding how models can effectively capture essential patterns while minimizing irrelevant details, especially in scenarios involving high-dimensional data. The aim is to preserve the most informative features while discarding noise, which is particularly relevant when dealing with attention mechanisms in deep learning systems.
Linear transformations: Linear transformations are mathematical functions that map input vectors to output vectors while preserving the operations of vector addition and scalar multiplication. This means if you take two input vectors and add them or multiply one by a scalar, the transformation will maintain those relationships in the output. In the context of self-attention and multi-head attention mechanisms, linear transformations are crucial because they help transform the input data into different representations that can be processed in parallel, allowing for more effective learning from complex data structures.
Long-range dependencies: Long-range dependencies refer to the connections between elements in a sequence that are far apart from each other, which can significantly affect the understanding or prediction of that sequence. In various deep learning contexts, capturing these dependencies is crucial for tasks involving sequential data, such as language modeling and time series forecasting, where understanding context from distant elements is necessary. Properly handling long-range dependencies allows models to maintain relevant information over longer sequences, improving performance and accuracy in various applications.
Multi-head attention: Multi-head attention is a mechanism that enhances the self-attention process by using multiple attention heads to capture different aspects of the input data simultaneously. This allows the model to focus on various positions in the input sequence and gather richer contextual information. By combining these multiple heads, the model can learn intricate relationships within the data, leading to improved performance in tasks such as translation and text generation.
Parallelization: Parallelization is the process of dividing a computational task into smaller sub-tasks that can be processed simultaneously across multiple computing resources. This technique is essential for improving efficiency and reducing the time it takes to train models, especially when dealing with large datasets or complex algorithms. It helps in harnessing the power of modern hardware, such as multi-core processors and GPUs, to execute tasks concurrently, significantly speeding up computations.
Positional Encoding: Positional encoding is a technique used in deep learning, particularly in transformer models, to inject information about the position of elements in a sequence into the model. Unlike traditional recurrent networks that inherently capture sequence order through their architecture, transformers process all elements simultaneously, necessitating a method to retain positional context. By adding unique positional encodings to input embeddings, the model learns to understand the relative positions of tokens in a sequence, which is crucial for tasks involving sequential data.
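A minimal sketch of the sinusoidal positional encoding used in the original Transformer, added to input embeddings; the sequence length and embedding dimension are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of sinusoidal positional encoding added to embeddings.
import math
import torch

def sinusoidal_positional_encoding(max_len=50, d_model=64):
    pos = torch.arange(max_len).unsqueeze(1).float()            # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))           # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions use cosine
    return pe

embeddings = torch.randn(50, 64)
x = embeddings + sinusoidal_positional_encoding()  # positional context injected into inputs
print(x.shape)  # torch.Size([50, 64])
```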
Query-key-value mechanism: The query-key-value mechanism is a fundamental component in deep learning, particularly in self-attention models, which allows the model to weigh the importance of different parts of the input data. It operates by transforming input data into three distinct representations: queries, keys, and values, where queries are used to retrieve relevant information from the keys and produce contextually aware outputs based on the corresponding values. This mechanism enables models to focus on specific parts of the input, enhancing their ability to process and understand complex relationships within the data.
Receptive Field: A receptive field refers to the specific region of the input space in which a stimulus will affect the activity of a neuron or a unit in a neural network. In the context of convolutional neural networks (CNNs), it indicates how much of the input image contributes to the computation of a particular feature map, helping to extract hierarchical features. The size and characteristics of the receptive field are crucial for determining how well a model can understand spatial relationships and dependencies within data.
Scaled dot-product attention: Scaled dot-product attention is a mechanism used in deep learning models, particularly in the context of natural language processing, that computes the attention scores between a set of queries and a set of keys. It helps to determine the relevance of different inputs by measuring the alignment between queries and keys, scaling the scores to prevent large values from causing instability during softmax computation. This attention mechanism is a fundamental component of self-attention and multi-head attention, enabling models to focus on different parts of the input sequence effectively.
Self-attention: Self-attention is a mechanism that allows a model to weigh the importance of different words in a sequence relative to each other when processing input data. This helps capture relationships and dependencies between words, making it essential for understanding context in natural language processing tasks. It forms the backbone of various models, enabling them to handle long-range dependencies and complex interactions within sequences.
Softmax function: The softmax function is a mathematical function that converts a vector of raw scores into probabilities that sum to one. It is commonly used in machine learning models, particularly in classification tasks, as it helps to interpret the output layer of neural networks by representing class predictions in a probabilistic format. This function emphasizes the largest values and suppresses smaller ones, allowing for a clear distinction among different classes.
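A small worked example of the softmax function applied to a vector of raw scores; the numbers are purely illustrative.

```python
# Worked example (PyTorch): softmax turns raw scores into probabilities summing to one.
import torch

scores = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(scores, dim=0)
print(probs)        # approximately tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())  # tensor(1.) -- the largest score dominates, smaller ones are suppressed
```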
Transformer model: The transformer model is a type of neural network architecture that uses self-attention mechanisms to process input data in parallel, making it highly effective for sequence-to-sequence tasks like natural language processing. This model revolutionized the way we handle data by allowing the system to weigh the importance of different words or tokens in relation to each other, regardless of their position in the input sequence. Its innovative design includes both encoder and decoder components, which work together to understand and generate outputs based on complex input information.