Explore the inner workings of modern transformer neural networks
Each encoder layer consists of a multi-head self-attention sub-layer followed by a position-wise feed-forward network, with a residual connection and layer normalization applied around each sub-layer.
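To make this concrete, below is a minimal sketch of one encoder layer in PyTorch. The class name EncoderLayer and the default sizes (d_model=512, n_heads=8, d_ff=2048) are illustrative choices rather than values from any particular model, and the post-norm ordering follows the original Transformer.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + feed-forward, each sub-layer
    wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: multi-head self-attention, then residual + LayerNorm.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then residual + LayerNorm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```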
Multi-head attention allows the model to jointly attend to information from different representation subspaces. It is computed by running several attention functions in parallel, one per head, then concatenating the results and projecting them through a learned output matrix.
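The sketch below spells that computation out, reusing the illustrative sizes from the encoder example: a single linear layer produces Q, K, and V, the tensors are split into heads, scaled dot-product attention runs on every head in parallel, and the per-head outputs are concatenated and projected back to the model dimension.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention written out explicitly."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project once, split into Q, K, V, and reshape to (batch, heads, seq, d_head).
        qkv = self.qkv_proj(x).reshape(batch, seq_len, 3, self.n_heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention, computed for all heads at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        heads = weights @ v                                # (batch, heads, seq, d_head)
        # Concatenate the heads and apply the output projection.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(concat)
```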
Flash Attention is an optimization that computes exact attention without ever materializing the full attention matrix: queries, keys, and values are processed in tiles small enough to stay in fast on-chip memory, which cuts reads and writes to slower GPU memory and speeds up both training and inference.
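The snippet below is only a plain-PyTorch illustration of that tiling idea, not the real fused kernel: scores are computed one block at a time with a running ("online") softmax, so the full seq_len-by-seq_len score matrix is never stored. The function name tiled_attention and block_size=64 are arbitrary. In recent PyTorch releases, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel when the hardware and inputs allow it.

```python
import math
import torch

def tiled_attention(q, k, v, block_size=64):
    """Illustrative tiled attention with an online softmax (the core idea
    behind Flash Attention, minus the fused GPU kernel)."""
    seq_len, d_head = q.shape[-2], q.shape[-1]
    scale = 1.0 / math.sqrt(d_head)
    out = torch.zeros_like(q)

    for qs in range(0, seq_len, block_size):
        q_blk = q[..., qs:qs + block_size, :]
        m = torch.full(q_blk.shape[:-1], float("-inf"), dtype=q.dtype, device=q.device)  # running row max
        l = torch.zeros(q_blk.shape[:-1], dtype=q.dtype, device=q.device)                # running softmax denominator
        acc = torch.zeros_like(q_blk)                                                    # running weighted sum of V

        for ks in range(0, seq_len, block_size):
            k_blk = k[..., ks:ks + block_size, :]
            v_blk = v[..., ks:ks + block_size, :]
            s = (q_blk @ k_blk.transpose(-2, -1)) * scale    # scores for this tile only
            m_new = torch.maximum(m, s.amax(dim=-1))
            p = torch.exp(s - m_new.unsqueeze(-1))
            correction = torch.exp(m - m_new)                # rescale older statistics
            l = l * correction + p.sum(dim=-1)
            acc = acc * correction.unsqueeze(-1) + p @ v_blk
            m = m_new

        out[..., qs:qs + block_size, :] = acc / l.unsqueeze(-1)
    return out

# Quick check against ordinary (full-matrix) attention.
q = torch.randn(2, 128, 64)   # (batch, seq_len, d_head), arbitrary sizes
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
ref = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(64), dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))
```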
Quantization reduces the precision of the model weights (e.g., from 32-bit floats to 8-bit or 4-bit integers) to shrink memory usage and speed up inference, usually at a small cost in accuracy.
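As a concrete example, here is a sketch of symmetric per-tensor int8 weight quantization. Real deployments typically use per-channel or group-wise scales and calibration data; the 4096x4096 weight matrix below is just a stand-in.

```python
import torch

def quantize_int8(weight):
    """Map float32 weights onto [-127, 127] with a single scale factor
    (symmetric, per-tensor quantization)."""
    scale = weight.abs().max() / 127.0
    q_weight = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q_weight, scale

def dequantize_int8(q_weight, scale):
    """Recover an approximation of the original float weights."""
    return q_weight.to(torch.float32) * scale

# Quantize one (hypothetical) weight matrix and look at the trade-off.
w = torch.randn(4096, 4096)
q_w, scale = quantize_int8(w)
w_hat = dequantize_int8(q_w, scale)
print("max abs error:", (w - w_hat).abs().max().item())
print("memory: %.0f MB -> %.0f MB" % (w.numel() * 4 / 2**20, q_w.numel() / 2**20))
```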