Estimating model parameters for transformer models
Author: Asim Kadav
Self-Attention Mechanism Parameters Calculation
For recent projects, I needed a simple way of *estimating* the total parameter count of Transformer models. Below I describe a simple method (assembled from well-known counting techniques) and compare its estimates against the real-world parameter counts reported for the GPT* models.
In the self-attention mechanism of the transformer, the input representation is projected into three different representations: Query (Wq), Key (Wk), and Value (Wv).
Each of these projections has its own weight matrix.
Given:
- dmodel: the dimension of the model's input (or the embedding size).
- dhead: the dimension of each attention head.
- nheads: the number of attention heads.
The projections are as follows:
- Wq: dmodel × (dhead × nheads)
- Wk: dmodel × (dhead × nheads)
- Wv: dmodel × (dhead × nheads)
In many transformer models (e.g., GPT, BERT), dhead × nheads = dmodel. Given this relationship, each of these matrices has size:
- Wq: dmodel × dmodel
- Wk: dmodel × dmodel
- Wv: dmodel × dmodel
So, combining the three matrices, the total number of parameters for the Query, Key, and Value projections in the self-attention mechanism is: 3 × (dmodel × dmodel)
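Many implementations (GPT-2-style code, for example) fuse the three projections into a single dmodel × 3·dmodel matrix, which has exactly the same parameter count. The snippet below is a minimal sanity check using PyTorch (PyTorch is not part of the original derivation; the values are just an illustration):

import torch.nn as nn

d_model = 768  # example size; any value works
# A fused Q/K/V projection: one d_model x 3*d_model weight matrix, no bias
qkv = nn.Linear(d_model, 3 * d_model, bias=False)
assert sum(p.numel() for p in qkv.parameters()) == 3 * d_model**2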
Self-Attention Mechanism:
- Query (Wq), Key (Wk), and Value (Wv) weights: 3 × (dmodel × dmodel)
- Output weight (Wo): dmodel × dmodel
- Biases for each of the above: 4 × dmodel
Total for Self-Attention: 4 × dmodel^2 + 4 × dmodel
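As a quick check on the full attention count, here is a minimal PyTorch sketch (again an illustration, not code from the original post) that builds the four dmodel × dmodel projections with their biases and confirms the total:

import torch.nn as nn

d_model = 768  # example size
# Q, K, V, and output projections, each d_model x d_model with a bias vector
attn = nn.ModuleDict({name: nn.Linear(d_model, d_model) for name in ["q", "k", "v", "o"]})
n_params = sum(p.numel() for p in attn.parameters())
assert n_params == 4 * d_model**2 + 4 * d_model  # 2,362,368 for d_model = 768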
Feed-Forward Neural Network:
- First layer (expansion): dmodel × 4dmodel = 4 × dmodel^2 weights
- Second layer (compression): 4dmodel × dmodel = 4 × dmodel^2 weights
- Biases: 4 × dmodel for the first layer and dmodel for the second, i.e. 5 × dmodel
Total for Feed-Forward Network: 8 × dmodel^2 + 5 × dmodel
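A similar PyTorch sketch for the feed-forward block, using the usual 4× expansion (the GELU is just a placeholder nonlinearity and adds no parameters):

import torch.nn as nn

d_model = 768  # example size
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expansion: 4*d_model^2 weights + 4*d_model biases
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),  # compression: 4*d_model^2 weights + d_model biases
)
n_params = sum(p.numel() for p in ffn.parameters())
assert n_params == 8 * d_model**2 + 5 * d_model  # 4,722,432 for d_model = 768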
Layer Normalization (typically two per block):
- Scale and shift: 2 × 2 × dmodel
Total for Layer Norm: 4 × dmodel
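And the same check for a single layer norm (PyTorch's nn.LayerNorm keeps one scale and one shift vector by default):

import torch.nn as nn

d_model = 768  # example size
ln = nn.LayerNorm(d_model)  # learnable scale (weight) and shift (bias), d_model each
assert sum(p.numel() for p in ln.parameters()) == 2 * d_model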
Combined:
Total parameters per block: 12 × dmodel^2 + 13 × dmodel
For a model with nlayers layers, the total parameters just from the blocks would be:
Pblocks = nlayers × (12 × dmodel^2 + 13 × dmodel)
If you include the token embeddings:
Ptokens = ntokens × dmodel
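The embedding term can be checked the same way: an nn.Embedding table with ntokens rows of dimension dmodel holds exactly ntokens × dmodel parameters (the 50,000-token vocabulary below is an assumption used throughout this post, not a value taken from any specific model):

import torch.nn as nn

n_tokens, d_model = 50_000, 768
tok_emb = nn.Embedding(n_tokens, d_model)  # one d_model-dimensional vector per token
assert sum(p.numel() for p in tok_emb.parameters()) == n_tokens * d_model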
The total parameters for the model, excluding any other specialized embeddings or positional encodings, would be:
Ptotal = Pblocks + Ptokens
Note: This equation provides a simplified breakdown of where the parameters come from. However, specific implementations might have slight variations. Always refer to the architecture specifics for precise counts.
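Putting the pieces together, here is a small pure-Python helper that implements Ptotal = Pblocks + Ptokens (the function names params_per_block and estimate_params are my own, introduced only for this sketch):

def params_per_block(d_model):
    attention = 4 * d_model**2 + 4 * d_model     # Wq, Wk, Wv, Wo weights + biases
    feed_forward = 8 * d_model**2 + 5 * d_model  # two linear layers + their biases
    layer_norms = 4 * d_model                    # two layer norms, scale + shift each
    return attention + feed_forward + layer_norms  # = 12 * d_model^2 + 13 * d_model

def estimate_params(d_model, n_layers, n_tokens):
    # P_total = P_blocks + P_tokens
    return n_layers * params_per_block(d_model) + n_tokens * d_model

print(estimate_params(d_model=768, n_layers=12, n_tokens=50_000))  # 123454464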
Parameters Calculation for GPT Models
Let's use our formula to estimate the model parameters and compare the results with the counts reported in the literature.
- GPT-1:
- Parameters: 117 Million
- Decoder Layers: 12
- Model dimension (\(d_{\text{model}}\)): 768
- GPT-2:
- Parameters: 1.5 Billion
- Decoder Layers: 48
- Model dimension (\(d_{\text{model}}\)): 1600
- GPT-3:
- Parameters: 175 Billion
- Decoder Layers: 96
- Model dimension (\(d_{\text{model}}\)): 12288
Using these specifications, the per-block formula derived above, and an assumed vocabulary of 50,000 tokens, let's calculate the parameters contributed by each model's transformer blocks and token embeddings.
GPT-1
Parameters per block (Pblock): 12 × 768^2 + 13 × 768 = 7,087,872
Parameters from embeddings (Pembeddings): 50,000 × 768 = 38,400,000
Total parameters (Ptotal): 12 × Pblock + Pembeddings
GPT-2
Parameters per block (Pblock): 12 × 1600^2 + 13 × 1600 = 30,740,800
Parameters from embeddings (Pembeddings): 50,000 × 1600 = 80,000,000
Total parameters (Ptotal): 48 × Pblock + Pembeddings
GPT-3
Parameters per block (Pblock): 12 × 12288^2 + 13 × 12288 = 1,812,099,072
Parameters from embeddings (Pembeddings): 50,000 × 12288 = 614,400,000
Total parameters (Ptotal): 96 × Pblock + Pembeddings
Evaluating these expressions gives:
- GPT-1: Approximately 123.5 million parameters
- GPT-2: Approximately 1.56 billion parameters
- GPT-3: Approximately 174.6 billion parameters
Comparing these with the reported numbers:
- GPT-1: Reported 117 million vs. calculated ~123.5 million
- GPT-2: Reported 1.5 billion vs. calculated ~1.56 billion
- GPT-3: Reported 175 billion vs. calculated ~174.6 billion
The remaining gaps come mostly from the assumed 50,000-token vocabulary (GPT-1's BPE vocabulary is closer to 40,000 tokens, while GPT-2 and GPT-3 use roughly 50,257) and from smaller components the formula ignores, such as positional embeddings.
# Code for the formulas and calculations in text format for each model
V = 50000  # assumed vocabulary size (n_tokens)
models_specs = {
    "GPT-1": {"d_model": 768, "n_layers": 12},
    "GPT-2": {"d_model": 1600, "n_layers": 48},
    "GPT-3": {"d_model": 12288, "n_layers": 96},
}

model_params_count = {}
for model, specs in models_specs.items():
    d_model = specs["d_model"]
    n_layers = specs["n_layers"]
    # Breakdown of the formula components for each model
    model_params_count[model] = {
        "P_block": f"12 x {d_model}^2 + 13 x {d_model}",
        "P_embeddings": f"{V} x {d_model}",
        "P_total": f"{n_layers} x (P_block) + P_embeddings",
    }
{'GPT-1': {'P_block': '12 x 768^2 + 13 x 768',
  'P_embeddings': '50000 x 768',
  'P_total': '12 x (P_block) + P_embeddings'},
 'GPT-2': {'P_block': '12 x 1600^2 + 13 x 1600',
  'P_embeddings': '50000 x 1600',
  'P_total': '48 x (P_block) + P_embeddings'},
 'GPT-3': {'P_block': '12 x 12288^2 + 13 x 12288',
  'P_embeddings': '50000 x 12288',
  'P_total': '96 x (P_block) + P_embeddings'}}
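To turn the symbolic breakdown above into actual numbers, the short loop below plugs the same 12 × d_model^2 + 13 × d_model expression into the specs (it assumes the models_specs and V defined in the previous snippet; the rounding is only for display):

for model, specs in models_specs.items():
    d_model, n_layers = specs["d_model"], specs["n_layers"]
    p_block = 12 * d_model**2 + 13 * d_model   # per-block parameters
    total = n_layers * p_block + V * d_model   # blocks + token embeddings
    print(f"{model}: {total:,} (~{total / 1e9:.2f}B)")

# GPT-1: 123,454,464 (~0.12B)
# GPT-2: 1,555,558,400 (~1.56B)
# GPT-3: 174,575,910,912 (~174.58B)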