Estimating model parameters for transformer models
Author: Asim Kadav
Self-Attention Mechanism Parameters Calculation
For recent projects, I needed a simple way of *estimating* the total parameter count of Transformer models. Below I describe a simple method (assembled from well-known counting techniques) and compare its estimates against the real-world parameter counts reported for the GPT* models.
In the self-attention mechanism of the transformer, the input representation is projected into three different representations: Query (Wq), Key (Wk), and Value (Wv).
Each of these projections has its own weight matrix.
Given:
- dmodel: the dimension of the model's input (or the embedding size).
- dhead: the dimension of each attention head.
- nheads: the number of attention heads.
The projections are as follows:
- Wq: dmodel × (dhead × nheads)
- Wk: dmodel × (dhead × nheads)
- Wv: dmodel × (dhead × nheads)
In many transformer models (e.g., GPT, BERT), dhead × nheads = dmodel. Given this relationship, each of these matrices has size:
- Wq: dmodel × dmodel
- Wk: dmodel × dmodel
- Wv: dmodel × dmodel
So, combining the three matrices, the total number of parameters for the Query, Key, and Value projections in the self-attention mechanism is: 3 × (dmodel × dmodel)
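Many implementations (GPT-2-style code, for example) fuse the three projections into a single dmodel × 3·dmodel matrix, which has exactly the same parameter count. The snippet below is a minimal sanity check using PyTorch (PyTorch is not part of the original derivation; the values are just an illustration):

import torch.nn as nn

d_model = 768  # example size; any value works
# A fused Q/K/V projection: one d_model x 3*d_model weight matrix, no bias
qkv = nn.Linear(d_model, 3 * d_model, bias=False)
assert sum(p.numel() for p in qkv.parameters()) == 3 * d_model**2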
Self-Attention Mechanism:
- Query (Wq), Key (Wk), and Value (Wv) weights: 3 × (dmodel × dmodel)
- Output weight (Wo): dmodel × dmodel
- Biases for each of the above: 4 × dmodel
Total for Self-Attention: 4 × dmodel^2 + 4 × dmodel
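As a quick check on the full attention count, here is a minimal PyTorch sketch (again an illustration, not code from the original post) that builds the four dmodel × dmodel projections with their biases and confirms the total:

import torch.nn as nn

d_model = 768  # example size
# Q, K, V, and output projections, each d_model x d_model with a bias vector
attn = nn.ModuleDict({name: nn.Linear(d_model, d_model) for name in ["q", "k", "v", "o"]})
n_params = sum(p.numel() for p in attn.parameters())
assert n_params == 4 * d_model**2 + 4 * d_model  # 2,362,368 for d_model = 768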
Feed-Forward Neural Network:
- First layer (expansion): dmodel × 4dmodel = 4 × dmodel^2 weights
- Second layer (compression): 4dmodel × dmodel = 4 × dmodel^2 weights
- Biases: 4 × dmodel for the first layer and dmodel for the second, i.e. 5 × dmodel
Total for Feed-Forward Network: 8 × dmodel^2 + 5 × dmodel
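A similar PyTorch sketch for the feed-forward block, using the usual 4× expansion (the GELU is just a placeholder nonlinearity and adds no parameters):

import torch.nn as nn

d_model = 768  # example size
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expansion: 4*d_model^2 weights + 4*d_model biases
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),  # compression: 4*d_model^2 weights + d_model biases
)
n_params = sum(p.numel() for p in ffn.parameters())
assert n_params == 8 * d_model**2 + 5 * d_model  # 4,722,432 for d_model = 768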
Layer Normalization (typically two per block):
- Scale and shift: 2 × 2 × dmodel
Total for Layer Norm: 4 × dmodel
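And the same check for a single layer norm (PyTorch's nn.LayerNorm keeps one scale and one shift vector by default):

import torch.nn as nn

d_model = 768  # example size
ln = nn.LayerNorm(d_model)  # learnable scale (weight) and shift (bias), d_model each
assert sum(p.numel() for p in ln.parameters()) == 2 * d_model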
Combined:
Total parameters per block: 12 × dmodel^2 + 13 × dmodel
For a model with nlayers layers, the total parameters just from the blocks would be:
Pblocks = nlayers × (12 × dmodel^2 + 13 × dmodel)
If you include the token embeddings:
Ptokens = ntokens × dmodel
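The embedding term can be checked the same way: an nn.Embedding table with ntokens rows of dimension dmodel holds exactly ntokens × dmodel parameters (the 50,000-token vocabulary below is an assumption used throughout this post, not a value taken from any specific model):

import torch.nn as nn

n_tokens, d_model = 50_000, 768
tok_emb = nn.Embedding(n_tokens, d_model)  # one d_model-dimensional vector per token
assert sum(p.numel() for p in tok_emb.parameters()) == n_tokens * d_model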
The total parameters for the model, excluding any other specialized embeddings or positional encodings, would be:
Ptotal = Pblocks + Ptokens
Note: This equation provides a simplified breakdown of where the parameters come from. However, specific implementations might have slight variations. Always refer to the architecture specifics for precise counts.
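Putting the pieces together, here is a small pure-Python helper that implements Ptotal = Pblocks + Ptokens (the function names params_per_block and estimate_params are my own, introduced only for this sketch):

def params_per_block(d_model):
    attention = 4 * d_model**2 + 4 * d_model     # Wq, Wk, Wv, Wo weights + biases
    feed_forward = 8 * d_model**2 + 5 * d_model  # two linear layers + their biases
    layer_norms = 4 * d_model                    # two layer norms, scale + shift each
    return attention + feed_forward + layer_norms  # = 12 * d_model^2 + 13 * d_model

def estimate_params(d_model, n_layers, n_tokens):
    # P_total = P_blocks + P_tokens
    return n_layers * params_per_block(d_model) + n_tokens * d_model

print(estimate_params(d_model=768, n_layers=12, n_tokens=50_000))  # 123454464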
Parameters Calculation for GPT Models
Let's use our formula to estimate the model parameters and compare the results with the counts reported in the literature.
- GPT-1:
- Parameters: 117 Million
- Decoder Layers: 12
- Model dimension (\(d_{\text{model}}\)): 768
- GPT-2:
- Parameters: 1.5 Billion
- Decoder Layers: 48
- Model dimension (\(d_{\text{model}}\)): 1600
- GPT-3:
- Parameters: 175 Billion
- Decoder Layers: 96
- Model dimension (\(d_{\text{model}}\)): 12288
Using these specifications, the per-block formula derived above, and an assumed vocabulary of 50,000 tokens, let's calculate the parameters contributed by each model's transformer blocks and token embeddings.
GPT-1
Parameters per block (Pblock): 12 × 768^2 + 13 × 768 = 7,087,872
Parameters from embeddings (Pembeddings): 50,000 × 768 = 38,400,000
Total parameters (Ptotal): 12 × Pblock + Pembeddings
GPT-2
Parameters per block (Pblock): 12 × 1600^2 + 13 × 1600 = 30,740,800
Parameters from embeddings (Pembeddings): 50,000 × 1600 = 80,000,000
Total parameters (Ptotal): 48 × Pblock + Pembeddings
GPT-3
Parameters per block (Pblock): 12 × 12288^2 + 13 × 12288 = 1,812,099,072
Parameters from embeddings (Pembeddings): 50,000 × 12288 = 614,400,000
Total parameters (Ptotal): 96 × Pblock + Pembeddings
Evaluating these expressions gives:
- GPT-1: Approximately 123.5 million parameters
- GPT-2: Approximately 1.56 billion parameters
- GPT-3: Approximately 174.6 billion parameters
Comparing these with the reported numbers:
- GPT-1: Reported 117 million vs. calculated ~123.5 million
- GPT-2: Reported 1.5 billion vs. calculated ~1.56 billion
- GPT-3: Reported 175 billion vs. calculated ~174.6 billion
The remaining gaps come mostly from the assumed 50,000-token vocabulary (GPT-1's BPE vocabulary is closer to 40,000 tokens, while GPT-2 and GPT-3 use roughly 50,257) and from smaller components the formula ignores, such as positional embeddings.
# Code for the formulas and calculations in text format for each model
V = 50000  # assumed vocabulary size (n_tokens)
models_specs = {
    "GPT-1": {"d_model": 768, "n_layers": 12},
    "GPT-2": {"d_model": 1600, "n_layers": 48},
    "GPT-3": {"d_model": 12288, "n_layers": 96},
}

model_params_count = {}
for model, specs in models_specs.items():
    d_model = specs["d_model"]
    n_layers = specs["n_layers"]
    # Breakdown of the formula components for each model
    model_params_count[model] = {
        "P_block": f"12 x {d_model}^2 + 13 x {d_model}",
        "P_embeddings": f"{V} x {d_model}",
        "P_total": f"{n_layers} x (P_block) + P_embeddings",
    }
{'GPT-1': {'P_block': '12 x 768^2 + 13 x 768',
  'P_embeddings': '50000 x 768',
  'P_total': '12 x (P_block) + P_embeddings'},
 'GPT-2': {'P_block': '12 x 1600^2 + 13 x 1600',
  'P_embeddings': '50000 x 1600',
  'P_total': '48 x (P_block) + P_embeddings'},
 'GPT-3': {'P_block': '12 x 12288^2 + 13 x 12288',
  'P_embeddings': '50000 x 12288',
  'P_total': '96 x (P_block) + P_embeddings'}}
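To turn the symbolic breakdown above into actual numbers, the short loop below plugs the same 12 × d_model^2 + 13 × d_model expression into the specs (it assumes the models_specs and V defined in the previous snippet; the rounding is only for display):

for model, specs in models_specs.items():
    d_model, n_layers = specs["d_model"], specs["n_layers"]
    p_block = 12 * d_model**2 + 13 * d_model   # per-block parameters
    total = n_layers * p_block + V * d_model   # blocks + token embeddings
    print(f"{model}: {total:,} (~{total / 1e9:.2f}B)")

# GPT-1: 123,454,464 (~0.12B)
# GPT-2: 1,555,558,400 (~1.56B)
# GPT-3: 174,575,910,912 (~174.58B)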