Estimating model parameters for transformer models

Author: Asim Kadav

Self-Attention Mechanism Parameters Calculation

For recent projects, I needed a simple way of *estimating* the total parameter count of Transformer models. Below, I describe a method (building on other well-known techniques) and compare its estimates with the real-world parameter counts of the GPT* models.

In the self-attention mechanism of the transformer, the input representation is projected into three different representations: Query, Key, and Value.

Each of these projections has its own weight matrix: \(W_q\), \(W_k\), and \(W_v\).

Given:

  • \(d_{\text{model}}\): the dimensionality of the model's input (and output) representation
  • \(n_{\text{heads}}\): the number of attention heads
  • \(d_{\text{head}}\): the dimensionality of each attention head

The projections are as follows:

  • \(W_q\): of shape \(d_{\text{model}} \times (d_{\text{head}} \times n_{\text{heads}})\)
  • \(W_k\): of shape \(d_{\text{model}} \times (d_{\text{head}} \times n_{\text{heads}})\)
  • \(W_v\): of shape \(d_{\text{model}} \times (d_{\text{head}} \times n_{\text{heads}})\)

Typically in many transformer models (like GPT and BERT), \(d_{\text{head}} \times n_{\text{heads}} = d_{\text{model}}\). Given this relationship, each of the matrices becomes \(d_{\text{model}} \times d_{\text{model}}\).

So, combining the three matrices, the total number of parameters (weights only) for the Query, Key, and Value projections in the self-attention mechanism is \(3 \times (d_{\text{model}} \times d_{\text{model}}) = 3 \times d_{\text{model}}^2\).
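This is easy to sanity-check in PyTorch; a minimal sketch, assuming bias-free projection layers and GPT-1-like sizes (my choice of example, not anything the estimate depends on):

import torch.nn as nn

d_model, n_heads = 768, 12   # GPT-1-like sizes; d_head = d_model // n_heads = 64
# One bias-free linear layer per projection: Wq, Wk, Wv
qkv = [nn.Linear(d_model, d_model, bias=False) for _ in range(3)]
n_params = sum(p.numel() for layer in qkv for p in layer.parameters())
print(n_params, 3 * d_model**2)   # both print 1769472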

Self-Attention Mechanism:

  • Query, Key, and Value projections (weights and biases): \(3 \times (d_{\text{model}}^2 + d_{\text{model}})\)
  • Output projection \(W_o\) (weights and bias): \(d_{\text{model}}^2 + d_{\text{model}}\)

Total for Self-Attention: \(4 \times d_{\text{model}}^2 + 4 \times d_{\text{model}}\)
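PyTorch's nn.MultiheadAttention bundles exactly these four projections (with biases), so it can serve as a quick cross-check; a minimal sketch, again with GPT-1-like example sizes:

import torch.nn as nn

d_model, n_heads = 768, 12
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)  # defaults: biases on, output projection included
total = sum(p.numel() for p in mha.parameters())
print(total, 4 * d_model**2 + 4 * d_model)   # both print 2362368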

Feed-Forward Neural Network (hidden dimension \(4 \times d_{\text{model}}\), as in GPT):

  • First linear layer (weights and biases): \(4 \times d_{\text{model}}^2 + 4 \times d_{\text{model}}\)
  • Second linear layer (weights and biases): \(4 \times d_{\text{model}}^2 + d_{\text{model}}\)

Total for Feed-Forward Network: \(8 \times d_{\text{model}}^2 + 5 \times d_{\text{model}}\)

Layer Normalization (typically two per block, each with a scale and a bias vector of size \(d_{\text{model}}\)):

Total for Layer Norm: \(4 \times d_{\text{model}}\)

Combined:

Total parameters per block: \(12 \times d_{\text{model}}^2 + 13 \times d_{\text{model}}\)

For a model with \(n_{\text{layers}}\) layers, the total parameter count from the blocks alone would be:

\(P_{\text{blocks}} = n_{\text{layers}} \times (12 \times d_{\text{model}}^2 + 13 \times d_{\text{model}})\)

If you include the token embeddings, with \(n_{\text{tokens}}\) the vocabulary size:

\(P_{\text{tokens}} = n_{\text{tokens}} \times d_{\text{model}}\)

The total parameters for the model, excluding any other specialized embeddings or positional encodings, would be:

\(P_{\text{total}} = P_{\text{blocks}} + P_{\text{tokens}}\)

Note: This equation provides a simplified breakdown of where the parameters come from. However, specific implementations might have slight variations. Always refer to the architecture specifics for precise counts.
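Putting the pieces together, the estimate fits in a few lines of Python; the helper name estimate_params and the default vocabulary size of 50,000 are my own choices for illustration:

def estimate_params(d_model, n_layers, n_tokens=50_000):
    """Rough parameter estimate for a GPT-style decoder-only transformer."""
    attention    = 4 * d_model**2 + 4 * d_model   # Wq, Wk, Wv, Wo with biases
    feed_forward = 8 * d_model**2 + 5 * d_model   # d_model -> 4*d_model -> d_model, with biases
    layer_norms  = 4 * d_model                    # two layer norms, each with scale and bias
    p_blocks = n_layers * (attention + feed_forward + layer_norms)
    p_tokens = n_tokens * d_model                 # token embedding table
    return p_blocks + p_tokens

Positional embeddings and any untied output head are deliberately left out, matching the note above.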

Parameters Calculation for GPT Models

Let's use our formula to calculate the model parameters and compare them with what is reported in the literature.

  1. GPT-1:
    • Parameters: 117 Million
    • Decoder Layers: 12
    • Model dimension (\(d_{\text{model}}\)): 768
  2. GPT-2:
    • Parameters: 1.5 Billion
    • Decoder Layers: 48
    • Model dimension (\(d_{\text{model}}\)): 1600
  3. GPT-3:
    • Parameters: 175 Billion
    • Decoder Layers: 96
    • Model dimension (\(d_{\text{model}}\)): 12288

Using these specifications and the formulas above, let's calculate the parameters for each model's self-attention mechanism, feed-forward network, and other components. The per-block expression used below is a slightly simplified version of the formula above: it drops the attention output projection and regroups the bias and layer-norm terms, and it assumes a vocabulary of 50,000 tokens for the embeddings.

GPT-1

Parameters per block (\(P_{\text{block}}\)): 3 x 768^2 + 3 x 768 + 2 x 768^2 x 4 + 2 x 768 x 4 + 2 x 768

Parameters from embeddings (\(P_{\text{embeddings}}\)): 50,000 x 768

Total parameters (\(P_{\text{total}}\)): 12 x \(P_{\text{block}}\) + \(P_{\text{embeddings}}\)

GPT-2

Parameters per block (\(P_{\text{block}}\)): 3 x 1600^2 + 3 x 1600 + 2 x 1600^2 x 4 + 2 x 1600 x 4 + 2 x 1600

Parameters from embeddings (\(P_{\text{embeddings}}\)): 50,000 x 1600

Total parameters (\(P_{\text{total}}\)): 48 x \(P_{\text{block}}\) + \(P_{\text{embeddings}}\)

GPT-3

Parameters per block (\(P_{\text{block}}\)): 3 x 12288^2 + 3 x 12288 + 2 x 12288^2 x 4 + 2 x 12288 x 4 + 2 x 12288

Parameters from embeddings (\(P_{\text{embeddings}}\)): 50,000 x 12288

Total parameters (\(P_{\text{total}}\)): 96 x \(P_{\text{block}}\) + \(P_{\text{embeddings}}\)

Evaluating these expressions gives:

  1. GPT-1: Approximately 116.4 million parameters
  2. GPT-2: Approximately 1.43 billion parameters
  3. GPT-3: Approximately 160.1 billion parameters

Comparing these with the reported numbers:

  • GPT-1: estimated ~116.4M vs. 117M reported
  • GPT-2: estimated ~1.43B vs. 1.5B reported
  • GPT-3: estimated ~160.1B vs. 175B reported

Most of the GPT-3 gap comes from the attention output projection that the simplified per-block expression drops: at \(d_{\text{model}} = 12288\) it contributes roughly \(d_{\text{model}}^2 \approx 151\) million parameters per block, or about 14.5 billion across 96 layers, which brings the estimate close to the reported 175 billion.

 
    
# Code for the formulas and calculations in text format for each model

V = 50_000  # assumed vocabulary size

models_specs = {
    "GPT-1": {"d_model": 768,   "n_layers": 12},
    "GPT-2": {"d_model": 1600,  "n_layers": 48},
    "GPT-3": {"d_model": 12288, "n_layers": 96},
}

model_params_count = {}

for model, specs in models_specs.items():
    d_model = specs["d_model"]
    n_layers = specs["n_layers"]

    # Breakdown of the formula components for each model
    model_params_count[model] = {
        "P_block": f"3 x {d_model}^2 + 3 x {d_model} + 2 x {d_model}^2 x 4 + 2 x {d_model} x 4 + 2 x {d_model}",
        "P_embeddings": f"{V} x {d_model}",
        "P_total": f"{n_layers} x (P_block) + P_embeddings"
    }


{'GPT-1': {'P_block': '3 x 768^2 + 3 x 768 + 2 x 768^2 x 4 + 2 x 768 x 4 + 2 x 768',
  'P_embeddings': '50000 x 768',
  'P_total': '12 x (P_block) + P_embeddings'},
 'GPT-2': {'P_block': '3 x 1600^2 + 3 x 1600 + 2 x 1600^2 x 4 + 2 x 1600 x 4 + 2 x 1600',
  'P_embeddings': '50000 x 1600',
  'P_total': '48 x (P_block) + P_embeddings'},
 'GPT-3': {'P_block': '3 x 12288^2 + 3 x 12288 + 2 x 12288^2 x 4 + 2 x 12288 x 4 + 2 x 12288',
  'P_embeddings': '50000 x 12288',
  'P_total': '96 x (P_block) + P_embeddings'}}
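
To turn these formula strings into the approximate totals quoted above, the same simplified expression can be evaluated numerically (reusing models_specs from the block above; the helper name numeric_estimate is mine):

def numeric_estimate(d_model, n_layers, vocab=50_000):
    # Simplified per-block count: QKV projections, feed-forward layers, layer norms
    p_block = 3 * d_model**2 + 3 * d_model + 2 * d_model**2 * 4 + 2 * d_model * 4 + 2 * d_model
    return n_layers * p_block + vocab * d_model

for model, specs in models_specs.items():
    print(model, f"{numeric_estimate(specs['d_model'], specs['n_layers']):,}")

# GPT-1 116,376,576        (~116.4 million)
# GPT-2 1,432,678,400      (~1.43 billion)
# GPT-3 160,080,396,288    (~160.1 billion)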

   


Posted at 8:12 PM on Saturday, 2nd of September, 2023