Intro
This blog post is based on the second part of Andrej Karpathy’s makemore series. For part 1 check out my previous blog post.
We are building on Andrej’s name prediction model makemore that we started to develop in part 1.
Recap of Part 1
We created a matrix counting the bigrams (character pairs) in over 32,000 names. Each element in the matrix holds the count for one bigram, for example ‘.b’ in the third column in the first row occurs 1306 times. The ‘.’ character indicates beginning (and end) of a name. So we have 1306 names that start with character ‘b’.

Based on the counts we built a probability matrix, normalizing the counts relative to the sum of counts for each row. Then we developed a variation that uses a randomly initialized neural network and multiple forward-backward optimization passes using backpropagation. The alternative approach yielded the same result.
The problem with the counting approach is that it does not yield good results, as only one character is considered when calculating the probability of the next character. The table also explodes in size if we consider more context, e.g. three characters instead of one, because the number of possible contexts grows exponentially with the context length: 27*27*27 for three characters.
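As a quick back-of-the-envelope check (a sketch, assuming the 27-character alphabet of '.' plus a-z), the count table grows like this:
vocab = 27                 # '.' plus a-z
print(vocab ** 2)          # 729 cells for the bigram table (1 character of context)
print(vocab ** 3 * vocab)  # 531441 cells for 3 characters of context plus the predicted character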
Improving the model adding context
In this video we use an MLP to predict with more context. The approach is based on the paper 'A Neural Probabilistic Language Model' by Bengio et al. The paper uses a vocabulary of 17,000 words and tries to predict a word based on its context, i.e. the previous words. The authors use a Multi Layer Perceptron (MLP) model trained with backpropagation. Each word is embedded in a 30-dimensional feature vector. In other words, each of the 17,000 words in the model indexes into an embedding matrix C with 30 columns, 'plucking out' a feature vector for that word. The embeddings are tuned during training so that words with similar meaning end up in a similar part of the vector space.

The next layer is a hidden layer. Its size is a hyperparameter, i.e. the designer of the network can try different sizes of that hidden layer and evaluate the impact on model performance. As each word considered has a feature vector of size 30, we have 90 numbers feeding into the hidden layer for a context of three words. A non-linearity is applied to the output of the hidden layer. The final layer takes the output of the hidden layer and produces 17,000 outputs, generating a probability distribution for the following word. The parameters we optimize when we train the model are the weights and biases of the hidden layer and the output layer, plus the embeddings. During training, many iterations of forward and backward passes maximize the probability of the 'correct' word that follows the three words that preceded it. Inspired by this paper we will implement a similar model for our name model 'makemore'.
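As a rough sketch of the shapes involved (not the paper's code; the hidden size of 100 is an arbitrary choice here, the other numbers follow the description above):
import torch

vocab_size, embed_dim, context = 17000, 30, 3     # from the paper as described above
hidden = 100                                       # hyperparameter, arbitrary choice for this sketch

C  = torch.randn(vocab_size, embed_dim)            # embedding matrix: one 30-dim vector per word
W1 = torch.randn(context * embed_dim, hidden)      # hidden layer: 3 * 30 = 90 inputs
b1 = torch.randn(hidden)
W2 = torch.randn(hidden, vocab_size)               # output layer: 17,000 logits, one per word
b2 = torch.randn(vocab_size)

idx = torch.randint(0, vocab_size, (1, context))   # indices of the three context words
emb = C[idx].view(1, -1)                           # (1, 90): concatenated embeddings
h = torch.tanh(emb @ W1 + b1)                      # (1, 100)
logits = h @ W2 + b2                               # (1, 17000): scores for the next word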
Reading the words and converting them to integers is the same code we already looked at.

Now we need to transform our list of names into contexts.
block_size = 3
X, Y = [], []
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '----->', itos[ix])
        context = context[1:] + [ix]
X = torch.tensor(X)
Y = torch.tensor(Y)
X holds the contexts comprising block_size=3 letters, and Y is the letter following each context. We loop over the first 5 words to test this. In the outer loop over the words, the context for each word is initialized as a list of zeros of length block_size. In the inner loop we iterate over all characters of the word plus an appended '.' character. The character is translated to an integer. The current context (initially [0,0,0]) is appended to our list of contexts X (a list of lists), and the number the current character maps to is appended to Y, the list of characters that follow each context. The elements of Y are also called the 'labels', a term used for the known output category for a given input. At the end of the inner loop the context is 'left shifted': the leftmost character is dropped and the current character is added as the rightmost element of the list. Finally we convert X and Y to torch tensors. X and Y for the first five names look like this
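For example, for the first word 'emma' (with '.' mapped to 0, 'a' to 1, …, 'z' to 26) the inner loop prints and records:
... -----> e    appends context [0, 0, 0]   and label 5
..e -----> m    appends context [0, 0, 5]   and label 13
.em -----> m    appends context [0, 5, 13]  and label 13
emm -----> a    appends context [5, 13, 13] and label 1
mma -----> .    appends context [13, 13, 1] and label 0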

with this shape and data type

Embedding the input
Now we need to think about how we can modify X so that we can feed it into a neural network. First we create the embedding matrix C that maps each character to a set of numerical values, in our simple case only 2.
C = torch.randn([27,2])
For each character we will pluck out a row of the embedding matrix C. Each row of C has two elements. We can index directly into C, for example get the embedding for 'e' as C[5], or one-hot encode 'e' and multiply the one-hot vector with C, treating the indexing as the first layer of our MLP.
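A small sketch of that equivalence (assuming stoi['e'] == 5, as in part 1):
import torch
import torch.nn.functional as F

C = torch.randn(27, 2)
one_hot = F.one_hot(torch.tensor(5), num_classes=27).float()  # 27-dim vector with a 1 at position 5
print(torch.allclose(one_hot @ C, C[5]))                      # True: both pluck out row 5 of C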
What we really want to do, though, is to map each character in the context matrix X to an embedding vector. For that we can index into the embedding matrix C with the context matrix X, i.e. C[X].
emb = C[X]
This is a little difficult to wrap your head around so here is a simple example that explains the idea

Each number in matrix a maps to the row of matrix b that it indexes into. This adds a dimension to the resulting matrix, as each numerical value in a is now represented by the corresponding two-element row of b. If we look at the first row of a with the values [3,4,1], in c=b[a] this maps to [[3,2], [6,7], [0,4]]: 3 maps to the row of b at index 3, which is [3,2], 4 maps to the row at index 4, which is [6,7], and 1 maps to the row at index 1, which is [0,4] (indexing starts at 0). All of this only works if every value in matrix a is a valid row index into matrix b.
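A similar toy example (a sketch with made-up values, not Andrej's exact matrices) shows the same behaviour:
import torch

b = torch.randn(5, 2)              # 5 rows, each a 2-element 'embedding'
a = torch.tensor([[3, 4, 1],
                  [0, 2, 2]])      # integer indices into the rows of b
c = b[a]
print(c.shape)                     # torch.Size([2, 3, 2]): every index is replaced by a 2-element row
print(torch.equal(c[0, 0], b[3]))  # True: c[0, 0] is row 3 of b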
In Andrej’s example, each of the three numbers representing the context characters maps to two numbers from the embedding matrix C, so each character in each context becomes a two-dimensional vector. The resulting tensor C[X] in our example has the dimensions 32x3x2: 32 contexts, 3 characters each, each character mapped to a vector of size 2.

Now, let’s build the neural network. For the first layer we want to input the embedded context, which is 3 characters, each embedded in two numbers. We choose 100 neurons for the hidden layer (the layer that has no direct connection to the inputs or outputs), so its output is a vector of size 100.
W1 = torch.randn([6,100])
b1 = torch.randn(100)
As output of the first layer we want something along the lines of 'emb @ W1 + b1'. But emb is of shape 32x3x2 while we need 32x6. For each context we need to transform the embedded context of size 3x2 into one vector with 6 elements. Andrej now shows multiple options to accomplish this:
1. We extract the three embedding pairs for all contexts and concatenate them side by side (along the columns). The embedding of all first characters in the context is emb[:,0] with shape 32x2

emb[:,1] and emb[:,2] pull out the embeddings for the second and third character and have the same shape as emb[:,0]. And with torch’s cat function we can concatenate the embeddings for each character in the context
torch.cat([emb[:,0],emb[:,1],emb[:,2]], 1)
Note that we need to pass 1 as the dim argument to indicate that we want to concatenate along dimension 1, i.e. the columns.
2. We can use the torch function 'unbind' to remove a tensor dimension: it returns a tuple of separate tensors, one slice for each index along the removed dimension. Simple example to the rescue …
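A small toy example (a sketch, not Andrej's exact values):
import torch

t = torch.arange(12).view(3, 4)   # shape (3, 4)
rows = torch.unbind(t, 0)         # tuple of 3 tensors of shape (4,), one per row
cols = torch.unbind(t, 1)         # tuple of 4 tensors of shape (3,), one per column
print(len(rows), rows[0].shape)   # 3 torch.Size([4])
print(len(cols), cols[0].shape)   # 4 torch.Size([3])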

In Andrej’s example we have 32 rows of context; unbinding along dimension 1 returns a tuple of three tensors (each of shape 32x2), corresponding to the first, second and third letter of the context. If we concatenate these three we have the transformation we are looking for
torch.cat(torch.unbind(emb, 1), 1)
3. We can use Tensor.view, which offers a different view on the same underlying data without copying or rearranging it. In our example we can ask PyTorch to view emb as a 32x6 matrix
emb.view(emb.shape[0], -1)
We are using emb.shape[0] as we want to avoid hardcoding numbers. As our tensor emb has the dimensions 32x3x2 we just have to specify that we want 32 rows. By passing only one additional argument we imply that we want a two-dimensional result, and '-1' tells PyTorch to figure out the second dimension automatically, which can only be 6. We could also pass -1 for the number of rows and 6 for the number of columns and have PyTorch figure out the number of rows
emb.view(-1, 6)
To learn about the computational efficiency of view you can have a look here.
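A quick sanity check (a sketch using a random stand-in for emb) that all three variants produce the same 32x6 tensor:
import torch

emb = torch.randn(32, 3, 2)                           # stand-in for C[X] from above
v1 = torch.cat([emb[:, 0], emb[:, 1], emb[:, 2]], 1)
v2 = torch.cat(torch.unbind(emb, 1), 1)
v3 = emb.view(emb.shape[0], -1)
print(v1.shape)                                       # torch.Size([32, 6])
print(torch.equal(v1, v2) and torch.equal(v2, v3))    # True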
So we can feed the embedded context (viewed as a 32x6 matrix) into our hidden layer of neurons and apply the non-linearity in a single line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
One thing we need to be mindful of when adding b1 is broadcasting. b1 is a vector of size 100, while emb.view(-1, 6) @ W1 is a matrix of shape 32x100. When there is a dimension mismatch, PyTorch aligns the shapes starting from the right and inserts a 1 for the missing dimension, so b1 is treated as a 1x100 row vector. That row is then replicated/broadcast 32 times over all rows to match the dimensions.
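A minimal sketch of that broadcasting rule with stand-in tensors:
import torch

m = torch.randn(32, 100)   # stand-in for emb.view(-1, 6) @ W1
b = torch.randn(100)       # stand-in for b1
print((m + b).shape)       # torch.Size([32, 100]): b is treated as (1, 100) and broadcast over all 32 rows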
For the output layer we define W2 of size 100x27 (we have 100 input values and 27 output values, one for each possible next letter). b2 is of size 27 and will be broadcast just like b1 in the hidden layer.
W2 = torch.randn((100, 27))
b2 = torch.randn(27)
logits = h @ W2 + b2

Calculating probabilities and loss
As we interpret the logits as logs of counts, we exponentiate each matrix element and then normalize by the sum of each row to get probabilities (we are effectively applying the softmax function here).
counts = logits.exp()
prob = counts / counts.sum(1, keepdim = True)
Now we want to get a sense of the quality of the model, which will of course be really bad as the parameters are still randomly initialized. We first look at Y, the vector that holds the character following each set of three characters in X. The probability the model assigns to the correct next character is
prob[torch.arange(32), Y]
We index into each row of prob with the number that represents the actual following character. To calculate the loss as defined in part 1 we take the log of each probability, average them (i.e. apply the mean) and multiply by -1, as we want to minimize the negative log likelihood.
loss = -prob[torch.arange(32), Y].log().mean()
Now Andrej cleans up what we have done so far and introduces a generator so that we get reproducible random numbers
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27,2), generator = g, requires_grad=True)
W1 = torch.randn((6,100), generator = g, requires_grad=True)
b1 = torch.randn((100), generator = g, requires_grad=True)
W2 = torch.randn((100, 27), generator = g, requires_grad=True)
b2 = torch.randn((27), generator = g, requires_grad=True)
parameters = [C, W1, b1, W2, b2]
emb = C[X]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
counts = logits.exp()
prob = counts / counts.sum(1, keepdim = True)
loss = -prob[torch.arange(32), Y].log().mean()
loss
We can eliminate
counts = logits.exp()
prob = counts / counts.sum(1, keepdim = True)
loss = -prob[torch.arange(32), Y].log().mean()
By using the cross entropy loss function
loss = F.cross_entropy(logits, Y)
and with that we get
emb = C[X]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Y)
We should always prefer cross_entropy over the manual loss calculation. The implementation of the cross_entropy function and the calculation of its gradient in PyTorch are optimized and run very efficiently. The other benefit we get is numerical stability. If we have a very large logit, for example 100, the result of the exponentiation is too large to represent and becomes inf. If we subtract a fixed number from the array of logits the probability distribution does not change, as subtracting a fixed number is equivalent to dividing by e^(fixed number); this affects the numerator and the denominator of prob = counts / counts.sum(1, keepdim = True) equally, so the probabilities stay the same. When using cross_entropy(logits, Y), PyTorch automatically subtracts the largest logit, ensuring numerical stability.
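A small sketch of the overflow problem and of why subtracting the maximum fixes it:
import torch

logits = torch.tensor([1.0, 5.0, 100.0])
print(logits.exp())                         # tensor([2.7183e+00, 1.4841e+02, inf]): e^100 overflows float32
shifted = logits - logits.max()             # subtracting a constant does not change the distribution
print(shifted.exp() / shifted.exp().sum())  # well-behaved probabilities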
Optimizing the model
Now we can iterate over the model with forward and backward passes. First we need to indicate that we want gradients to be calculated for all the parameters
for p in parameters:
    p.requires_grad = True
Next we run 10 iterations of forward and backward pass
for _ in range(10):
    # forward pass
    emb = C[X]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y)
    print(loss.item())
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    for p in parameters:
        p.data += -p.grad * 0.1
And we can bring down the loss to really low values

Because we have a large number of parameters and only 32 examples, it is easy to drive the loss down. We are ‘overfitting’ here, which means we are building a model that works well on the training set but will not generalize. Overfitting is a machine learning phenomenon that occurs when a model is trained too closely to a specific set of data, resulting in poor performance on new data.
Andrej also points out that we cannot bring the loss down to zero. Zero loss would mean that every context input generates the right output. If we want to look at the predicted following character we can take the max of the logits along each row (logits.max(1)). The indices part of the output gives the index with the highest logit value, which is the same as the index of the character with the highest probability

We see that for the first context ‘…’ index 19 is predicted. But the label Y[0] is actually 5. It is impossible to achieve 100% accuracy as we have ambiguity here. For example the context ‘…’ at the beginning is followed by e[mma], o[livia], a[va], i[sabella], s[ophia].
Now we switch to the full data set rather than the first 5 words. So we scroll up to where we defined the data set and change
for w in words[:5]
to
for w in words
to run over the full data set. We now have 228,146 input contexts.

and we see how our optimization improves our loss

Introducing minibatches
As we are using a large number of input examples, each iteration requires a lot of computation and training is relatively slow. In practice you would not run the forward pass, the backward pass and the update on the entire data set. Instead you would use minibatches, i.e. a small random subset of the inputs for each iteration. With minibatches the quality of the gradient is lower, but it is still useful: it is better to have an approximate gradient and take more steps than to have the exact gradient and take fewer steps. We add this line, which picks 32 random indices for each batch, so the embedded input is of shape 32x3x2.
ix = torch.randint(0, X.shape[0], (32,))
So our code now looks like this
for _ in range(100):
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data += -p.grad * 0.1
print(loss.item())
Note that the loss is also calculated using only the corresponding labels Y[ix]. And rerunning it we see almost instant results.
Finding a good learning rate
Next we try to find a good learning rate. Through trial and error Andrej finds that a good learning rate lies between 0.001 and 1. At the lower end the loss barely decreases; at the higher end the loss does not converge. We create a linear space over the corresponding base-10 exponents that covers the 0.001 to 1.0 interval.
lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre
This graph shows how the resulting learning rates are spaced exponentially.

Now we change our code to keep track of the learning rate and the corresponding loss as we iterate. We run 1000 iterations and after each iteration we pick the next learning rate from the list of the exponentially spaced learning rates lrs that we defined above.
lri = []   # list to save the learning rate exponents
lossi = [] # list to save the loss
for i in range(1000):
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    lr = lrs[i]
    for p in parameters:
        p.data += -p.grad * lr
    lri.append(lre[i])
    lossi.append(loss.item())
plt.plot(lri, lossi)
At the end we plot the loss over the exponent of the learning rate and get a graph like this

The graph shows a good learning rate around the exponent -1, which means a learning rate of 10**-1 = 0.1. So we can feel confident that 0.1 is a good learning rate. If we run the model multiple times with 10,000 iterations each and then reduce the learning rate to 0.01 we get a loss of around 2.3, below the 2.45 we achieved for the bigram model.
So to find the right learning rate we need to …
- Establish a lower and upper bound by finding out when the loss barely decreases over many iterations (~1000) and when it explodes
- Create an exponential learning rate array covering the range you determined above and increase the learning rate drawing from that space. Plot the loss over the learning rate exponent
- Find the ‘sweet spot’ for the learning rate from the graph
Preventing overfitting
As the capacity/complexity of the neural network grows it becomes more capable of overfitting the training set. The loss can become really low, but the model just memorizes the training set. If you apply an overfitted model to new data the loss will be very high. So it makes sense to ‘withhold’ data that is only used for evaluation. You typically break up your data into a training split (80%), a dev/validation split (10%) and a test split (10%). The training split is used to optimize the parameters of the model. The dev/validation split is used to compare the results of the trained model for different hyperparameters, e.g. the hidden layer size. The test split is used to evaluate the performance of the model at the very end. You should only evaluate on the test split a few times, because otherwise you start tuning to the test set as well, which again leads to an overfitted model.
To generate training, validation and test split we write the dataset generation as a function that takes a list of words as input and returns a context matrix and the labels. We shuffle the words and call the function for each split, with 80% of the words in the training split, 10% of the words in the validation split and 10% in the test split.
def build_dataset(words):
    block_size = 3
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            # print(''.join(itos[i] for i in context), '----->', itos[ix])  # too verbose for the full word list
            context = context[1:] + [ix]
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y
import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
Note that because words differ in length, the random shuffle puts a slightly different mix of words, and thus a different number of contexts, into each split. This means that the sizes of Xtr, Xdev and Xte vary slightly depending on the shuffle.
With a training set that is different from the test set we can validate our model and we see that our loss on the dev set is similar to our loss on the training set.
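A sketch of that comparison, reusing the trained parameters and the splits from above:
# evaluate the loss on the full training split ...
emb = C[Xtr]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
print('train loss:', F.cross_entropy(logits, Ytr).item())

# ... and on the dev split the model was not trained on
emb = C[Xdev]
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
logits = h @ W2 + b2
print('dev loss:', F.cross_entropy(logits, Ydev).item())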

The loss calculated from the training set is pretty close to the loss that we calculated for the dev set, i.e. the data we did not train on. The model is not powerful enough to memorize the data; we are underfitting. This normally means that the network is too small.
Improving the model
Now let’s increase the size of the hidden layer to 300 neurons and run 30,000 iterations with a learning rate of 0.1
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 300), generator=g)
b1 = torch.randn(300, generator=g)
W2 = torch.randn((300, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True
lri = []
lossi = []
stepi = []
for i in range(30000):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # forward pass
    emb = C[Xtr[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 300)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
    # track stats
    stepi.append(i)
    lossi.append(loss.item())
and plot the result

We see that the minibatches create a bit of noise in this model. Our loss is 2.51 in line with Andrej’s results.

Andrej’s conclusion here is that we still have not optimized this network very well. The next iteration I ran brings the loss down to 2.43

We reduce the learning rate to 0.05 and reduce the loss to 2.32. Further iterations I ran brought it down to 2.27. The loss only decreases slowly. It could be that the embeddings are the bottleneck of the network. Maybe we are cramming too many characters into two dimensions. To see if the embeddings are the issue let’s visualize them. This is obviously only possible as long as the embeddings have just two dimensions.
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:,1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), itos[i], ha="center", va="center", color="white")
plt.grid('minor')
which produces this scatter plot

i, e, a, o are kind of clustered so the neural net thinks that these are similar.
Now we will scale up the embedding size to see how that impacts model performance. We move to 10 embedding dimensions and reduce the hidden layer size to 200 neurons.
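A sketch of how the re-initialization might look (10-dimensional embeddings and a 200-neuron hidden layer, as described above; details may differ from Andrej's exact cell):
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g)   # 3 context characters * 10 embedding dimensions = 30 inputs
b1 = torch.randn(200, generator=g)
W2 = torch.randn((200, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True
Note that the training loop then needs emb.view(-1, 30) instead of emb.view(-1, 6).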

Plotting the logarithm of the loss avoids the steep hockey-stick-like curve and looks like this
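This plot can be produced with something along these lines (a sketch, assuming stepi and lossi were tracked as in the training loop above):
plt.plot(stepi, torch.tensor(lossi).log10())   # plot log10 of the loss to flatten the hockey stick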

and we get a loss of 2.33 for the training and 2.35 for the dev set

After reducing the learning rate to 0.01 and running another 50,000 iterations we get 2.168 for the training and 2.196 for the dev set

The fact that the loss for the dev set is slightly higher than for the training set is an indication that we are starting to overfit: the parameters work very well for the training set and not quite as well for the dev set the model was not trained on.
Andrej points out that you would typically run lots of experiments on the training split, trying different hyperparameters. Then you check which ones give the best performance on the dev set. Finally you do a single run on the test set, and that is the number you share to brag about your model.
The last variation we try runs 200,000 iterations and drops the learning rate from 0.1 to 0.01 after 100,000 iterations.
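A sketch of how that run might look, reusing the minibatch loop from above (with the 10-dimensional embeddings, hence emb.view(-1, 30)):
for i in range(200000):
    ix = torch.randint(0, Xtr.shape[0], (32,))
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    lr = 0.1 if i < 100000 else 0.01   # drop the learning rate after 100,000 iterations
    for p in parameters:
        p.data += -lr * p.grad
    stepi.append(i)
    lossi.append(loss.item())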

We get 2.133 for the training and 2.176 for the dev set.

Other parameters to play with are the number of neurons in the hidden layer, the dimensionality of the embedding, or the context size. We can also change the optimization itself: the number of iterations, the batch size, the learning rate and the learning rate decay.
Sampling from the model – Inference
Lastly we look at how to use this model for predictions (inference). We ‘predict’ 20 names (the outer for loop). For each name we initialize the context with block_size zeros. We then enter an endless loop that we break out of when the end character ‘.’ is predicted (ix == 0). We first calculate the embedding of the current context, which gives a three-dimensional array of shape (1, block_size, embedding_size). We flatten it to two dimensions with emb.view(1, -1) and feed it into the hidden layer, whose output is passed to the output layer to produce the logits. These are the same steps as in the forward pass, with the only difference that our input is a single context. We apply the softmax function to turn the logits into a probability distribution and sample one character from it. The sampled character is appended on the right of the context (which will be used to predict the next character) while the leftmost character is dropped, and it is also appended to the list of characters for the current name. When the end character is sampled we break out of the loop. At the end we run a list comprehension over all numbers in a name, translate them to characters and join them to print the name.
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
        emb = C[torch.tensor([context])] # (1, block_size, d)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))

Particularly in comparison with the original bigram model we see that the output is more ‘name like’, which indicates that our model quality has improved.
