Intro
This blog post is based on a video by gen AI guru Andrej Karpathy.
In this video Andrej implements a framework called Micrograd for building computational graphs that support backpropagation. Backpropagation calculates the gradients of the parameters used in a computational graph to quantify how a small change of each parameter affects the final output of the graph. “Backpropagation is at the mathematical core of any modern Deep Learning library like Pytorch or Jax” (Andrej). It will all become clear as you read on …
To follow along you should have access to a Jupyter notebook or to Google Colab. The code Andrej is building is available here but I would encourage you to type as you follow along rather than just copy-paste. If you are not sure how to get started with configuring a runtime, please have a look at my blog post on Andrej’s makemore part 1 that describes the setup of Jupyter and Colab at the beginning.
The Derivative
After importing the ‘standard’ set of libraries we define a simple quadratic function that we plot over the range from -5 to 5 with an interval of 0.25.


As discussed in the previous section … if we want to calculate the slope of the function at x=3 we increase x by a little, let’s say h=0.001, and calculate by how much the function increases. That increase divided by the step-width is the slope of the function at x=3. That slope is also called the derivative. In the image h is actually bigger than 0.001 but it illustrates the concept that for small values of h the slope/derivative of f(x), f'(x), can be calculated as (f(x+h) – f(x)) / h.
In this case the slope at 3 is 14.

If we differentiate the formula in our head following the basic rules of differentiation we can verify that 14 is the correct number as the derivative of 3*x^2 – 4*x + 5 is 6*x-4 which is 14 for x=3. At x=2/3 the derivative is 0.
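As a quick sanity check, the numerical derivative can be computed in a few lines of plain Python (the function and values here match the ones above):

```python
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.001
x = 3.0
# finite-difference approximation of f'(3)
slope = (f(x + h) - f(x)) / h
print(slope)  # ≈ 14, the analytic value of 6*x - 4 at x = 3
```

Making h smaller moves the approximation closer to the analytic value of 14.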
Now let’s build our first graph. We want to find out how a small change in a affects the result d, where d = a*b + c.

This makes sense as per the basic rules of differentiation the derivative of a*b+c with respect to a is b, which we initialized to -3. Analogously, the derivative of a*b+c with respect to b is a, which was initialized to 2.0. If c is increased, the result of a*b+c increases by the same amount, so the derivative with respect to c is 1.
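The same nudge-and-measure check works for this expression (using the values a = 2.0, b = -3.0, c = 10.0 as in the graph):

```python
h = 0.0001
a, b, c = 2.0, -3.0, 10.0
d1 = a*b + c
# nudge a by h and see how much d changes
d2 = (a + h)*b + c
slope_a = (d2 - d1) / h
print(slope_a)  # ≈ -3.0, i.e. the value of b
```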
Building a ‘Value’ object
First some basic information on Python functions that start with “__”. In Python, “__” functions, also called “dunder methods” (double underscore methods), are special built-in methods that are used to define how objects behave when operations like addition, multiplication or comparison are applied.
__init__ is used when an object instance is initialized. In our case we assign a number to the data member of the Value object. __repr__ is called when Python needs a printable representation of the instance, for example when the instance name is entered in a notebook cell. It is typically used to display the state of an instance, in our case returning the value of data.
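A minimal sketch of what the class could look like at this stage (my reconstruction, with only __init__ and __repr__ so far):

```python
class Value:
    def __init__(self, data):
        # store the raw number this Value wraps
        self.data = data

    def __repr__(self):
        # printable representation showing the state of the instance
        return f"Value(data={self.data})"

print(Value(2.0))  # Value(data=2.0)
```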

Now we declare methods for our objects that define the behavior of the operators we need. First we implement the __add__ function that is implicitly called when two Value objects are combined with the ‘+’ operator. The syntax for __add__ is
__add__(self, other)
When two object instances are combined through ‘+’, self is a reference to the instance on the left of ‘+’, other is a reference to the instance to the right of ‘+’. The function generates a new instance of Value initialized with its data member set to the sum of the data of self and other.
def __add__(self, other):
    out = Value(self.data + other.data)
    return out
We do the same for multiply and make sure that the operators work:

Now we introduce the concept of children. If an object is the result of a ‘+’ or ‘*’ operation we make the object remember the ‘original’ objects the operator was applied to as ‘children’. This is key for backpropagation as we need to build a tree that we can eventually traverse backwards. Each object now has a variable (self._prev, initialized from the _children argument) that remembers the two values that were combined to produce the new Value object, and a variable (_op) that tracks the operator that was used to combine them. Note that the ‘out = Value()’ assignment in __add__ and __mul__ implicitly calls __init__ to create and initialize the result object, passing the operands and operator.
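Putting the pieces together, the extended class could be sketched like this (my reconstruction following the description above; _prev stores the children as a set):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # the operand objects this value was built from
        self._op = _op               # the operator that produced this value

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a*b + c
print(d, d._op)  # Value(data=4.0) +
```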

Building and visualizing a graph
We now use the open source graph visualization software Graphviz to visualize this simple graph. First we define ‘trace’, a helper function that generates a list of all nodes (nodes set) and their dependencies (edges set). The edges set contains the two-element tuples for each child and its parent.
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        print("Called build for", v.data)
        nodes.add(v)
        print("Saving node:", v.data)
        if len(v._prev) == 0:
            print("No children - Returning from node", v.data)
            return
        else:
            for child in v._prev:
                edges.add((child, v))
                print(f"Adding tuple: {v.data}, {child.data}")
                build(child)
    build(root)
    return nodes, edges
‘trace’ initializes nodes and edges as empty sets and defines a build function for generating the nodes and edges set. It adds the node it is called for (e.g. d with d.data=4) to the node set. If the node has children (e.g. e with e.data=-6) we enter the for loop. In the for loop the node (d) and its first child (e) are added as a tuple to edges. The function calls build with the first child (e) as an argument. This is a recursion, meaning that we are in a function that calls itself. The child (e) is added to the node set and if that child has children of its own, the child/child-child tuple (e,b) is added to the edges set and build is called for the child-child (b) etc. Once we hit a node without children of its own (b) we print “No children – Returning from node” and that invocation of the build function returns. Execution continues with the next element (the second child, a) of the previous for loop in the nested stack of loops.


from graphviz import Digraph

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        print(uid)
        # create a drawing node (dot.node) for every Value node
        dot.node(name=uid, label="{ %s | data %.4f }" % (n.label, n.data), shape='record')
        if n._op:
            # if the node is the result of an operation create an operation node
            dot.node(name=uid + n._op, label=n._op)
            # connect the operation node to the result node
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # for all tuples connect the child (first element in the tuple) with the operator node
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
‘draw_dot’ first calls trace, which returns the set of nodes and edge tuples. It loops over all nodes in the node set and creates a dot.node object for each node. It uses the object’s uid as identifier. uid is generated with Python’s id() function that returns an object’s memory address. If the node is the result of an operation it has n._op set to the operator (e.g. ‘+’) used to generate it. We create an additional dot.node for the operation that just contains the operator as label. We append the operator to the uid string we used as name for the node and assign that as the name of the operator node. dot.edge connects the operator node (name uid + n._op) with the result node (name uid). So we have generated all nodes and we have connected the operator nodes to the ‘result’ nodes.
What we are missing is the connection from the edge nodes to the operator nodes. The second loop (for n1, n2 in edges) loops over all edge tuples. Remember … each tuple contains the child first and then its parent as we added them in that order in trace
edges.add((child, v))
So for each child (first element in the tuple) dot.edge connects that child, identified by its id (converted to a string), to the operator node identified by the parent id plus operator (second element in the tuple converted to a string with the operator stored in the parent appended).

Note: The Graphviz package needs to be installed on your system. On Ubuntu based systems you can install it with
sudo apt install graphviz
We add another operation at the end of the graph and call the end result L for loss
f = Value(-2.0, label='f')
L=d*f; L.label='L'
and the new graph looks like this

Adding gradients to the graph
Now we want to calculate how each variable impacts the loss L. For each Value object we add a variable (grad) that maintains the derivative of the loss with respect to that node. At initialization we assume the variable has no effect on the loss, so we initialize it with 0.
We start calculating the derivatives from right to left. The derivative of L with respect to L is 1. The derivative of the loss with respect to d is relatively easy to figure out. With L = d * f, the derivative dL/dd is f, so d.grad = f.data = -2.0. Analogously, dL/df = d, so f.grad = d.data = 4.0.
Now we want to understand the derivative of the loss with respect to c
If we know the impact c has on d and the impact of d on L we should be able to put together how c impacts L. Let’s first look at the impact of c on d
Intuitively it makes sense that the derivative of a sum with respect to one element is 1. If we increase the value of the element by a small amount the result of the addition changes by the same amount. So the slope for an addition is always 1. So now we have the ‘local’ derivative, meaning the derivative of d with respect to c. But how do we calculate the derivative of L with respect to c, i.e. the impact of a small change of c on L?
We can accomplish that by applying the chain rule. The chain rule says that the derivative of a function f that is applied to another function g(x) with respect to x is the product of the derivative of f with respect to g and g with respect to x
In our example we want the derivative of L with respect to c, so our x is c. The operation applied directly to c is our g:

d = c + e (where e = a*b)

and the following operation is our f:

L = d * f

so, by the chain rule,

dL/dc = (dL/dd) * (dd/dc)

and we know dd/dc = 1 (the local derivative of the addition) and dL/dd = f.data = -2.0, so that

dL/dc = -2.0 * 1 = -2.0

As dd/dc = 1 and dL/dd = d.grad, we can also write this as c.grad = d.grad.
We can generalize this example:
The gradient of a node is the product of the local derivative (the derivative of the operation’s result with respect to the node) and the gradient of the node that is the result of that operation.
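We can sanity-check this rule numerically in plain Python (values as above: a = 2, b = -3, c = 10, f = -2):

```python
h = 0.0001
a, b, c, f = 2.0, -3.0, 10.0, -2.0

def loss(a, b, c, f):
    e = a * b
    d = e + c
    return d * f

# the local derivative of d with respect to c is 1 and dL/dd is f,
# so the chain rule predicts dL/dc = f * 1 = -2
numeric = (loss(a, b, c + h, f) - loss(a, b, c, f)) / h
print(numeric)  # ≈ -2.0
```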
Following this concept we calculate the rest of the tree and it looks like this

Building a simple neuron
A neuron in a neural network comprises a vector of weights, a bias and a non-linearity. First the dot product of the weight vector and an input vector x is calculated. The bias is added to the output of that dot product. Then a non-linear activation function is applied to the result. For example, for three inputs the output before applying the non-linearity looks like this: z = w1*x1 + w2*x2 + w3*x3 + b
The non-linear activation function is then applied to z. We are adding a non-linearity to the neural network to enable it to learn. This interesting blog article (“Why Are Neural Nets Non-linear?“) compares the paradigm of classical programming with deep learning. In classical programming we have instructions that manipulate data and instructions that control the program flow by making decisions. In deep learning the data manipulation is through the linear layers and the decision making is done by the non-linear layers “by dropping data points that are less relevant than others”.

We write this out for the simple case of two inputs but cannot (yet) apply the non-linearity as we are missing a non-linear operation in our Value object.
# neuron - conceptual model
# input x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# neuron bias
# b = Value(6.7, label='b')
# b = Value(8.0, label='b')
#b = Value(6.88137358, label='b')
b = Value(6.7, label='b')
# x1 * w1 + x2 * w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1 * w1'
x2w2 = x2 * w2; x2w2.label = 'x2 * w2'
# x1 * w1 + x2 * w2
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1 * w1 + x2 * w2'
# x1 * w1 + x2 * w2 + b
n = x1w1x2w2 + b; n.label = 'n'
# output (with activation function)
# o = n.tanh(); o.label = 'o'
mygraph = draw_dot(n)

Now we add a non-linearity. A common non-linear activation function is tanh(), which our Value object does not yet support. tanh can be expressed as tanh(x) = (e^(2x) - 1) / (e^(2x) + 1).
tanh() caps the value of really large numbers and de-emphasizes their impact on the output of the neuron.
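A few sample values from Python’s math.tanh illustrate the squashing behavior:

```python
import math

# large |x| is squashed toward -1 or 1; near 0 tanh is roughly linear
for x in [-10, -2, 0, 2, 10]:
    print(x, math.tanh(x))
```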

So we add the tanh operation to the Value object definition
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    return out
And we can uncomment the line containing ‘o = n.tanh()’ above
# output (with activation function)
o = n.tanh(); o.label = 'o'
and get this graph

Backpropagation
Now if we want to backpropagate through the graph with the added tanh non-linearity at the end we need to find the derivative of tanh. Before we do that we change the bias b to 6.8813735870195432 to have some ‘nice’ numbers as we backpropagate. The Wikipedia page for tanh shows that the derivative of tanh is 1 - tanh(x)^2, which is convenient as we already calculated tanh. So
o.grad = 1.0
n.grad = 1 - (o.data)**2
Following the rules applied for backpropagation previously we get these derivatives
o.grad = 1.0
n.grad = 1 - (o.data)**2
b.grad = 0.5
x1w1x2w2.grad = 0.5
x1w1.grad = 0.5
x2w2.grad = 0.5
w1.grad = 1.0
w2.grad = 0.0
x1.grad = -1.5
x2.grad = 0.5

Now let’s look at how we can automate the backpropagation. We do this by defining a function we call _backward. At initialization of our Value object _backward is defined as a function that does nothing.
self._backward = lambda: None
To understand how backpropagation is implemented let’s revisit a simple example
a = Value(2)
b = Value(3)
c = a + b
When ‘c = a + b’ is executed __add__(self, other) of Value instance ‘a’ is called. That function creates a new Value instance that Value instance ‘a’ references as ‘out’. ‘c = a + b’ assigns that returned Value instance ‘out’ to Value instance ‘c’.
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out
With backpropagation we want to calculate the impact of the nodes in the graph on the final result, in the case of a neural net, the loss. Starting with the final result we work our way recursively backwards towards the leaves of the graph, applying the chain rule. So as we traverse backwards through the tree we need a function that can be called on each node that is the result of an operation, to set the gradients for the operand(s).
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():
        self.grad = 1.0 * out.grad
        other.grad = 1.0 * out.grad
    out._backward = _backward
    return out
We extend the definition of __add__(self, other) so that besides performing the ‘+’ operation we also define a function _backward that applies the chain rule to calculate the gradient for the two operands of the ‘+’ operation. _backward is defined when the ‘+’ operation is performed but it is not executed yet. At the time the function is defined ‘self’ and ‘other’ reference ‘a’ (self) and ‘b’ (other). out.grad references grad in the out object, which at this time is still 0, the default every Value instance is initialized with. As long as the reference at the time of the definition of the function is correct the value does not matter. The references are resolved when the function is executed in the backward pass after we have built the graph, and at that time a real gradient value will have been assigned to out.grad. While _backward is defined inside the __add__ method, the function is assigned to the result instance ‘out’ that is returned by __add__(self, other). And ‘out’ is assigned to ‘c’. When we call c._backward(), ‘self’, ‘other’ and ‘out’ resolve to the references captured at the time of the definition of the function. The gradients of self (a) and other (b) are updated. As the derivative of an addition is 1, applying the chain rule sets the gradients for ‘self’ and ‘other’ to 1*out.grad.
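A self-contained sketch of the addition example makes the mechanics concrete (my reconstruction; only __add__ is implemented here):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # derivative of a sum w.r.t. each operand is 1, apply the chain rule
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

a = Value(2)
b = Value(3)
c = a + b
c.grad = 1.0   # seed the gradient at the result
c._backward()  # propagate to the operands
print(a.grad, b.grad)  # 1.0 1.0
```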
We do the same for multiplication. To set the gradient for the first operand we need to multiply ‘out.grad’ with the data of ‘other’. For the gradient of the second operand we multiply ‘out.grad’ with the data of ‘self’.
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    def _backward():
        self.grad = other.data * out.grad
        other.grad = self.data * out.grad
    out._backward = _backward
    return out
As tanh only has one operand we only need to set self.grad applying the chain rule, multiplying out.grad with the derivative of tanh.
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    def _backward():
        self.grad = (1 - t**2) * out.grad
    out._backward = _backward
    return out
If we run _backward() for each node starting with the final result node ‘o’ we get this graph

Now let’s look at how we can go backwards. What we need is a function that we can call for the rightmost node ‘o’. It needs to convert the graph to a list of nodes that we can iterate over sequentially. Then we can call the _backward() function sequentially for each node, to populate the gradients of the corresponding operands. The ordering process from graph to list is called a topological sort.
We first initialize an empty list ‘topo’ and define a function ‘build_topo’ that takes a node ‘v’ as an argument. We loop over each child of v (v._prev) and call build_topo again from within build_topo itself, a classical recursion very similar to the trace() function we built above when we wanted to visualize the graph with Graphviz. So we nest build_topo calls like Russian dolls until we hit a node that does not have a child. For that node the for loop is skipped and the node is added to the list with topo.append. Then the for loop processes its sibling (if there is one), calling build_topo on the sibling and following the sibling’s dependencies. This implies that a node only adds itself to the list once all its children have been processed. This is the code
topo = []
def build_topo(v):
    for child in v._prev:
        build_topo(child)
    topo.append(v)
    return
build_topo(o)
And here the same code with some additional print statements and the tree to make it more digestible.


With this list we can now update all of our gradients
build_topo(o)
o.grad = 1
for node in reversed(topo):
    node._backward()

Andrej then adds this function to the Value object so that we can just call backward() (in our example o.backward()) on the root node to start the backpropagation.
def backward(self):
    topo = []
    def build_topo(v):
        for child in v._prev:
            build_topo(child)
        topo.append(v)
        return
    self.grad = 1
    build_topo(self)
    for node in reversed(topo):
        node._backward()
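Put together, a self-contained sketch of the Value class as developed up to this point (my reconstruction with add, mul and backward; the gradients are still assigned with ‘=’, which is the bug discussed next, so the example uses a graph where each node appears only once):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        def build_topo(v):
            for child in v._prev:
                build_topo(child)
            topo.append(v)
        self.grad = 1.0
        build_topo(self)
        for node in reversed(topo):
            node._backward()

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c
d.backward()
print(a.grad, b.grad, c.grad)  # -3.0 2.0 1.0
```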
Next Andrej points out that there is a ‘bug’ in our code. The issue arises when the same node contributes more than once to a graph. In this example

we calculate ‘a+a’. The implicit function __add__(self, other) is called for the ‘+’ operation. __add__ defines the backpropagation function as
def _backward():
    self.grad = 1.0 * out.grad
    other.grad = 1.0 * out.grad
self and other now reference the same object a. When we call backward on ‘b’ to backpropagate, self.grad, referencing Value instance ‘a’, is set to 1. In the next operation in _backward(), other.grad, also referencing Value instance ‘a’, sets the gradient to 1 again. But the gradient should be 2 as ‘a’ contributes twice to the result. Writing ‘a+a’ as 2*a it becomes obvious that the gradient should be 2.
Andrej also shows another slightly more complex example
a = Value(-2, label='a')
b = Value(3, label='b')
d = a * b; d.label = 'd'
e = a + b; e.label = 'e'
f = d * e; f.label = 'f'
f.backward()

The gradients are incorrect. b, for example, influences the loss through two paths: the addition with a and the multiplication with a. The gradient contribution from the addition is ‘1*e.grad’ (-6) and the contribution from the multiplication is ‘-2*d.grad’ (-2), adding up to -8.0. But with the current implementation we call _backward once on node d, setting the gradient for node b to ‘-2*d.grad = -2’, and then later, as we traverse through the list, _backward is called again on node e setting the gradient to ‘1 * e.grad = -6’, overwriting the ‘-2’.
In general, if variables are used more than once their gradients need to accumulate the gradients for each path. We can fix this by replacing ‘=’ in _backward with ‘+=’ for example
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward
    return out
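With the same ‘+=’ fix applied to __add__, the ‘a+a’ example from before behaves correctly (minimal sketch with only addition):

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # '+=' accumulates gradients over multiple paths
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

a = Value(3.0)
b = a + a      # self and other both reference a
b.grad = 1.0
b._backward()
print(a.grad)  # 2.0 - both contributions accumulate
```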

Improving our Value object and breaking up the tanh non-linearity
For convenience we now extend the magic functions in the Value object to automatically convert plain numbers to Value instances, so that adding an integer to a Value object automatically wraps the integer in a Value object.
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
We also want to be independent of the order of operations. Up to now we cannot multiply or add a number with an instance of a Value object if the number is the first element of the operation. For a ‘+’ operation for example, __add__ is always called for the left operand. If the left operand is a number Python does not know that we want to convert that first number to a Value object.

We can use the dunder function __radd__, which Python calls on the right operand when the left operand (e.g. a plain number) does not know how to handle the operation.
def __radd__(self, other):
    return self + other
In __radd__ self references the Value instance on the right side of the operation and other the numeric value on the left. With __radd__ returning the sum of self and other, __add__ is then called. As we added the isinstance line in __add__, the numerical value is automatically converted to an instance of a Value object. This works analogously for __mul__ and __rmul__.
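A small self-contained demonstration of the coercion (my sketch, stripped down to just __add__ and __radd__):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # wrap plain numbers in a Value instance
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __radd__(self, other):
        # called for e.g. 2 + Value(3), where int.__add__ gives up
        return self + other

a = Value(3.0)
print((a + 2).data)  # 5.0 - __add__ wraps the plain 2
print((2 + a).data)  # 5.0 - __radd__ flips the operands
```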
Now we want to break up our tanh functions into smaller exponential operations.
We first define an exponential function for Value. When _backward is called, ‘self’ references the input object of the operation and ‘out’ references the result node of the operation. The exponential function is its own derivative, so applying the chain rule we can set the gradient for self (the input to exp) as the output times the gradient of out.
def exp(self):
    x = self.data
    out = Value(math.exp(x), (self,), _op='exp')
    def _backward():
        self.grad += out.data * out.grad
    out._backward = _backward
    return out
We implement the division of x by y as the product of x and y^(-1). For the derivative we can apply the chain rule in conjunction with the power rule: the derivative of x^n is n * x^(n-1).
So we define
def __pow__(self, other):
    assert isinstance(other, (int, float)), "Called __pow__ with non-int/float argument!"
    x = self.data
    y = other
    t = math.pow(x, y)
    out = Value(t, (self,), f'pow {other}')
    def _backward():
        self.grad += out.grad * y * math.pow(x, (y - 1.0))
    out._backward = _backward
    return out
Note that Andrej just uses the ‘**’ operator where I used math.pow(). The results are the same and it should not matter. Now we can define the division
def __truediv__(self, other):
    return self * other**-1
We also have not implemented subtraction. For that we use the __neg__ magic function (executed when Python detects a prepended ‘-’ operator) and define __sub__ as the sum of self and the negation of the other operand.
def __neg__(self):
    return self * -1

def __sub__(self, other):
    return self + (-other)
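A quick check that subtraction and division fall out of the pieces we defined (my sketch without the gradient machinery, just the data flow):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __pow__(self, other):
        assert isinstance(other, (int, float))
        return Value(self.data ** other)

    def __neg__(self):          # -x  ->  x * -1
        return self * -1

    def __sub__(self, other):   # x - y  ->  x + (-y)
        return self + (-other)

    def __truediv__(self, other):  # x / y  ->  x * y**-1
        return self * other**-1

a = Value(6.0)
b = Value(2.0)
print((a - b).data)  # 4.0
print((a / b).data)  # 3.0
```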
Now we can express tanh as
# input x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# neuron bias
# b = Value(6.7, label='b')
# b = Value(8.0, label='b')
#b = Value(6.88137358, label='b')
b = Value(6.8813735870195432, label='b')
# x1 * w1 + x2 * w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1 * w1'
x2w2 = x2 * w2; x2w2.label = 'x2 * w2'
# x1 * w1 + x2 * w2
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1 * w1 + x2 * w2'
# x1 * w1 + x2 * w2 + b
n = x1w1x2w2 + b; n.label = 'n'
e = (2*n).exp()
o = (e-1) / (e+1); o.label = 'o'
#o = n.tanh()
o.backward()
draw_dot(o)
and should get the same result as before
But wait a moment … these gradients are ‘off’. What happened? “We have a bug” …
In an attempt to keep things simple and minimize what is needed for the topological sort I used this algorithm
topo = []
def build_topo(v):
    for child in v._prev:
        build_topo(child)
    topo.append(v)
    return
build_topo(o)
Andrej’s code was slightly different

I followed the path and got the same results as Andrej when using the tanh() non-linearity.

But when I expressed tanh() as (e^(2x) - 1) / (e^(2x) + 1), I got the wrong results with my ‘shortcut implementation’ of build_topo(). The bogus values are due to the fact that (you guessed it) I am not checking for ‘visited’ nodes in the graph.
For
e = (2*n).exp()
o = (e-1) / (e+1); o.label = 'o'
The graph looks like this (please ignore the wrong gradients here)

The end result ‘o’ is a multiplication of the unlabeled node representing the numerical value of 1/(e+1) (data = 0.1464) and the other unlabeled node (e+1) (data = 4.8284). When we call o.backward()
for child in v._prev:
    build_topo(child)
enters the loop over these two unnamed nodes. For the first node (it does not matter which one it picks first) we traverse recursively towards the left side of the tree. When we are done with the recursion on this first branch, the for loop starts processing the second node, traversing recursively to the end of the tree a second time. This adds the same node multiple times to the ‘linear list’ of nodes. Subsequently _backward() is called multiple times on the same node and gradients are added up when they should not be. So after accounting for visited nodes
def backward(self): # mine
    topo = []
    nodes_visited = []
    def build_topo(v):
        if v not in nodes_visited:
            nodes_visited.append(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
        return
    self.grad = 1
    build_topo(self)
    for node in reversed(topo):
        print(f'{node.label} {node.data}')
        node._backward()
Everything works as expected

The same in Pytorch
The cool thing is that what we have been building so far is very similar to Pytorch in terms of semantics. With Pytorch the model would be defined like this
import torch
x1 = torch.Tensor([2.0]).double() ; x1.requires_grad = True
x2 = torch.Tensor([0.0]).double() ; x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double() ; w1.requires_grad = True
w2 = torch.Tensor([1.0]).double() ; w2.requires_grad = True
b = torch.Tensor([6.8813735870195432]).double() ; b.requires_grad = True
n = x1*w1 + x2*w2 + b
o = torch.tanh(n)
print(o.data.item())
o.backward()
print('---')
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())
We choose double precision floating point for our numbers and set requires_grad to True so that Pytorch will calculate gradients for the leaf nodes, which it does not do by default. Just like our Value object these Pytorch scalar tensor objects have data and grad member variables. One thing that is different is that if we want to extract the numerical value from the scalar tensor we need to append ‘.item()’. And magically we end up with exactly the same gradients

So why would we want to use Pytorch when it effectively does the same as our simple Value object? Because when we are working on real world examples we will work with large matrices and vectors. Pytorch tensors have been optimized for large scale matrix manipulations that Pytorch parallelizes really well.
Building a simple neural network
Now we can start building a neural network based on our Value object
class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1,1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]
At initialization time we specify how many elements the ‘weight vector’ of the neuron has and the values are initialized randomly following a uniform distribution between -1 and 1. The bias is also randomly initialized. When we pass an input vector x (that has to be of the same size as the number of elements in w) the __call__ function in class Neuron calculates the dot product, adds the bias and applies the non-linearity. For the dot product Python’s zip function pairs up each element of x with the corresponding element of w. sum() takes a second argument as the start value for the summation and we use that to add the bias. Then the non-linearity is applied. The result is a ‘number’ in the form of a Value object. The parameters() function returns the weights and the bias.
In neural networks you have multiple neurons an input is applied to. That group of neurons is called a layer of neurons and we organize the neurons in the layer as a Python list. The input to the layer is evaluated independently for each of the neurons in the layer. So while a neuron produces a single number as output for an input vector the layer produces a list of numbers/Value objects, one for each neuron in the layer.
class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
The initialization function takes the number of elements for each neuron (nin) and the number of neurons in the layer (nout) as input and creates a list of neurons. __call__ passes the input vector x to the list of neurons. Each neuron in the layer generates one Value instance in the list. The ‘return outs[0] if …’ line is just for convenience so that in case of a list with a single Value instance we just return the instance itself. The parameters function is a Python list comprehension that loops over all neurons and calls parameters() for each neuron.
A Multi-Layer Perceptron is a list of layers the input passes through. The output of each layer is the input of the next layer.
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
To initialize the MLP we pass the number of input values and a list that specifies the number of outputs for each hidden layer. It is easiest to understand if we look at an example:
x = [2, 3, -1]
n = MLP(3, [4, 4, 1])
n(x)
‘sz’ in __init__ for the MLP is [3, 4, 4, 1] and three layers are generated with
self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
The first layer, i.e. the first element in the list of layers, is initialized with 3 inputs and 4 outputs calling Layer(3, 4). The second layer has 4 inputs and 4 outputs and is initialized with Layer(4, 4), and the last layer has 4 inputs and one output and is initialized with Layer(4, 1). The __call__ function, invoked when we evaluate n(x), goes over all the layers. Starting with the initial input vector, the output of one layer is passed as the input to the next layer (x = layer(x)). In our example, the last layer produces only a single value.
for layer in self.layers:
    x = layer(x)
This image tries to visualize the MLP we built

Now that we have an understanding of the building blocks of a neural network let’s look at how we can quantify the quality of the network. We use the example MLP ‘n’ that we initialized above with
n = MLP(3, [4, 4, 1])
and we feed four input vectors of size 3 into the MLP
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ypred = [n(x) for x in xs]
So each vector is passed through the entire MLP producing a single output value. [2.0, 3.0, -1.0], for example, is the input of the first layer with four neurons. For the first layer we get four output values. The next layer takes four input values and also produces four output values. The final layer takes four input values and produces a single output. ypred is the list of output values, one for each of the four rows in the input matrix xs.
Now we want to get a sense of the quality of the model. To do that we need to have the outputs we want the model to produce for our inputs.
ys = [1.0, -1.0, -1.0, 1.0]
is the output we want our model to generate if we feed it the input xs. For example, the first row of xs [2.0, 3.0, -1.0] fed into the model should produce a 1.0, second row of xs should produce -1.0 etc. This is the basic idea of neural networks: We have a randomly initialized MLP. We know our input values and we know the output we want the network to predict. As the inputs are a given we optimize the parameters of the MLP so that it produces a result as close as possible to the known output. Once the model has been trained that way we hope it generalizes (more on that later as well) and can be used to predict outputs for any set of inputs.
To evaluate the quality of the model we use a number called the loss. We implement a squared error loss for our example (we sum the squared differences; dividing by the number of examples would give the mean squared error).
loss = sum([(yout - ygt)**2 for ygt, yout in zip(ys, ypred)])
We loop over all desired results ys and calculate the difference to the respective predicted result in ypred. As we always want a positive value we square the difference (we could also use the absolute value) and sum these squared differences up. If the model perfectly predicts a value the difference is 0. A perfect model predicts all values accurately and has zero loss.
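With plain floats the calculation looks like this (the prediction values below are made up for illustration; the real ypred holds Value objects):

```python
ys = [1.0, -1.0, -1.0, 1.0]     # desired outputs
ypred = [0.8, -0.2, -0.9, 0.7]  # hypothetical model outputs

# squared difference per example, summed into one scalar loss
loss = sum((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred))
print(round(loss, 2))  # 0.78
```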
Now we can call loss.backward() to calculate the gradient for each weight with respect to the loss i.e. get a number for each weight in our MLP that indicates how that weight contributes to the loss.
loss.backward()

If we call n.parameters() on the entire MLP we see that we have 41 parameters in the model, which adds up as we have a 3×4 matrix (12 w parameters plus 4 b parameters) followed by a 4×4 matrix (16 w parameters plus 4 b parameters) and a 4×1 matrix (4 w parameters plus 1 b parameter). We can now adjust the parameters in the direction of the negative gradient. Remember from the discussion at the beginning that the gradient/slope for a parameter multiplied by a small step width approximates how much the loss increases. To optimize we step in the ‘negative’ direction so that the loss decreases.
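The parameter count of 41 can be checked with a quick back-of-the-envelope sketch (layer sizes taken from the MLP above):

```python
sz = [3, 4, 4, 1]  # input size followed by the layer output sizes

# each layer has nin*nout weights plus nout biases
n_params = sum(sz[i] * sz[i + 1] + sz[i + 1] for i in range(len(sz) - 1))
print(n_params)  # 41
```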
for p in n.parameters():
    p.data += -0.01 * p.grad
So after the loss calculation and parameter adjustments we actually see the loss of our model decrease from 6.88 to 6.24

As we keep looping through the last three cells we see that the loss continues to decrease and eventually we come very close to zero loss. The step width, in our case 0.01, is a parameter that determines the increments for adjusting the model parameters (it is usually called the learning rate). If the step width is too small we converge very slowly toward lower losses. If it is too large we introduce instabilities: we may ‘miss’ the minimum and may not converge at all. After multiple iterations n.parameters() shows the model parameters for which the model delivers a good prediction.
We can now automate this loop that we went through manually.

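The automated loop runs the same four steps in every iteration: forward pass, loss calculation, backward pass, parameter update. As a stand-alone illustration of that structure (not Andrej’s actual code, which runs on the MLP above), here is the same loop on a toy model with a single parameter w whose gradient we can write down analytically:

```python
# Toy stand-in for the training loop: one parameter, loss = (w - target)^2.
# The analytic gradient 2*(w - target) plays the role of loss.backward().
w, target = 0.0, 2.0
for k in range(20):
    loss = (w - target) ** 2  # forward pass + loss
    grad = 2 * (w - target)   # "backward pass" (analytic gradient)
    w += -0.1 * grad          # parameter update, step width 0.1
print(w)  # close to the target 2.0
```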
As Andrej points out, there is a subtle bug in the current implementation, even though the model still converges to a low loss. The issue is that we forgot to reset the gradients before each backward pass. Through the ‘+=’ operator in the _backward function defined for each operation in Value, the gradients accumulate when a parameter contributes to the output through different paths. But the gradients then also add up over the iterations of our training loop, which we don’t want: the gradient must be calculated fresh for each iteration. Hence we need to reset the gradients right before calling backward().
# Reset gradients to 0
for p in n.parameters():
    p.grad = 0.0
# Backward pass
loss.backward()
The only reason this worked before is that the problem is very simple; the accumulation of the gradients is roughly equivalent to an increased step size. Now that we fixed this issue we need more iterations and/or a higher learning rate to converge to a low loss. What we observed is part of what makes working with neural networks ‘tricky’: the network may still appear to work even in the presence of bugs, and here we only got away with it because the problem is so simple.
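The effect of the missing reset can be demonstrated on the same kind of toy problem (a sketch with a made-up one-parameter model, not the MLP itself): summing stale gradients via ‘+=’ effectively inflates the step size and destabilizes the updates.

```python
def train(reset, steps=5, lr=0.1, target=2.0):
    w, g = 0.0, 0.0
    for _ in range(steps):
        if reset:
            g = 0.0            # the fix: zero the gradient each iteration
        g += 2 * (w - target)  # '+=' mirrors micrograd's accumulation
        w += -lr * g           # gradient descent update
    return w

print(train(reset=True))   # steady convergence toward 2.0
print(train(reset=False))  # overshoots: stale gradients inflate the step
```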
I will not be covering the mapping of what we just did to the micrograd Andrej has on GitHub, as it is pretty self-explanatory. What I do want to echo, though, is that PyTorch allows for registering any type of function: as long as we define a forward and a backward pass, we can use that function as a ‘Lego’ building block for the graphs we are building.
