felipedu.art

Unpacking ChatGPT: is AGI that close?

Written by Felipe on 24 Sep 2024

Recently, I had the opportunity to give a talk titled “Unpacking ChatGPT: And how close we are from AGI?”. This post expands on that talk, diving into the fundamental concepts behind large language models like ChatGPT, their capabilities, and the ongoing discussion about Artificial General Intelligence (AGI).

At its core, and perhaps surprisingly, a lot of what goes on in these complex systems is just matrix multiplication. This might sound like a massive simplification, but it’s a useful starting point to understand the underlying mechanics. I even shared an anecdote from my high school days in 2009 when Sony released the Cyber Shot camera with “Smile Shutter” functionality, and a friend’s brother, already in university for computer science, remarked, “this is just matrix multiplication”. While the specific implementation details might involve more advanced math, the fundamental operation often boils down to these linear algebra principles.

The revolution powering these models is the Artificial Neural Network (ANN). Artificial Neural Networks were first proposed in the late 1950s and have been used in Computer Science to model complex non-linear problems ever since. However, it was in 2012 that an ANN (AlexNet) first dramatically advanced the state of the art in image recognition, marking a turning point.

So, what exactly is an artificial neuron? You often see them depicted with inputs, weights, and an output. Mathematically, it’s a function that typically takes two vector inputs, multiplies them pairwise, sums the results (a dot product), and then passes that sum through an activation function like the sigmoid. The sigmoid function is particularly useful because it squashes any real number into a value between 0 and 1, resembling a probability, which makes a neuron’s output easy to interpret. Here’s how it can be coded in a simple way:

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

def dot(x, y):
    total = 0
    for i, j in zip(x, y):
        total += i * j
    return total

def artificial_neuron(inputs, weights, bias):
    weighted_sum = dot(inputs, weights) + bias
    output = sigmoid(weighted_sum)
    return output
```

Training these networks is where a lot of the complexity lies. It involves feeding the network data with known inputs and desired outputs, calculating the error between the network’s prediction and the actual value, and then adjusting the network’s parameters (weights and biases) to minimize that error. This process is repeated over and over until the error is acceptably low. While a full explanation of training is “another math class”, the core idea is this: the parameters start out random and, through training with data, are corrected based on the prediction error.
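To make that loop concrete, here is a minimal gradient-descent sketch that trains a single sigmoid neuron to compute logical AND. The learning rate, iteration count, and zero initialization are arbitrary choices for this toy example (real training initializes randomly and uses far more data), not details from the talk:

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

# Toy dataset: inputs and desired outputs for logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights = [0.0, 0.0]   # random in practice; zeros keep this sketch deterministic
bias = 0.0
lr = 0.5               # learning rate (arbitrary choice)

for _ in range(5000):
    for inputs, target in data:
        z = sum(i * w for i, w in zip(inputs, weights)) + bias
        pred = sigmoid(z)
        # Gradient of the squared error through the sigmoid.
        grad = (pred - target) * pred * (1 - pred)
        # Correct each parameter against its contribution to the error.
        weights = [w - lr * grad * i for w, i in zip(weights, inputs)]
        bias -= lr * grad

print(round(sigmoid(sum(i * w for i, w in zip([1, 1], weights)) + bias)))  # 1
print(round(sigmoid(sum(i * w for i, w in zip([0, 1], weights)) + bias)))  # 0
```

The same predict–measure–correct loop, scaled up to billions of parameters and driven by backpropagation, is what trains a large language model.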

Neural Networks primarily work with vectors, which leads to the question: how do we transform words into vectors? We can actually use a neural network itself to produce these word vectors. The process involves mapping words to an initial representation (like binary), feeding this into a network, and training it to predict context words or the central word in a sequence. The vector representation for a word is often the output of an inner layer of this network. These vectors have a remarkable property: we can add and subtract words! For example, the vector for “Lisbon” minus the vector for “Portugal” plus the vector for “France” results in the vector for “Paris”. Similarly, “brother” minus “man” plus “woman” equals “sister”. This demonstrates that the vectors capture semantic relationships.
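The arithmetic itself is plain vector math. The tiny 3-dimensional vectors below are hand-made for illustration only (real embeddings are learned and have hundreds of dimensions), but they show how “Lisbon − Portugal + France” can land nearest to “Paris”:

```python
# Hand-made illustrative vectors, NOT real learned embeddings.
# The dimensions could be read as: [capital-ness, France-ness, Portugal-ness]
vectors = {
    "Lisbon":   [1.0, 0.0, 1.0],
    "Portugal": [0.0, 0.0, 1.0],
    "France":   [0.0, 1.0, 0.0],
    "Paris":    [1.0, 1.0, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(v, vocab):
    # Return the known word whose vector is closest (squared Euclidean distance).
    return min(vocab, key=lambda w: sum((x - y) ** 2 for x, y in zip(v, vocab[w])))

result = add(sub(vectors["Lisbon"], vectors["Portugal"]), vectors["France"])
print(nearest(result, vectors))  # Paris
```

Real systems like word2vec do exactly this lookup, just with learned vectors and cosine similarity instead of toy coordinates.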

Processing text involves putting words together in sequences. Early approaches using recurrent neural networks with a memory module could handle small texts but struggled with larger ones because they had difficulty retaining context over long sequences. Long texts have different grammatical structures, and the meaning of words can change based on their position in the text. To address this, the Transformer architecture was introduced, based on the idea that “Attention is all you need”. Instead of processing word by word with a limited memory, the Transformer looks at the entire sequence simultaneously, correlating words with each other to understand their interactions and context. This is visualized as a matrix showing how words in a phrase relate to each other.
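That “correlate every word with every other word” step can be sketched as scaled dot-product attention. This is a bare-bones version operating on plain lists; it omits the learned projection matrices and multiple heads that a real Transformer uses:

```python
from math import exp, sqrt

def softmax(xs):
    # Normalize scores into weights that sum to 1 (subtracting the max for stability).
    m = max(xs)
    e = [exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys at once."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Score this query against every position in the sequence simultaneously.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in keys]
        weights = softmax(scores)
        # Blend the values according to the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

The matrix of `weights` across all query–key pairs is exactly the word-to-word correlation grid visualized in the talk.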

Now, let’s look at the G and the P in ChatGPT.

  • G stands for Generative. The model is trained to predict the next word in a sequence, which allows it to generate text.
  • P stands for Pre-trained, referring to a training process with two phases.
    1. Unsupervised Pre-training: The model is trained on a large dataset of raw text without specific labels, learning to predict the next word. This is where the model learns text structure and context.
    2. Supervised Fine-tuning: The model is then trained on a smaller, well-curated labeled dataset in a supervised fashion. This post-training phase tailors the model for specific tasks.

The architecture of GPT involves repeating blocks that combine an attention mechanism (where the model learns structure and context) with a feedforward network (where it stores facts and knowledge). This block is repeated multiple times; in the diagram shown in the talk it was 12 times, and why that particular number works was jokingly labeled “Magic”.
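The repetition can be sketched structurally. The skeleton below is hypothetical and uses identity placeholders for the attention and feedforward steps; it is meant only to show the shape of the computation (residual additions around a repeated block), not real model code:

```python
def self_attention(x):
    return x  # placeholder: would correlate every position with every other

def feed_forward(x):
    return x  # placeholder: would transform each position independently

def transformer_block(x):
    # Residual connections: each sublayer's output is added back to its input.
    x = [xi + ai for xi, ai in zip(x, self_attention(x))]  # structure & context
    x = [xi + fi for xi, fi in zip(x, feed_forward(x))]    # facts & knowledge
    return x

def gpt_body(embeddings, n_blocks=12):
    x = embeddings
    for _ in range(n_blocks):  # the block repeated 12 times in the diagram
        x = transformer_block(x)
    return x
```

Swapping the placeholders for real attention and feedforward layers (plus normalization) gives the repeated stack that makes up the bulk of a GPT model.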

This brings us to the big question: what about AGI? Defining general intelligence is tricky. Human intelligence is the most general kind we know, and language is just one aspect of it. Intelligence, broadly, is a behavior of creatures endowed with nervous systems that controls vital functions. The human brain is a Complex Adaptive System, tailored by evolution over millions of years, with billions of flexible neurons and trillions of adaptive connections. In contrast, a model like GPT-4 is estimated to have around 1.76 trillion parameters, but those parameters are fixed once training is complete; the network doesn’t adapt automatically to new data the way a biological brain does.

There are arguments against AGI when looking at current models like ChatGPT:

  • ChatGPT primarily deals with written human language.
  • It has distinct phases of learning (training) and inference (usage), unlike continuous human learning.
  • It learns over a curated dataset and a limited number of tokens; it reproduces variations of what it has seen in training and cannot truly invent.
  • Training is extremely power-hungry: training GPT-3 took around 1,000 MWh, while the human brain runs on roughly 20 W. Scaling requires either more compute power (which we know how to add) or more data (and we don’t know how to scale “good” datasets beyond what’s already available).

However, there are also arguments in favor of AGI based on current capabilities:

  • GPT is already general enough to understand and seamlessly translate between many different languages.
  • It can store a vast amount of information in a very compact way (a figure of about 7.2 terabytes was mentioned for GPT-4).
  • It can do math, and research is ongoing to train models to solve complex math problems.
  • Even if it’s “just a bunch of matrix multiplications,” it is still highly effective for making products.

In final remarks, I believe Large Language Models (LLMs) will become ubiquitous in our productive tools. AI based on ANNs primarily generates content based on what it has seen during training. This doesn’t make them useless, nor does it necessarily make them dangerous. However, they do consume a huge amount of energy. Furthermore, being trained on vast amounts of data available online presents significant risks concerning intellectual property, serious journalism, and democracy.

Understanding the mechanisms behind these powerful tools is crucial as they integrate further into our lives.

Thanks for reading! If you have any questions, feel free to reach out.