Language Models in the Age of AI

Large Language Models (LLMs), such as ChatGPT, Gemini, Llama and Claude, have revolutionised the way in which computers process and generate natural language. These systems can write essays, produce code, summarise long documents, translate between languages, and engage in sustained dialogue with a level of fluency that would have seemed implausible only a decade ago.

However, despite their growing presence in everyday life, the basic ideas behind these models remain unfamiliar to most people. Terms like "neural networks", "tokens", and "transformers" can make the subject feel more daunting than it really is. This interactive series is designed to make these ideas intuitive and accessible.

We begin by exploring the simplest statistical language models, which have been used since the 1980s, drawing inspiration from the excellent book Speech and Language Processing by Daniel Jurafsky and James H. Martin. We then describe Word2vec and its role in word embeddings, a major breakthrough that unlocked the power of neural networks for natural language processing and ultimately inspired the LLM architectures we use today.

Language as a random process

To understand Natural Language Processing (NLP), it is essential to set aside our preconceived notions about language and instead treat it (perhaps unromantically) as a stochastic process. A language model, including state-of-the-art LLMs, is designed to answer one fundamental question:

Given the words (or, more specifically, the tokens) we have already observed, what is the most likely next token?

For example, consider the following incomplete sentence:

“The quick brown fox jumps over the lazy ___”

Most English speakers would confidently fill in "dog", recognising the well-known pangram. However, the prediction becomes far less certain in the following example:

“The dog was ___”

Now the situation is different. The next token could plausibly be "big", "old", "eating", "smelly", "barking", or many others. Other words, such as "car" or a second "was", would be jarring in this context.

Rather than making a single guess, a language model assigns a probability to every possible next token. Good predictions place high probability on words that fit the context and low probability on those that do not. This is the central idea: a language model learns a probability distribution over sequences of tokens. Exactly how these probabilities are estimated is what distinguishes one language model from another and ultimately determines its quality.
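The simplest way to estimate such a distribution is to count continuations in a corpus. The sketch below is a minimal illustration of this idea, using a tiny hand-written toy corpus (an assumption made for demonstration; real models are trained on vastly larger collections of text):

```python
from collections import Counter

# A tiny toy corpus for illustration only; real language models
# are trained on billions of tokens.
corpus = (
    "the dog was big . the dog was old . the dog was barking . "
    "the cat was old . the dog was big ."
).split()

def next_token_distribution(context, tokens):
    """Estimate P(next token | context) by counting which token
    follows each occurrence of `context` in `tokens`."""
    n = len(context)
    counts = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == list(context)
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

dist = next_token_distribution(["dog", "was"], corpus)
print(dist)  # {'big': 0.5, 'old': 0.25, 'barking': 0.25}
```

Because "big" follows "dog was" twice in this toy corpus while "old" and "barking" each appear once, it receives the highest probability. How a model estimates these probabilities, and how it generalises to contexts it has never seen, is exactly what separates simple counting schemes from modern LLMs.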

Exercise: predict the next word

Open your preferred messaging app on your phone (e.g., iMessage or WhatsApp). Begin typing a sentence using autocomplete only; you may need to provide the first word as a prompt. Allow the suggested words to guide the sentence. As you do this, consider how coherent the resulting sentence is.

Predictive text systems can generate sentences that are grammatically plausible but semantically nonsensical. Certainly, we would not expect to hold a conversation with a predictive text system as we would with a modern LLM. Nevertheless, the underlying principle is similar: both systems estimate the probability of the next token given the preceding context.

In the context of a messaging app, this simple approach is often sufficient. Users tend to rely on common phrases and informal patterns, and they are generally tolerant of occasional errors. For this limited task, a lightweight predictive model is entirely fit for purpose.
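The exercise above can be mimicked in a few lines of code: build a table of which word most often follows each word, then repeatedly tap the "top suggestion". The training text and function names below are assumptions for illustration, not how any real keyboard is implemented:

```python
from collections import Counter, defaultdict

# Toy training text, chosen for illustration.
text = (
    "see you later . see you soon . see you later today . "
    "talk to you later ."
).split()

# Count which word follows each word (a simple bigram table).
follows = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    follows[prev][nxt] += 1

def autocomplete(word, max_words=5):
    """Greedily append the most frequent next word, mimicking
    repeatedly tapping the middle suggestion in a messaging app."""
    out = [word]
    for _ in range(max_words):
        if word not in follows:
            break
        word = follows[word].most_common(1)[0][0]
        if word == ".":
            break
        out.append(word)
    return " ".join(out)

print(autocomplete("see"))  # see you later
```

Starting from "see", the greedy loop reproduces the corpus's most common phrase. Like a phone keyboard, it always picks the single likeliest continuation, which is why such systems drift towards bland, repetitive sentences.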

More broadly, autocompletion can have large cumulative effects on the user experience. Google estimates that its search engine autocomplete feature saves its users 200 years of typing every day.

Next demonstration

In the first demonstration, we explore a simple class of language models called Markov models.