N-gram language models

An N-gram language model generalises the Markov assumption by conditioning the next token on the previous $n-1$ tokens, approximating the true probability as follows: $$P(w_i \mid w_{i-1},\dots,w_{1}) \approx P(w_i \mid w_{i-1},\dots,w_{i-n+1}).$$ When $n=2$, this reduces to the bigram model from the previous demonstration. Using our running example, we would like to find $$P(\text{dog} \mid \text{ the quick brown fox jumps over the lazy}).$$ The N-gram model approximates this as $$\text{2-gram}: P(\text{dog} \mid \text{lazy}),$$ $$\text{3-gram}: P(\text{dog} \mid \text{the lazy}),$$ $$\text{4-gram}: P(\text{dog} \mid \text{over the lazy}),$$ $$\ldots$$

Like the bigram model, we estimate these probabilities from counts in a corpus as follows: $$P(w_i \mid w_{i-1},\dots,w_{i-n+1}) = \frac{\text{C}(w_{i-n+1},\dots,w_{i-1},w_i)}{\text{C}(w_{i-n+1},\dots,w_{i-1})}.$$
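This count-based estimate can be sketched in a few lines of Python (the function names here are illustrative, not part of the demo):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """Estimate P(word | context) = C(context, word) / C(context)."""
    n = len(context) + 1
    counts_n = ngram_counts(tokens, n)          # counts of full n-grams
    counts_ctx = ngram_counts(tokens, n - 1)    # counts of (n-1)-gram contexts
    if counts_ctx[tuple(context)] == 0:
        return 0.0                              # unseen context
    return counts_n[tuple(context) + (word,)] / counts_ctx[tuple(context)]

tokens = "the quick brown fox jumps over the lazy dog".split()
# "the" occurs twice, once followed by "quick", so P(quick | the) = 1/2:
print(ngram_prob(tokens, ["the"], "quick"))
```

Note that for any context seen only once, this estimator assigns probability 1 to the single observed continuation, a point that becomes important in the discussion below.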

N-gram table

Below, you can build an N-gram table from a chosen corpus (or your own text), then inspect the most frequent N-grams. Note that $N-1$ start-of-sentence tokens $\langle s\rangle$ are used to pad the beginning of each sentence.
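A minimal sketch of such a table builder, including the $N-1$ start-of-sentence padding tokens (the corpus and function name below are made up for illustration):

```python
from collections import Counter

def build_ngram_table(sentences, n):
    """Build an n-gram frequency table, padding each sentence with n-1 <s> tokens."""
    table = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split()
        table.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return table

corpus = ["the quick brown fox", "the lazy dog"]
table = build_ngram_table(corpus, 3)
# Inspect the most frequent trigrams:
for ngram, count in table.most_common(3):
    print(ngram, count)
```

The padding ensures that even the first word of a sentence has a full-length context, so trigrams such as $(\langle s\rangle, \langle s\rangle, \text{the})$ appear in the table.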

N-grams as a generative model

This demo generates text either by randomly sampling the next token from $P(w_i \mid w_{i-1},\dots,w_{i-n+1})$, or by greedily (and deterministically) selecting the most likely next token at each step.
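Both generation strategies can be sketched as follows; this is a simplified stand-in for the demo's logic, with illustrative names, using a context-to-counts table and either weighted sampling or an argmax at each step:

```python
import random
from collections import Counter, defaultdict

def train(tokens, n):
    """Map each (n-1)-token context to a Counter of observed next tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        model[context][word] += 1
    return model

def generate(model, n, length, greedy=False, seed=0):
    rng = random.Random(seed)
    context = ("<s>",) * (n - 1)   # start from the padded context
    out = []
    for _ in range(length):
        counts = model.get(context)
        if not counts:
            break                   # unseen context: stop generating
        if greedy:
            word = counts.most_common(1)[0][0]          # deterministic argmax
        else:
            words, weights = zip(*counts.items())
            word = rng.choices(words, weights=weights)[0]  # sample from P
        out.append(word)
        context = context[1:] + (word,)  # slide the context window
    return " ".join(out)

tokens = ["<s>", "<s>"] + "the quick brown fox jumps over the lazy dog".split()
model = train(tokens, 3)
print(generate(model, 3, 10, greedy=True))
```

With this tiny single-sentence corpus every context has exactly one continuation, so greedy and sampled generation both reproduce the training sentence verbatim, which previews the memorisation issue discussed below.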

Discussion

The 3-gram (trigram) model captures more contextual information than the bigram model, and consequently can generate text that appears more coherent. In general, as we increase $N$, the model conditions on a longer history, and the output often seems more grammatically consistent and contextually appropriate. However, this apparent improvement should be interpreted with caution. On closer inspection, higher-order $N$-gram models frequently reproduce long fragments of the training corpus almost verbatim.

This occurs because language is highly diverse and combinatorially rich: many longer $N$-grams (e.g., 6-grams) may occur only once in the dataset. When this happens, the model has effectively memorised that exact continuation rather than learned a generalisable pattern. For example, the famous 6-gram "to be or not to be" occurs only once in Shakespeare's works, so a 6-gram model trained on that corpus would always predict "be" after "to be or not to".

This is overfitting: the model has learned to predict the training data with high accuracy, but it has not learned general patterns that would allow it to generate or predict novel text. To assess the true quality of an N-gram model, we would need to evaluate it on a held-out test set of unseen text.

A common evaluation metric is perplexity, which measures how well the model predicts a sample of text that was not used during training. Roughly speaking, perplexity can be thought of as the average number of guesses the model would need to make to predict the next token in the test set, with lower perplexity indicating better performance. The perplexity of a state-of-the-art trigram model in the 1990s, trained on one million words, was 247. A fantastic resource for understanding perplexity in the context of character prediction can be found here.
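Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token in the test set. A minimal sketch, assuming we already have the model's per-token probabilities:

```python
import math

def perplexity(probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns probability 1/2 to every token has perplexity 2,
# matching the "average number of guesses" intuition:
print(perplexity([0.5, 0.5, 0.5, 0.5]))
```

Note that a single zero-probability token makes the perplexity infinite, which is why practical N-gram models apply smoothing before evaluation.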