Bounds for Language Model Perplexity

A language model’s quality is often measured by perplexity. In this post, we will derive bounds on perplexity.

What is Perplexity?

Given a language model \(L\), the perplexity of the model on a sequence of words \(w_1, w_2, \ldots, w_n\) is defined as:

\[ppl(w_1, \ldots w_n) = P(w_1, \ldots w_n)^{-\frac{1}{n}}\]

where \(P(w_1, \ldots w_n)\) is the probability of the sequence \(w_1, \ldots w_n\) under the language model \(L\).

We can think of perplexity as the wordwise uncertainty of the language model. Let’s look at some of its properties.
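To make the definition concrete, here is a minimal sketch in Python (the `perplexity` helper and its interface are illustrative, not tied to any particular library) that computes perplexity from the probability a model assigns to each word of a sequence, summing log-probabilities for numerical stability.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence from the probabilities a model assigns to its words.

    word_probs[i] is P(w_i | w_1, ..., w_{i-1}) under the model; the product of
    these terms is P(w_1, ..., w_n), and perplexity is that product to the power -1/n.
    """
    if any(p == 0 for p in word_probs):
        return math.inf  # a zero-probability word makes the product 0 and the perplexity infinite
    n = len(word_probs)
    # Sum log-probabilities instead of multiplying, to avoid floating-point underflow.
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)
```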

Properties of Perplexity

Assume that a language model has vocabulary size \(V\). If the model guesses each next word uniformly at random, then the perplexity is \(V\). This is because the probability of every next word is \(1/V\), and hence the perplexity is

\[ppl(w_1, \ldots w_n) = (P(w_1) P(w_2 | w_1) \ldots P(w_n | w_{n-1} \ldots w_1))^{-\frac{1}{n}} = \left(\left(\frac{1}{V}\right)^{n}\right)^{-\frac{1}{n}} = V\]

If the model has memorized the word sequence, then it assigns probability 1 to every next word. In this case, the perplexity is

\[ppl(w_1, \ldots w_n) = (P(w_1) P(w_2 | w_1) \ldots P(w_n | w_{n-1} \ldots w_1))^{-\frac{1}{n}} = (1^n)^{-\frac{1}{n}} = 1\]

However, suppose the model encounters an unseen word at some position \(i\) and assigns it probability 0. Then we see that

\[ppl(w_1, \ldots w_n) = (P(w_1) \cdots P(w_i | \ldots) \cdots P(w_n | \ldots))^{-\frac{1}{n}} = 0^{-\frac{1}{n}} = \infty\]
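Continuing the sketch above, these three cases can be checked numerically: uniform guessing over a vocabulary of size \(V\) gives perplexity \(V\), a memorized sequence gives perplexity 1, and a single zero-probability word sends the perplexity to infinity.

```python
V, n = 10000, 50

# Uniform guessing: every word gets probability 1/V, so perplexity is V.
print(perplexity([1 / V] * n))                 # 10000.0 (up to floating point)

# Memorized sequence: every word gets probability 1, so perplexity is 1.
print(perplexity([1.0] * n))                   # 1.0

# One unseen word with probability 0 makes the whole product 0.
print(perplexity([1 / V] * (n - 1) + [0.0]))   # inf
```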

Bounds for Perplexity

Let’s now derive bounds on perplexity. In this part, we will focus on average sentencewise perplexity and show some interesting results.

Proposition 1: Given any corpus, there is a language model that has average sentencewise perplexity less than or equal to the vocabulary size.

Proof: we can always construct a language model with perplexity equal to the vocabulary size by assigning the same probability \(1/V\) to every word. Any sensible language model should therefore achieve average sentencewise perplexity of at most the vocabulary size \(V\).

An interesting question is whether there is a corpus where the best language model has perplexity equal to the vocabulary size.

Proposition 2: For a corpus where every word is chosen independently and uniformly at random, the best possible language model has perplexity equal to the vocabulary size.

Proof: consider a theoretical corpus where every word is chosen independently and uniformly at random. Since the words are independent, it suffices to consider a model that assigns each vocabulary word \(v_i\) a fixed probability \(P(v_i)\). In a long corpus, each vocabulary word appears roughly \(n/V\) times, so

\[ppl(w_1, \ldots w_n) = (P(w_1) P(w_2) \ldots P(w_n))^{-\frac{1}{n}} \sim \left((P(v_1)P(v_2) \ldots P(v_V))^\frac{n}{V}\right)^{-\frac{1}{n}} = (P(v_1)P(v_2) \ldots P(v_V))^{-\frac{1}{V}}\]

Since the \(P(v_i)\) sum to 1, the AM-GM inequality (the arithmetic mean is at least the geometric mean) gives

\[\frac{1}{V} \sum_{i=1}^V P(v_i) = \frac{1}{V} \geq \left(\prod_{i=1}^V P(v_i)\right)^{\frac{1}{V}}\]

The perplexity on this corpus is the reciprocal of this geometric mean, so \(ppl(w_1, \ldots w_n) = \left(\prod_{i=1}^V P(v_i)\right)^{-\frac{1}{V}} \geq V\); that is, the lower bound of the perplexity on this corpus is \(V\). Furthermore, equality holds in AM-GM if and only if \(P(v_i) = \frac{1}{V}\) for all \(i\), so this bound is achieved exactly by the uniform model.

Thus the bound in Proposition 1 is tight.
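As a sanity check on Proposition 2, the following sketch (reusing the `perplexity` helper above; the skewed model is an arbitrary illustrative choice) draws a corpus whose words are chosen independently and uniformly at random and compares a uniform unigram model against a skewed one. The uniform model attains perplexity \(V\), while the skewed model does strictly worse, as the AM-GM argument predicts.

```python
import random

random.seed(0)
V, n = 50, 100_000
corpus = [random.randrange(V) for _ in range(n)]  # each word uniform over the vocabulary

# Uniform unigram model: P(v) = 1/V for every vocabulary word v.
uniform_probs = [1 / V for _ in corpus]

# Skewed unigram model: word 0 gets half the mass, the rest share the other half.
skewed = [0.5] + [0.5 / (V - 1)] * (V - 1)
skewed_probs = [skewed[w] for w in corpus]

print(perplexity(uniform_probs))  # 50.0, i.e. exactly V (up to floating point)
print(perplexity(skewed_probs))   # noticeably larger than 50 on this corpus
```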



