“An alien spaceship crashes into the Nevada Desert. Eight creatures emerge: a wug, a plit, a blicket, a flark, a warit, a jude, a ralex, and a timon. In at least two thousand words, describe what happens next.” Go ahead, try it if you want. A curious pattern is likely to emerge. The second most used alien will appear about half as often as the most used alien in your story. The third most used alien will appear about a third as often, and so on, with the eighth most used alien appearing about an eighth as often as the most used one.
What word do you think you use the most? Chances are, it is the word “the.” How about the second? Perhaps unsurprisingly, it is the word “be.” In spoken and written English, the word “be” appears about half as often as the word “the.” The third most common word, “to,” appears about one third as often as the word “the.” The 5,432nd most common word, “grind,” appears about 1/5432 as often as the word “the” in the English language. What accounts for this?
The pattern in which the frequency of any word is inversely proportional to its rank is called Zipf’s Law, named after the American linguist George Kingsley Zipf. Amazingly, Zipf’s Law does not just apply to English. It has been observed in nearly every language studied, from Spanish to Turkish to Zulu to Tamil. Even languages we haven’t translated yet, like Meroitic, have been shown to obey Zipf. When you plot word rank versus frequency, you get a distribution that looks like this, with word rank on the x-axis and relative frequency on the y-axis:
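For readers who want to see the numbers behind such a plot, here is a minimal Python sketch that ranks the words in a text and compares each count with the idealized Zipf prediction (the top word's count divided by the rank). The function names `rank_frequency` and `zipf_prediction` and the toy sentence are assumptions made for illustration; a text this short is far too small to show the law reliably, so try it on a full book or article.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Count predicted by an idealized Zipf's Law: top_count / rank."""
    return top_count / rank


# Toy text, assumed for illustration only; real tests need a large corpus.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
top_count = table[0][2]

# Print rank, word, observed count, and the idealized Zipf count side by side.
for rank, word, count in table:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On a large text, the observed and predicted columns should track each other roughly, though real data rarely match the idealized 1/rank curve exactly.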
This curve, or probability density function, is called the Pareto distribution. Natural language follows a discrete form of the continuous Pareto distribution. When the data are graphed on a log-log scale, you get a straight line. For natural language, it looks something like this:
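The straight line has a simple explanation. If frequency is proportional to 1/rank^s for some exponent s, then log(frequency) equals a constant minus s times log(rank), which is the equation of a line with slope -s. The sketch below checks this on idealized data with s = 1; the helper name `loglog_slope` and the synthetic data are assumptions for illustration, not measurements from a real corpus.

```python
import math


def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) plotted against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Idealized Zipf data: frequency exactly proportional to 1 / rank.
ranks = list(range(1, 1001))
freqs = [1.0 / r for r in ranks]

slope = loglog_slope(ranks, freqs)
print(round(slope, 6))  # very close to -1 for this idealized data
```

Feeding real rank and frequency counts into the same function gives an estimate of the exponent for that text; for natural language it tends to land near -1, though the fit is usually imperfect at the very top and very bottom ranks.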
It’s not just language that is distributed in this way, however. City populations roughly follow the Pareto distribution. Last names, ingredients found in cookbooks and recipes, and earthquake intensities all follow the Pareto distribution. Even the rate at which we forget, the famous Ebbinghaus curve, follows a Pareto distribution.
An interesting outcome of the Pareto distribution is the famous Pareto Principle, which states that in many situations, roughly 20 percent of the causes are responsible for 80 percent of the outcomes. This idea is everywhere. Microsoft reported that 80% of the errors in their products came from 20% of the bugs detected. The richest 20% of people have about 80% of the world’s wealth. In a house, about 20% of the carpet receives 80% of the wear. And of course, as we’ve seen, language follows a similar pattern.
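The 80/20 split is not built into every Pareto distribution; it corresponds to one particular value of the distribution's shape parameter. For a continuous Pareto distribution with shape alpha greater than 1, the share of the total held by the top fraction p works out to p^(1 - 1/alpha). The sketch below, with the assumed helper name `top_share`, shows that alpha of about 1.16 reproduces the 80/20 rule, while a larger alpha gives a more even split.

```python
import math


def top_share(top_fraction, alpha):
    """Share of the total held by the top `top_fraction` under a Pareto
    distribution with shape `alpha` (alpha must exceed 1 so the mean is finite)."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return top_fraction ** (1 - 1 / alpha)


# Shape parameter at which the top 20% hold 80% of the total.
ALPHA_80_20 = math.log(5) / math.log(4)  # about 1.161

print(round(top_share(0.20, ALPHA_80_20), 3))  # the classic 80/20 case
print(round(top_share(0.20, 2.0), 3))          # a less extreme distribution
```

The 80/20 figure is therefore best read as a rule of thumb for one family of heavy-tailed data rather than a law that every such dataset obeys.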
How can something as creative and personal as language be described by the strict rules of mathematics? It’s hard to know. Some believe that the distribution of words in a language is a compromise between speakers wanting to be as concise as possible and listeners wanting as much detail as possible. Beyond language, this curious curve has implications for our understanding of processes as disparate as wealth distribution and memory loss. Why is this? It is fun to hypothesize, but until we know for sure, we can continue to marvel at the strange little function that models so much of our world: the Pareto distribution.
And just for fun:
- Power laws, Pareto distributions and Zipf’s law. M. E. J. Newman, Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor. https://arxiv.org/pdf/cond-mat/0412004.pdf
- Zipf’s word frequency law in natural language: a critical review and future directions. Steven T. Piantadosi, University of Rochester. https://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.pdf
- The Zipf Mystery. Michael Stevens. Vsauce. https://www.youtube.com/watch?v=fCn8zs912OE