a device that is generating concepts

Semantics: How many words are we using to name something


As I am currently working on a language processing project, I am going to post interesting findings from time to time.

For example: how many words do we use to name a term? Based on a sample of ~1 million English terms, here is a chart showing the percentage of terms versus the number of words used in the term:

[Chart: percentage of terms vs. word count]


You can see that we name 20% of things with a single word. The largest share of things (about 40%) we name with two words, and then it goes down. Interestingly, from two words on, the number of terms that use more words is cut roughly in half with each extra word (the coefficient is actually ~0.48)!

Words in term      Terms     Multiplier (vs. previous row)
ngram>0 (total)    974267
ngram>1            779238    0.800
ngram>2            375672    0.482
ngram>3            163705    0.436
ngram>4            79690     0.487
ngram>5            38585     0.484
ngram>6            17478     0.453
ngram>7            8439      0.483
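
The multipliers in the table are ratios of consecutive cumulative counts, and the 20% and 40% figures follow from differences between them. A minimal sketch of that arithmetic, using only the counts listed above (the function names are hypothetical, chosen for illustration):

```python
# Cumulative counts copied from the table: counts[k] is the number of terms
# with more than k words (k = 0 is the whole sample).
counts = [974267, 779238, 375672, 163705, 79690, 38585, 17478, 8439]


def multipliers(cumulative):
    """Ratio of each cumulative count to the count on the previous row."""
    return [round(b / a, 3) for a, b in zip(cumulative, cumulative[1:])]


def exact_length_shares(cumulative):
    """Fraction of all terms that have exactly 1, 2, 3, ... words."""
    total = cumulative[0]
    return [round((a - b) / total, 3) for a, b in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    print("multipliers:", multipliers(counts))
    print("shares by exact word count:", exact_length_shares(counts))
```

The first two entries of the shares list correspond to the one-word and two-word terms discussed in the text.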

That’s a kind of semantic law 😉

To be a bit more precise, this is not a Poisson distribution but an Erlang distribution.

After a brief check, the probability of an n-gram is roughly:

P(n) ~ Erlang(n,5,2)
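
A rough way to check this fit, assuming that Erlang(n,5,2) means the Erlang density with shape 5 and rate 2 evaluated at the integer n (the post does not spell out the parameter order, so this reading is an assumption), is to compare it with the observed shares derived from the counts above. The helper names are hypothetical:

```python
import math

# Cumulative counts from the table in this post:
# COUNTS[k] is the number of terms with more than k words.
COUNTS = [974267, 779238, 375672, 163705, 79690, 38585, 17478, 8439]


def erlang_pdf(x, shape, rate):
    """Erlang density: rate^shape * x^(shape-1) * exp(-rate*x) / (shape-1)!"""
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.factorial(shape - 1)


def observed_share(n, cumulative=COUNTS):
    """Observed fraction of terms with exactly n words (1 <= n <= 7)."""
    return (cumulative[n - 1] - cumulative[n]) / cumulative[0]


if __name__ == "__main__":
    # Assumed parameters: shape=5, rate=2 (one reading of "Erlang(n,5,2)").
    for n in range(1, 8):
        print(n, round(observed_share(n), 3), round(erlang_pdf(n, shape=5, rate=2), 3))
```

Under that reading, the density comes out at about 0.18, 0.39, 0.27 and 0.11 for n = 1 to 4, against observed shares of about 0.20, 0.41, 0.22 and 0.09, so the shape is similar but the match is only approximate.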


Just nice to know 🙂


Author: Andrey Gabdulin Product Development

4 thoughts on “Semantics: How many words are we using to name something”

  1. And interestingly enough, the words we use most often are also shorter than the words we use less often. This is the least-energy principle in action.

    one, two, three, four, five.

    Lots shorter than the next ones and higher.

    • Generally speaking, it might be a trend at the semantic macro level (I need to double-check, or if you have an article on it, I would be glad to see it), though in detail it is definitely not a rule. Take a look at the most common words in English: you can see high variability, between 1 and 7 letters.
      And specifically here, you can see that there are more 2-word tokens than 1-word tokens (even though this is not about frequency of use, but about definitions).

  2. What corpus are you using? What constitutes an occurrence of a description? I’m curious to see how it would play out cross-linguistically.

    • In this particular work it was wiki data, so you can easily check it for other languages. I am pretty sure that it would differ dramatically between, let’s say, German with its long concatenations and Hebrew with its short words and limited baseline variability, where words get combined more.
      The assumption is that in an encyclopedia a definition tends toward its minimal form, which makes it possible to accumulate n-grams as indicators of language complexity and of the maturity of the society.
      It is also possible to do this by processing the 1-grams, 2-grams, 3-grams… of the Google ngram database for the last 500 years.
