Concepton

a device that is generating concepts



Semantics: How many words do we use to name something?

I am currently working on a language processing project, so I am going to post interesting findings from time to time.

For example: how many words do we use to express a term? Based on a sample of ~1 million English terms, here is a chart showing the percentage of terms versus the number of words in the term:

[Chart: words count]

You can see that we name about 20% of things with a single word. The largest share (about 40%) gets two words, and then it goes down. Interestingly, from two words on, the probability of using yet another word to describe the term is cut roughly in half each time (the coefficient is actually ~0.48)!

Term length        Count     Multiplier
ngram>0 (total)    974267
ngram>1            779238    0.800
ngram>2            375672    0.482
ngram>3            163705    0.436
ngram>4            79690     0.487
ngram>5            38585     0.484
ngram>6            17478     0.453
ngram>7            8439      0.483
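For anyone who wants to recheck the numbers, here is a minimal sketch that recomputes the multipliers from the counts in the table. It assumes that "ngram>k" means the number of terms with more than k words; the variable names are mine and purely illustrative.

```python
# Counts of terms with more than k words, for k = 0..7 (copied from the table).
# ASSUMPTION: "ngram>k" is read as "terms longer than k words".
more_than = [974267, 779238, 375672, 163705, 79690, 38585, 17478, 8439]
total = more_than[0]

# Multiplier for row k: share of the terms longer than k-1 words
# that are also longer than k words.
multipliers = [more_than[k] / more_than[k - 1] for k in range(1, len(more_than))]

# Share of terms with exactly k words: (count > k-1) minus (count > k), over the total.
exact_share = [(more_than[k - 1] - more_than[k]) / total
               for k in range(1, len(more_than))]

for k, (m, s) in enumerate(zip(multipliers, exact_share), start=1):
    print(f"ngram>{k}: multiplier {m:.3f}, share with exactly {k} word(s) {s:.3f}")
```

The second list is where the "20% with one word, about 40% with two" reading comes from: it is simply the difference between neighbouring rows divided by the total.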

That’s a kind of semantic law 😉

To be a bit more precise, this is not a Poisson distribution but an “Erlang distribution”.

After a brief check, the probability of an n-gram is roughly:

P(n) ~ Erlang(n,5,2)
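To make the formula a bit more tangible, here is a rough sketch that evaluates an Erlang density at n = 1..7 and puts it next to the shares from the table. Note the assumption: I read Erlang(n,5,2) as shape 5 and rate 2, which the formula above does not spell out, so treat this as one possible reading and not a definitive fit. The function and variable names are illustrative.

```python
import math

def erlang_pdf(x, shape, rate):
    """Erlang density: rate^shape * x^(shape-1) * exp(-rate*x) / (shape-1)!"""
    return (rate ** shape * x ** (shape - 1) * math.exp(-rate * x)
            / math.factorial(shape - 1))

# ASSUMPTION: Erlang(n,5,2) is interpreted as shape=5, rate=2.
SHAPE, RATE = 5, 2

# Counts of terms with more than k words (k = 0..7), copied from the table.
more_than = [974267, 779238, 375672, 163705, 79690, 38585, 17478, 8439]
total = more_than[0]

# Model density and observed share of terms with exactly n words.
model = {n: erlang_pdf(n, SHAPE, RATE) for n in range(1, 8)}
observed = {n: (more_than[n - 1] - more_than[n]) / total for n in range(1, 8)}

for n in range(1, 8):
    print(f"n={n}: Erlang density {model[n]:.3f}, observed share {observed[n]:.3f}")
```

Under this reading the density peaks at n = 2 and comes out close to the observed shares for one and two words; for longer terms the match is looser. The density is also continuous, so its values at whole numbers are only an approximation of the shares.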

 

Just nice to know 🙂