As I am currently working on a language processing project, I am going to post various interesting findings from time to time.
For example: how many words do we use to name a term? Based on a sample of ~1 million (English) terms, here is a chart showing the percentage of terms versus the number of words used to describe the term:
You can see that 20% of things are called by a single word. Most things (about 40%) are called using two words, and then it goes down. Interestingly, from two words onward, the probability of using one more word to describe a term is roughly cut in half each time (the actual coefficient is ~0.48)!
| Terms with more than n words | Count | Multiplier |
| ngram > 0 (total) | 974267 | |
| ngram > 1 | 779238 | 0.800 |
| ngram > 2 | 375672 | 0.482 |
| ngram > 3 | 163705 | 0.436 |
| ngram > 4 | 79690 | 0.487 |
| ngram > 5 | 38585 | 0.484 |
| ngram > 6 | 17478 | 0.453 |
| ngram > 7 | 8439 | 0.483 |
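For illustration, here is a minimal Python sketch of how such a breakdown (cumulative counts and row-to-row multipliers) can be computed from a list of terms. The term list below is just a tiny placeholder; the numbers in the table came from the ~1M-term sample.

```python
from collections import Counter

# Tiny placeholder list of terms; the real numbers came from a ~1M-term English sample.
terms = ["machine", "machine learning", "natural language processing",
         "data", "data science", "support vector machine"]

# How many terms consist of exactly n words.
by_length = Counter(len(term.split()) for term in terms)
total = sum(by_length.values())

# Cumulative counts, as in the table above: terms with MORE than n words.
max_len = max(by_length)
more_than = {n: sum(c for length, c in by_length.items() if length > n)
             for n in range(max_len)}

prev = total
for n in sorted(more_than):
    count = more_than[n]
    multiplier = count / prev if prev else float("nan")  # ratio to the previous row
    print(f"ngram > {n}: {count}  (multiplier: {multiplier:.3f})")
    prev = count
```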
That’s kind of a semantic law 😉
To be a bit more precise, this is not a Poisson distribution, but an Erlang distribution.
After a brief check, the probability of an n-word term is approximately:
P(n) ~ Erlang(n,5,2)
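As a rough sanity check, here is a small Python sketch that compares the observed share of n-word terms (derived from the table above by differencing consecutive rows) with the Erlang density. It assumes the notation means shape k = 5 and rate λ = 2; that reading of the parameters is my assumption, not stated explicitly above.

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang density: lam^k * x^(k-1) * exp(-lam*x) / (k-1)!"""
    return (lam ** k) * (x ** (k - 1)) * math.exp(-lam * x) / math.factorial(k - 1)

# Observed share of terms with exactly n words, from the table above
# (difference of consecutive "ngram > n" counts, divided by the total 974267).
observed = {
    1: (974267 - 779238) / 974267,
    2: (779238 - 375672) / 974267,
    3: (375672 - 163705) / 974267,
    4: (163705 - 79690) / 974267,
    5: (79690 - 38585) / 974267,
    6: (38585 - 17478) / 974267,
    7: (17478 - 8439) / 974267,
}

for n, share in observed.items():
    print(f"n={n}: observed {share:.3f}, Erlang(k=5, lam=2) density {erlang_pdf(n, 5, 2):.3f}")
```

Under this reading, the mode of the distribution is (k − 1)/λ = 2, which matches the two-word peak above.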
Just nice to know 🙂