Alphabet compression for language efficiency improvement (English)


A couple of days ago, I was discussing the specifics of the Arabic language with Bassam, & he mentioned the work of Said Akl, who developed a new Lebanese alphabet with simplicity as one of its guiding principles.

I thought about various ways to reach simplicity, both in the energy required to learn & read and in the preservation of information. One direction towards simplification might be to unify letters with similar phonetics, an approach that many languages have adopted throughout their evolution. The downside, as I see it, is that this is a kind of “lossy compression”. A minimal-energy approach that focuses on the simplicity of the alphabet & less on preserving its quality can cost many words their uniqueness. Reducing the variability of the language (one of the characteristics of beauty) is meant to make it easier for the brain to learn, yet the total energy consumption over a lifetime might actually be higher. Removing the uniqueness of the sound-letter mapping can cause different words to be spelled the same way, i.e. a growth in the number of homonyms & homographs. So indeed, it would be easier to learn to read, but we would persistently spend more energy “decoding” the meaning from context.

Conclusion: we need something that gives us simplicity of writing without losing the heterogeneity of the language. How can we do that? For example, by analyzing language statistics & applying some basic “lossless compression”.

Let’s take English words & split each of them into combinations of 2 & 3 letters. For each combination we can calculate its frequency of appearance. In data compression theory these are called “second-order” and “third-order” models. You can calculate them yourself or rely on existing work [second-order model link][third-order model link].
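If you want to calculate it yourself, here is a minimal Python sketch of the counting step. The function name `ngram_frequencies` and the sample sentence are made up for illustration, and the choice to count only within words and to ignore case is my assumption. Frequencies here are relative to the total number of combinations found, not to the number of characters.

```python
from collections import Counter
import re


def ngram_frequencies(text, n):
    """Return relative frequencies of letter n-grams taken inside words."""
    counts = Counter()
    # Assumption: only a-z letters count, and combinations never span a space.
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Toy input; a real run would use a large English corpus instead.
sample = "the cat and the hat"
digrams = ngram_frequencies(sample, 2)   # "second-order" in the post's terms
trigrams = ngram_frequencies(sample, 3)  # "third-order" in the post's terms
```

Sorting the resulting dictionary by value gives the kind of ranking discussed next.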

At the top we have some pretty obvious combinations & some that are less expected. Now we can take the most frequent combinations & define a new character to represent each one. Some already exist (e.g. and = &, at = @). For the others, development & standardization will be needed.
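To show that such a replacement can stay lossless, here is a small Python sketch of encoding and decoding. The `LIGATURES` table is hypothetical: only "at" = "@" comes from the text above, and the other entries use Unicode Private Use Area code points as stand-ins, not as proposed glyphs. The left-to-right greedy matching is also just one possible choice.

```python
# Hypothetical mapping from a letter pair to a single stand-in symbol.
LIGATURES = {"th": "\ue000", "he": "\ue001", "in": "\ue002", "at": "@"}


def encode(text, table=LIGATURES):
    """Replace each listed pair with its symbol, scanning left to right."""
    # The round trip is only safe if the symbols are not already in the text.
    if any(symbol in text for symbol in table.values()):
        raise ValueError("text already contains a stand-in symbol")
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode(text, table=LIGATURES):
    """Expand every stand-in symbol back into its original pair."""
    reverse = {symbol: pair for pair, symbol in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


original = "the cat sat in the hat"
packed = encode(original)
restored = decode(packed)
```

On this toy sentence `restored` equals `original`, and `packed` is 16 characters long instead of 22. The sketch handles lowercase pairs only and refuses text that already contains "@", which a real scheme would have to resolve.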

Based on the second-order model alone, we can save 5% of the text by replacing 2 characters with one:


…et cetera.

All tables and charts are uploaded. Feel free to use them.

Plotting the potential saving against the number of new characters, based on the second-order model:


Of course, it does not make sense to define 300 new characters, unless we take the compression to the extreme and define hieroglyphs.

There should be some optimal point between the complexity of the alphabet and its efficiency. Adding 300 more characters seems very inefficient for 30% compaction, while the 6 additional characters from the table above would shrink our written content by 5%.
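The trade-off above can be explored with a rough Python sketch that measures the saving for 1, 2, 3… new characters. The names `digram_counts` and `saving_curve` are made up, and the greedy left-to-right replacement is my assumption, not necessarily how the uploaded charts were produced.

```python
from collections import Counter


def digram_counts(text):
    """Count every pair of adjacent letters (pairs touching a space are skipped)."""
    return Counter(
        text[i:i + 2]
        for i in range(len(text) - 1)
        if text[i:i + 2].isalpha()
    )


def saving_curve(text, max_new_chars):
    """Fraction of characters saved when the k most frequent pairs
    each collapse into one symbol, for k = 1..max_new_chars."""
    ranked = [pair for pair, _ in digram_counts(text).most_common(max_new_chars)]
    curve = []
    for k in range(1, len(ranked) + 1):
        chosen = set(ranked[:k])
        i = 0
        saved = 0
        while i < len(text) - 1:
            if text[i:i + 2] in chosen:
                saved += 1  # two letters become one symbol
                i += 2
            else:
                i += 1
        curve.append(saved / len(text))
    return curve


curve = saving_curve("the cat and the hat", 3)
```

Measuring on the actual text matters because pairs overlap: in the toy sentence, adding "he" after "th" saves nothing extra, since every "he" sits inside a "the" whose "th" is already replaced. Simply summing the pair frequencies would overstate the saving.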

Well… what about the 3rd order? The frequency of 3rd-order combinations is so small that even their higher gain per replacement (2/3, compared to 1/2 for the 2nd order) is in most cases not worth the definition of new characters:


For example, writing every “and” as a single character (e.g. &) gives only a 0.47% reduction, while “an” gives 0.7%. Their combination does not perform well either… So in this case, if we define “an”, it is not efficient to use &.
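The interaction between “an” and “and” can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the percentages above mean characters saved divided by total characters, and the helper `saving` is made up for illustration.

```python
def saving(rate, chars_before, chars_after=1):
    """Share of the text saved when a sequence that starts at a fraction
    `rate` of all character positions shrinks from `chars_before`
    characters to `chars_after`."""
    return rate * (chars_before - chars_after)


# Back out occurrence rates from the quoted figures (assumption: each figure
# is characters saved / total characters).
and_rate = 0.0047 / (3 - 1)  # "and" -> "&" saves 2 characters per occurrence
an_rate = 0.0070 / (2 - 1)   # "an" -> one symbol saves 1 character per occurrence

# Once "an" has its own symbol, "and" is already written with 2 characters,
# so "&" saves only 1 more character per occurrence.
extra_from_ampersand = saving(and_rate, chars_before=2)
```

Under that assumption, adding & on top of an “an” character saves only about 0.235% more, half of what & gives on its own.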


The impact of such a transition might be huge. Speed of reading, transmission of data, size of printed material, information density & speed of printing would all improve.


To support the behavioral change, another thing has to be done: centralization of those characters. Historically, “@” & “&” have required the more complex operation of holding the “Shift” key while typing. That adds an instant barrier to frequent usage & marks them as secondary. Later, on touch screens, those characters moved to a secondary keyboard view & so became even more distant.

Indeed, changing physical keyboards would be complex, but touchscreen products could adopt it pretty seamlessly.

Nice? 🙂