Understanding Unicode

Unicode is a computing standard for encoding characters. The name reflects three goals of the standard: to be universal, uniform and unique (Source: Summary). In practice, Unicode aims to assign a unique number, called a code point, to every character in every writing system.

Computers store code points as one or more bytes. A byte is a unit of storage made up of eight bits, and a bit is the smallest unit of data in a computer. Character encoding maps characters to bytes for storage, and decoding converts those bytes back into the characters to be displayed. Encoding is therefore essential to the readability of a text: if bytes are interpreted with the wrong mapping, the characters are displayed incorrectly (Source: Character encodings for beginners).
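The round trip between a character, its code point and its bytes can be illustrated with a minimal Python sketch. The choice of 'é' and of Latin-1 as the mismatched encoding are assumptions made only for illustration:

```python
# A single character and its Unicode code point.
text = "\u00e9"                      # the letter e with an acute accent
code_point = ord(text)               # 233, written U+00E9 in Unicode notation

# Encoding: character -> bytes (here using UTF-8).
encoded = text.encode("utf-8")       # two bytes: 0xC3 0xA9

# Decoding with the same encoding restores the original character.
decoded = encoded.decode("utf-8")

# Decoding with a different encoding (Latin-1) produces the wrong characters.
wrong = encoded.decode("latin-1")

print(f"U+{code_point:04X}", encoded, decoded, wrong)
```

Reading the same two bytes as Latin-1 yields two unrelated characters instead of one, which is the kind of garbled text that appears when the encoding of a file is guessed incorrectly.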

The characters in the Unicode Standard can be encoded with any of three encoding forms: UTF-8, UTF-16 or UTF-32. They differ in the size of their code units: UTF-8 uses 8-bit units, UTF-16 uses 16-bit units and UTF-32 uses 32-bit units. UTF-8 and UTF-16 are variable-width (a code point takes one to four units in UTF-8 and one or two units in UTF-16), while UTF-32 always uses a single unit. All three can encode every character in the Unicode Standard, but they tend to be used in different contexts: UTF-8 is the most common on the web, UTF-16 is used by Java and Windows, and UTF-8 and UTF-32 are both used by Linux and Unix systems (Sources: FAQ – UTF-8, UTF-16, UTF-32 & BOM, The Unicode Standard, Version 11.0: 2.5 Encoding Forms).
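As a rough illustration of how the three encoding forms differ in size, the following Python sketch counts the bytes each one needs for a few sample characters. The helper name encoded_lengths and the sample characters are hypothetical choices for this example; the little-endian codec variants are used so that no byte order mark is added to the count:

```python
def encoded_lengths(ch):
    """Return the number of bytes needed for ch in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants fix the byte order and omit the byte order mark.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

samples = [
    "A",            # U+0041, a basic Latin letter
    "\u00e9",       # U+00E9, e with an acute accent
    "\u0f56",       # U+0F56, a Tibetan letter
    "\U0001F600",   # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 stays at four bytes throughout, UTF-8 grows from one byte to four, and UTF-16 needs four bytes (two 16-bit units) only for the last sample.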

Unicode’s predecessor, ASCII, contained only 128 characters and was designed around English. It could not encode characters from other scripts (i.e. other writing systems), nor even every character used in Western European languages, such as ‘é’. This made it useful mainly for texts written only in English (Source: BBC Bitesize – GCSE Computer Science – Hexadecimal and character sets – Revision 5).
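The limitation of ASCII can be checked directly. This is a minimal Python sketch, with fits_in_ascii as a hypothetical helper name, that tests whether a piece of text can be represented in ASCII at all:

```python
def fits_in_ascii(text):
    """Return True if every character in text has an ASCII encoding."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised when a character lies outside the 128 ASCII characters.
        return False

print(fits_in_ascii("cafe"))        # plain English letters only
print(fits_in_ascii("caf\u00e9"))   # ends with e with an acute accent
```

The first word fits in ASCII, while the second fails because of a single accented letter.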

Although Unicode strives to be universal, some scripts remain unsupported. The Script Encoding Initiative (SEI) at UC Berkeley’s Department of Linguistics prepares formal proposals for encoding scripts and script elements that are not yet supported (Source: script encoding initiative). This work shows how closely Unicode, as a computing standard, is connected to the fields of languages and linguistics.

In the following video, Lobsang Monlam speaks about why and how he created a series of Unicode fonts for the Tibetan language:
