The Chinese, Japanese and Korean (CJK) languages include pictographic characters numbering in the tens of thousands. Storing these in a digital format requires more than a single byte, because eight bits can represent only 256 distinct values. As a result, multibyte encoding schemes were created. These included language-specific implementations such as Shift-JIS for Japanese and Big-5 for Chinese, as well as more recent, unified standards such as UTF-8 and UTF-16. These operate by mapping each CJK character to two (or more) bytes. This is crucial when thinking about encoding standards. If your system can’t auto-detect the encoding, it may interpret the bytes of a single character as separate characters.
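To make the byte counts concrete, here is a minimal Python sketch. The sample string and the choice of codecs are ours, picked purely for illustration; it shows the same three characters taking a different number of bytes under each encoding, and what happens when the bytes are decoded with the wrong one.

```python
# The same three characters occupy a different number of bytes per encoding.
text = "日本語"  # sample string chosen for illustration ("Japanese language")

for codec in ("utf-8", "utf-16-le", "shift_jis", "big5"):
    data = text.encode(codec)
    print(f"{codec:10} {len(data)} bytes  {data.hex()}")

# Decoding with the wrong codec treats every byte as its own character.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(len(text), "characters became", len(mojibake), "characters")
```

Latin-1 is used for the wrong guess because it maps every possible byte to a character, so the decode never fails; it just quietly produces garbage.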
As mentioned at the outset of this article, locale is more than just a fancy word for location. In tech terms it encompasses not only the base language, but some of the other aspects of the region’s culture that have an impact on that language.
Using an IME takes some getting used to, but once you understand what it’s doing, it becomes second nature. While each one caters to the unique needs of the language it supports, understanding one can help get you started with the one you need for your locale.
Linux is the result of the work of thousands of developers worldwide. It’s only right that it has support for the languages all those volunteers speak, to say nothing of its users. The multilingual question is how open source software supports not just translations for all the different languages of the world, but also their varied alphabets and notations. How do you represent decimal points or thousands separators, for instance?
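As a rough illustration of the separator question, the sketch below asks the C library what a given locale uses, via Python’s standard locale module. The helper name separators is our own, and the locale names in the loop are only examples: any that aren’t installed on your system simply report None.

```python
import locale

def separators(locale_name):
    """Return (decimal_point, thousands_sep) for a locale, or None if unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # remember the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        conv = locale.localeconv()
        return conv["decimal_point"], conv["thousands_sep"]
    except locale.Error:
        return None  # locale not generated/installed on this machine
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # always restore

# Example names only; which ones exist depends on your system.
for name in ("C", "en_US.UTF-8", "de_DE.UTF-8"):
    print(name, separators(name))
```

The plain C locale is always present and uses a full stop with no thousands separator, which makes it a handy baseline for comparison.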
Even if you’re not someone who interacts with CJK languages, being aware of multi-byte characters is still important. Newer updates to Unicode include entries such as emojis as part of the standard. So make sure to keep multi-byte characters in mind, lest your smiley face become a hospital emoji and cause a panic.
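A short Python sketch shows why this matters. The function name truncate_utf8 and the sample strings are our own illustration, not part of any library: a single emoji takes four bytes in UTF-8, two different emojis can share their leading bytes, and chopping a byte string at an arbitrary point can leave half a character behind.

```python
smiley = "😀"    # U+1F600
hospital = "🏥"  # U+1F3E5

print(smiley.encode("utf-8").hex())    # four bytes for one character
print(hospital.encode("utf-8").hex())  # shares its first two bytes with the smiley

def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial sequence instead of raising
    return data.decode("utf-8", errors="ignore")

message = "ok 😀"
naive = message.encode("utf-8")[:5].decode("utf-8", errors="replace")
print(repr(naive))                      # contains a replacement character
print(repr(truncate_utf8(message, 5)))  # partial emoji dropped cleanly
```

Dropping the incomplete tail is only one possible policy; the point is to decide at the character level rather than the byte level.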
Exploring the locale files in /usr/share/i18n/locales will reveal a number of data points included, such as:
Typically, IMEs are servers that sit behind the scenes. To expand on the example in this article, the Mozc IME includes packages containing the service, data, and front-end support. The front-ends act as clients of Mozc: they send it data, which it evaluates before sending back matching text. In the case of Japanese, you’re sending a representation of the pronunciation of the word(s) you want, often termed the “romanization” of the text. Japanese has a relatively small number of syllabic sounds, so it’s easy to support their romanization. Mozc sends back a list of matching words written in hiragana (basic Japanese letters), katakana (an alternate alphabet reserved for borrowed words), and kanji (pictograph characters co-opted from Chinese).
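The round trip can be pictured with a toy Python sketch. To be clear, this is not how Mozc works internally: the tables KANA and CANDIDATES and both function names are hypothetical, holding just enough entries to run the example, and the greedy longest-match lookup is simply one easy way to turn romanized input into kana before looking up candidates.

```python
# Hypothetical, deliberately tiny romaji-to-hiragana table.
KANA = {"ni": "に", "ho": "ほ", "n": "ん", "go": "ご", "ka": "か", "na": "な"}

# Hypothetical candidate dictionary keyed by hiragana reading.
CANDIDATES = {"にほんご": ["日本語"], "かな": ["仮名", "カナ"]}

def to_hiragana(romaji):
    """Convert romanized input to hiragana using greedy longest match."""
    out = []
    i = 0
    while i < len(romaji):
        for size in (3, 2, 1):  # try the longest chunk first
            chunk = romaji[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}: {romaji[i:]!r}")
    return "".join(out)

def candidates(romaji):
    """Return the hiragana reading followed by any dictionary matches."""
    reading = to_hiragana(romaji)
    return [reading] + CANDIDATES.get(reading, [])

print(candidates("nihongo"))
print(candidates("kana"))
```

A real IME adds ranking, context and learning on top of its lookup, but the shape is the same: romanized text goes in, a list of possible renderings comes back.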
Translation work is a vital aspect of enabling the take-up of open source software and Linux-based distros around the world. While English comprehension is often a perceived requirement for dev work, the truth is much of the world doesn’t speak English, let alone read it. So knowing how FOSS handles translations, more often called localisation, is key to growing wider adoption. Being a good open source citizen means being supportive of all these brethren, so let’s explore the localisation features of Linux!
>> LC_ADDRESS: the format for addresses, such as whether street and number or state/province comes first, ZIP/Post code formats, etc.
While the input method is active, the Mozc palette appears, and as you type roman letters, potential matches appear in an overlay window. You use the Tab key (or arrows) to select the correct one, and then press Enter. This inserts the text into the active application.