The case for romanizing Singhala


The purpose of these pages is to demonstrate the best way to digitize Singhala and other Indian languages whose writing systems are consistent with vyAkaraNa. vyAkaraNa, unlike English grammar, begins by defining speech as a set of atomic sounds meaningful to the native speaker: the akSara. The concept of the phoneme is similar to that of the akSara, but an akSara has two well-defined aspects, varNa and rUpa, that is, sound and shape. varNa is the equivalent of the phoneme. Each akSara has a place in the hodiya according to its pronunciation. The hodiya of a language is its phoneme chart; it is not an alphabet, in which letters and phonemes have no direct relation. Each position in the hodiya has its own pronunciation rule. As such, we can first lay out the hodiya and then assign each position a shape. The Devanagari and Singhala hodi together illustrate this with differences we can easily understand. (The Sanskrit words in this paragraph are spelled according to the Harvard-Kyoto scheme as seen on the Cologne University web site.)

What is the goal in digitizing?
The attributes of an ideal digitized script would match those of English. Its basic text, with all formatting removed, should still be readable even after it is received on another device. Input should be easy and familiar. Basic text-editing features such as deleting the last entered item and search-and-replace should be available in the familiar manner. Its atomic linguistic units should be sortable according to the known sorting order of the language.

Questions to ponder: Can this system accommodate past texts? Would it be possible to scan and digitize old printed material with reasonable effort? Will it face problems in the future?

How letters happen
On the computer, everything is purely numbers carried in units called bytes. A byte can carry only a value between 0 and 255, that is, eight bits, a bit being a one or a zero. English letters, digits and punctuation were encoded with numbers 32 through 127. Americans encoded English first, and that encoding is universal across computers. Western Europe then became the next biggest computer market. Its languages have letters outside the English alphabet, and those were assigned codes 160 through 255. Numbers 0 to 31 and 128 to 159 were reserved for machine commands.
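To make the byte arithmetic concrete, here is a small Python sketch, illustrative only, of the code ranges described above:

```python
# A byte holds eight bits, so it can carry a value from 0 to 255.
assert 2 ** 8 == 256

# Printable English (ASCII) characters occupy codes 32 through 127.
print(ord('t'))           # 116 -- the number stored for the letter 't'
print(chr(65), chr(122))  # A z -- codes 65 and 122

# Codes 0-31 are machine commands, e.g. 10 is the line feed.
print(repr(chr(10)))      # '\n'

# Western European letters were assigned codes 160 through 255.
print(chr(233))           # é -- code 233 in the Latin-1 arrangement
```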

Computers today have a Last Resort font that is programmed with shapes for the Latin letters described above. This does not prevent font designers from assigning other letter shapes to these codes, either as different typeface designs of the Latin letters or as letters of entirely different languages. Before the Unicode specification, many languages used these codes to represent their own letters. Some of these standards are registered as Code Pages and are honored as the default letter shapes when specified as such.

The keys on the keyboard have their own code set called scan codes. A keyboard layout driver program (one for each language or variation) translates the scan codes into the numeric code of the letter we intend to type. For instance, when we type the letter 't' on the English keyboard, the computer gets the code 116 in binary form. Had we paired the scan code of the 't' key with the code for the letter Thorn (þ), the computer would receive 254 instead. Fonts have shapes, or glyphs, drawn for each numeric code. The program we are using at the time shows the letter associated with the key we typed, using a font we chose (or, by default, the fallback font of the computer). Sometimes we obtain the Glyph-not-found shape, surprisingly, even when the font has a specific glyph for that numeric code. (A special case we observed: if the charset is given as UTF-8 but the page includes single-byte letters from the Latin-1 Supplement code block, those letters break; such a page must give ISO-8859-1 or Windows-1252 as its charset. The web behaves in mysterious ways.)
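As a rough illustration of the letter codes and of the charset mismatch just described, consider this Python sketch (the Thorn pairing is hypothetical, as above):

```python
# The 't' key reaches the program as code 116; a layout that paired that
# key with Thorn would deliver 254 instead (Latin-1, Unicode U+00FE).
print(ord('t'))   # 116
print(ord('þ'))   # 254

# The charset-mismatch surprise: a Thorn stored as a single Latin-1 byte
# is not valid UTF-8, so a page declared as UTF-8 cannot display it.
latin1_byte = 'þ'.encode('latin-1')                   # b'\xfe'
print(latin1_byte.decode('utf-8', errors='replace'))  # replacement character,
                                                      # the familiar "missing glyph" box
```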

Unicode specification, Code Pages and Unicode Sinhala
The Unicode Character Database is a specification for the computer industry that tries to standardize numeric codes for letters, punctuation, spacing commands and so on. The idea is simple: it first takes the single-byte codes defined above and then adds codes defined by multiple bytes. As the industry settled down, we now have the Unicode standard applied to the items of the Single-Byte Character Set (SBCS) as readjusted by Microsoft's ANSI character set, known on the web as the Windows-1252 code page. It replaces code positions 128 through 159 with additional European letters and punctuation marks. The Double-Byte Character Set (DBCS) is used alternatively with some Code Pages that were defined for scripts and languages such as Cyrillic, Hebrew, Arabic and Greek; these Code Pages redefine single-byte code positions for the shapes of their respective letters. The Chinese-Japanese-Korean-Vietnamese (CJKV) scripts have standards competing with the Unicode specification, so those languages have multiple ways of being digitized.
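A short Python sketch, illustrative only, of how the Windows-1252 readjustment of positions 128 through 159 plays out against the plain single-byte interpretation:

```python
# Byte 147 (0x93) is a control code in ISO-8859-1 and in plain Unicode,
# but Windows-1252 reassigns it (and most of 128-159) to printable marks.
data = bytes([147, 72, 105, 148])      # 0x93 'H' 'i' 0x94

print(data.decode('cp1252'))           # “Hi”  -- curly quotation marks
print(repr(data.decode('latin-1')))    # '\x93Hi\x94' -- invisible control codes
```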

Singhala has its code block assigned out of the DBCS, popularly called Unicode Sinhala. Unicode Sinhala was not defined with vyAkaraNa taken into consideration. Instead, it is based on the Indian ISCII model and the Abugida theory of Indic scripts. This already broken specification is aggravated further by the SLS1134 standard, which mauls Singhala. It is so bad that it is beyond repair. The Unicode Consortium and the Lankan government both have to agree to any changes. Though Unicode is open to reasoning, it is inconceivable that the Lankan bureaucracy would agree to any change, as they have too much self-interest riding on it.

Sending information over the Internet
Moving information from one digital device to another poses some constraints. When information is passed from one computer to another, it travels as a string of bytes. Among these bytes are communication commands that are necessary for the devices to understand what to do with the information received. These commands are numeric codes just like the codes of letters. When sending single-byte text data, the codes of letters and commands are clearly different, and the text can be sent in raw form. However, multi-byte data needs to be re-encoded to ensure that it does not include bytes that look like commands. For example, the code for Singhala ayanna is stored in two bytes as 13 and 133. 13 by itself means 'carriage return' (as on the typewriter). 133 means 'next line' in Unicode and the ellipsis shape in the Windows-1252 code page. (Windows-1252 is the default character set for HTML5, the latest web page standard.)
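The ayanna example can be checked directly; the following Python sketch, illustrative only, shows the two code-point bytes and how UTF-8 re-encodes them:

```python
ayanna = 'අ'                     # Singhala ayanna, U+0D85
print(divmod(ord(ayanna), 256))  # (13, 133) -- the two bytes of its code

# Sent raw, those bytes look like commands or other characters:
print(repr(bytes([13]).decode('latin-1')))  # '\r'  -- carriage return
print(bytes([133]).decode('cp1252'))        # '…'   -- ellipsis in Windows-1252

# UTF-8 re-encodes the letter so that none of its bytes falls in the
# ASCII command range 0-31 (a stray carriage return can no longer appear):
print(ayanna.encode('utf-8'))               # b'\xe0\xb6\x85'
```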

Unicode Sinhala -- a betrayal of the user
The government of Sri Lanka went about digitizing Singhala in a wrong-headed way. Instead of seeking advice from users and experts of the language, it set up ICTA to implement the advice of the World Bank, lured by a $50 million loan guarantee. ICTA is made up of academics famous for their arrogance and conceit. Blocking public participation, they steamrolled through the American solution for Singhala, which (wrongly) identifies base letter shapes, includes diacritics in the code set, and imposes a font-making method specific to Singhala. Some of the stipulated letter-to-sound relations are clearly wrong, and the collation method is completely new and violates the hodiya. The input is complex, as evidenced by the multiplicity of methods used for it: new physical keyboards, Wijesekera, Phonetic, Singlish and Google transliteration. The cost of upholding Unicode Singhala is scandalous.

Romanized Singhala is the uncompromised solution
What we show here is the transcribing method. We first transcribe the hodiya into a well-defined Latin alphabet; that is, we map each letter of the hodiya to the Latin script. For each position of the hodiya, we take its defined pronunciation into consideration and then find the Latin letter we would normally use to Anglicize it. (Anglicizing is the process we use to write Singhala names inside English text.) We avoid the ambiguities of Anglicizing by finding alternatives. Here is how we romanized Singhala; a small illustrative sketch follows the table. (Click the button at the bottom to see rendering in Singhala):

mizra síhala hoodiya
a aa æ ææ i ii u uu
û ûû ô ôô
e ee ai o oo au
á í ú é ó
ä ï ü ë ö
k kh g gh ñ G
c ch j jh ç
t th d dh N D
þ þh ð ðh n Ð
p ph b bh m B
y r l v
z x s h
L f
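As a minimal, incomplete sketch of this transcribing idea, the following Python fragment pairs a handful of hodiya positions with Latin letters read off the table above; only a few positions are shown, the pairings are illustrative rather than the complete scheme, and vowel signs and the inherent vowel that a real transcriber must handle are ignored:

```python
# A few illustrative pairings read off the table above (not the full scheme).
HODIYA_TO_LATIN = {
    'අ': 'a', 'ආ': 'aa', 'ඇ': 'æ',   # first vowel positions
    'ක': 'k', 'ග': 'g',              # from the velar row
    'ත': 'þ', 'ද': 'ð', 'න': 'n',    # from the dental row
    'ප': 'p', 'ම': 'm',              # from the labial row
}

def transcribe(text):
    """Replace each mapped Singhala letter with its Latin counterpart,
    leaving everything else (including vowel signs) untouched."""
    return ''.join(HODIYA_TO_LATIN.get(ch, ch) for ch in text)

print(transcribe('ත'))   # þ, under the assumed pairing
```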

We can learn to read romanized Singhala (RS) in a few minutes. We can type RS using the extended English keyboard that comes with every computer. There is also a rapid Singhala keyboard layout we have developed that closely follows the English key layout. We have tested this system since 2004 by pairing it with an orthographic smartfont. Our font is the world's first smartfont made for any written language; the next one, created in 2008, was made for Fraktur (blackletter). A smartfont dynamically changes the shapes of letters according to the orthography programmed into it. Our font is only a proof of concept and therefore does not implement a true orthography. Singhala script has three orthographies: Singhala (no ligatures), Sanskrit (includes ligatures) and Pali (has ligatures, no hal (consonant) akuru (letters) except at the halanta (ending consonant) position, and uses the touch-letter style in medial consonant clusters that do not combine into ligatures).

Romanized Singhala fully covers all three languages. It never deforms into garbage or into the Glyph-not-found character, because we were careful to select only single-byte coded letters for the hodiya. As single-byte letters are supported everywhere, Singhala remains readable even where there is no Singhala smartfont. Implementation of smartfont technology is now nearly universal, with notable slow adopters: Internet Explorer (severe) and Google Chrome (negligible faults).
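The single-byte claim is easy to verify; here is a small Python check over the letters gathered from the table above:

```python
# Every letter used in the romanized hodiya encodes to exactly one byte
# in Latin-1 (and therefore in Windows-1252 as well).
hodiya_letters = "aæiuûôeoáíúéóäïüëö" "kgñGcjçtdNDþðnÐpbmByrlvzxshLf"

assert all(len(ch.encode('latin-1')) == 1 for ch in hodiya_letters)
print("every romanized letter is a single byte")
```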

The pages here demonstrate how pages that violate the HTML standard can be quickly converted into HTML5-compliant web pages.