<html> <head> <meta http-equiv="Content-Language" content="en-ca"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>Unicode Issues</title> <link rel="stylesheet" type="text/css" href="../stylesheets/rheolaidd.css"> </head> <body><p> </p> <SCRIPT language=JavaScript> /* Milonic DHTML Website Navigation Menu. Written by Andy Woolley - Copyright 2003 (c) Milonic Solutions Limited. All Rights Reserved. Please visit http://www.milonic.co.uk/ for more information. */ </SCRIPT> <SCRIPT language=JavaScript src="../javamenuf/menugood.js" type=text/javascript></SCRIPT> <SCRIPT language=JavaScript src="../javamenuf/mmenu.js" type=text/javascript></SCRIPT> <h1>Unicode Issues</h1> <table border="0" cellpadding="10" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="60%" id="AutoNumber5"> <tr> <td width="100%"> <dl> <dt><a name="unicode ansi ascii"></a>What are Unicode, ASCII, and ANSI?</dt> <dd> <p>Unicode is a map, a chart of (what will one day be) all of the characters, letters, symbols, punctuation marks, etc. necessary for writing all of the world’s languages, past and present.</p> <dd> <p>If you have ever tried typing in a non-English language that uses the Roman alphabet (e.g. French), you may recall memorising a set of numbers that stand for accented letters, so that alt-130 is é and alt-160 is á. This numbering system is commonly called ASCII, and has been with us since the DOS days. (Strictly speaking, ASCII proper covers only 128 characters and has no accented letters at all; the alt-codes come from an 8-bit extension of it.) A newer mapping, commonly called ANSI, expanded on ASCII, giving us capital versions of the accented letters (Á É) and a few extra glyphs like Icelandic ð and þ. ANSI allows one to write in most of the major western European languages (but not all: the Welsh letters ŵ ŷ are not available in the ANSI mapping).</p> <dd> <p>ASCII and ANSI are pretty good as long as you are western European. These two mappings are extremely limited in that they may only code (i.e. 
assign a number to) 256 letters, so that there is no space to include other glyphs from other languages.</p> <dd> <p>Unicode fixes this problem by providing enough space for over a million different symbols. As in the above two systems, each character is given a number, so that Russian Я is 042F, and the Korean won symbol ₩ is 20A9. (Note that all Unicode numbers are hexadecimal, meaning that one counts by 16’s rather than 10’s. This is not a problem, as users really don’t need to know the mapping numbers anyway.) So, although not yet totally comprehensive, Unicode covers most of the world’s writing systems. Most importantly, the mapping is consistent, so that any user anywhere on any computer has the same encoding as everyone else, no matter what font is being used.</p> <dt><a name="computers using Unicode"></a>Which computers can use Unicode? <dd> <p>As far as Windows goes, only NT, 2000, and XP take advantage of Unicode. In these operating systems, it is possible to read, type, print, etc. using Unicode mappings, provided of course that you have the appropriate font and keyboard drivers. With the other versions of Windows (95, 98, Me), typing in Unicode is not really possible (<a target="_blank" href="http://www.alanwood.net/unicode/">Alan Wood</a> has a list of which software is Unicode-friendly, along with much more about Unicode). Windows 95, 98, and Me do allow users to <u>view</u> Unicode, though, with up-to-date <a target="_blank" href="http://channels.netscape.com/ns/browsers/download.jsp">Netscape</a> or Explorer versions. Unicode also works on recent Mac operating systems.</p> <dt><a name="unicode doesn’t do"></a>What doesn’t Unicode do? <dd> <p>Remember, Unicode is not a font, but basically a coding system, where symbols are given numbers. Most fonts only carry a fraction of the full Unicode inventory, so that the fonts on this site have glyphs for Syllabics, but not for Chinese or Malayalam. 
So if you try to read a Chinese site without a Unicode font that includes Chinese, it’s not going to work.</p> <dd> <ul> <li> <p>Unicode does not have different code numbers for different versions of the same letter. For example, the regular <b>g</b> and the italic <b><i>g</i></b> have the same number (0067). This makes sense because a single letter may have many different shapes, depending on size, language, style, artistic design, and so on. (<a target="_blank" href="opentype.html">OpenType</a> fonts handle different styles very well.)</p> <li> <p><a name="accentplacement"></a>Unicode is inconsistent with regard to which symbols get unique codes and which do not. All of the accented letters of the European languages have their own code (Ő is 0150), but Native American symbols, like Guaraní g̃ or Dene Ų̀, have to be made up from two codes: g̃, for example, is 0067 (g) plus 0303 (combining ~). Notice that the ~ and ` accents may not be placed very well, depending on the font, which makes for an unattractive look (<a target="_blank" href="opentype.html">OpenType</a> fonts can fix this).</p> <li> <p>Unicode is not complete. Several languages have yet to be encoded, and others are not going to be, like many fictional constructed languages. The natural languages which are not yet encoded will be added presently, as research determines which glyphs are necessary. However, and this is especially relevant to this site, some Syllabics symbols are lacking. I am continually submitting reports to Unicode asking it to include these neglected characters.</p> </ul> <dt><a name="missing characters"></a>My language uses digraphs (made up of two letters) and accented letters to represent one sound. Unicode does not have these unique characters. <dd> <p>This situation is common in most languages (English being an exception). For example, Kaska <b>ts’</b>, Haisla <b>x̄°</b>, and Cahuilla <b>kʷ</b> are separate letters of their respective alphabets. 
In Cahuilla <b>kʷ</b> is not sorted with <b>k</b>, and Kaska <b>ts’</b> is not considered three letters (<b>t</b>+<b>s</b>+<b>’</b>).</p> <dd> <p>Although <b>x̄°</b> is one concept in Haisla, it is not treated as one character by Unicode. Many people have been unhappy with this fact, especially database programmers who are setting up Native language dictionaries and other materials which require correct sorting. However, there is really no reason to complain. Computers do not understand “letters” as letters anyway; what we think of as lowercase <b>a</b> is to your computer 0061, and <b>ñ</b> is 00F1. The computer does not process language the same way humans do, so it doesn’t matter at all to your machine whether <b>x̄°</b> is one character or three (0078 0304 00B0). The number of characters required to make up one “letter” has no impact on sorting either, as sorting is handled by language specifications in the operating system (Windows, Mac OS, Linux, etc.), not in the font. Even if <b>x̄°</b> had its own Unicode number, it still would not sort properly without Haisla language support. Finally, although Kaska <b>ts’</b> is one sound, the orthography treats it as several graphical letters anyway. When a word like <b>ts’á’</b> ‘plate’ is capitalised at the beginning of a sentence, the result is <b>Ts’á’</b>, and if the word is in all caps, it appears as <b>TS’Á’</b>. If <b>ts’</b> were a single letter, we would expect to see <b>*TS’á’</b> or <b>*Ts’Á’</b> (the *asterisk indicates that these examples are incorrect). Each language has its own orthographical rules and traditions. By not encoding every accented letter and digraph, and allowing characters to combine easily into single orthographical concepts, Unicode is able to fit neatly into all of the world’s languages.</p> <dt>I want to type documents in my Native language; what does this mean to me? 
<dd> <h4><a name="Unicode and syllabics"></a>Syllabics</h4> <p>There are a few Syllabics fonts out there which follow Unicode. There are several more (including the <a target="_blank" href="http://www.nunatsiaq.com/">Nunatsiaq News</a> website font) which are not Unicode. The fonts on this website are Unicode.</p> <p>If you use a non-Unicode font, forget about other people being able to read your documents on their computers unless they download and install exactly the same font as you used. Also, forget about changing to a font of a different style, or using the document in the future when your current non-Unicode font is obsolete. The font will be no good for emails, messaging, or networks. There are obvious advantages to using Unicode fonts! Yet if you just want to print out a document to give a hard copy to someone, you don’t need Unicode fonts at all.</p> <p>At present, Syllabics users are stuck between two worlds. As the Unicode range for Syllabics is not complete, some people need to go beyond Unicode. This means that if you use a font and keyboard from this site, there is a slight chance that you will have typed characters that have not been encoded in Unicode, in which case other users will need the same font you used. Once these software shortcomings have been rectified, I will update everything on this site to the accepted standards. All of the extra letters have been placed in the “Private Use Area” (PUA), an area designed for individuals to place non-Unicode characters which will not interfere with other languages already encoded. Unfortunately, many older Microsoft products think that the PUA is Chinese, but this shouldn’t be a problem for most users. If it is, please email me and I will help.</p> <p>Writers/viewers of most dialects of Cree, Oji-Cree, Inuktitut and Naskapi are fortunate as Unicode has included all of their syllabics, so these languages do not need the special characters found only in this font. 
The most notable exception is Moose Cree, where the majority of the y-final top-rings are absent. If you view a Cree site written in Unicode syllabics (in a different font), all of the characters should appear properly in the languagegeek.com fonts. The most pressing problem is that in some languages such as Dene, there is a big difference between the finals at the middle of the line and those at the top. Unicode cannot differentiate between these two finals. Also from Dene, a group of dotted syllabics is completely absent. Dene (Beaver), Dene (Carrier) and Blackfoot are also missing a few characters each, and Northern Ojibway finals are completely neglected.</p> <h4><a name="cherokee and unicode"></a>Cherokee</h4> <p>The characters used in the modern Cherokee language have all been encoded in Unicode. The original script developed by Sequoyah included some additional symbols which have been abandoned. The Cherokee numerals are also not encoded.</p> <p>There are several non-standard encodings which have been developed over the years. Most notably, the Cherokee Nation currently uses a non-Unicode font.</p> <h4><a name="Unicode and roman orthography"></a>Roman Orthography</h4> <p>There has been a trend in recent years to develop Roman Orthographies for Native languages that contain neither diacritics (accents) nor special symbols not found in English. From a purely practical point of view, this makes a lot of sense, as speakers can use any font on any computer in a myriad of styles. Yet there is a good reason why linguists and others have created orthographies using accents and special symbols: the Roman alphabet has 26 letters, while most languages have a lot more than 26 sounds. <a target="_blank" href="diacritics_and_digraphs.html">Ojibway</a> is a good example of a language with Roman orthographies both with and without diacritics. 
Aesthetically, and keep in mind this is coming from a font designer and language-geek, I find it endlessly interesting and important that each language <i>look</i> unique, and an unusual orthography is a great form of ethnic self-identity. What would French be without ç, ê, or à, Spanish without ñ, or German without ä and ß? North American Native languages have their own collection of symbols, like Cheyenne ringed vowels (å etc.), the Oneida upside-down "v" (ʌ), Athabaskan nasal vowels (ą ę į ų), and the Pacific coast schwas, raised consonants and ejectives (ə k<sup>w</sup> q̓), plus much more.</p> <p>For those North American languages which still use interesting characters, it is important to know how widely available these letters are. Some are found in many Unicode fonts (such as the Hawaiian long vowels ā ē ī ō ū). Other characters are found in fonts more rarely, but are still Unicoded (e.g. ə). There is a large group of symbols which have not been given their own codes in the Unicode charts, and must be made up of two or more separate symbols (e.g. Ų̀, which is Ų (0172) plus combining grave ` (0300)). Finally, some Native letters are completely absent, and cannot easily and attractively be made by combining existing glyphs. 
<a target="_blank" href="opentype.html">OpenType</a> takes care of some of these issues, but in other cases, new characters should be proposed to Unicode.</p> </dl> </td> </tr> </table> <table border="0" cellpadding="10" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%" dir="ltr" id="AutoNumber3" align="left" height="27"> <tr> <td width="50%" height="7"> <p><a href="JavaScript:history.back(1)"> <img border="0" src="../new_images/arrowmarbleleft.gif" width="20" height="20" align="left" hspace="10">Previous Page</a></p></td> <td width="50%" height="7"> <p>Last Update: <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%B %d, %Y" startspan -->May 19, 2005<!--webbot bot="Timestamp" i-checksum="11293" endspan --></p></td> </tr> </table> <!-- Start of StatCounter Code --> <script type="text/javascript" language="javascript"> var sc_project=614325; var sc_partition=3; var sc_security="d707073b"; var sc_invisible=1; </script> <script type="text/javascript" language="javascript" src="http://www.statcounter.com/counter/counter.js"></script><noscript><a href="http://www.statcounter.com/" target="_blank"><img src="http://c4.statcounter.com/counter.php?sc_project=614325&java=0&security=d707073b&invisible=1" alt="hit counter html code" border="0"></a> </noscript> <!-- End of StatCounter Code --> </body> </html>