<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>Is Unicode ready for you?</title> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta name="keywords" content="unicode, convert, keyboard, mapping, input, encode, legacy, custom"> <link rel="stylesheet" href="/cms/assets/misc/css/default.css" type="text/css"> <link rel="stylesheet" href="/cms/sites/nrsi/themes/default/_css/default.css" type="text/css"> <style type="text/css"> <!-- A.GlobalNavLink, A.GlobalNavLink:visited { color: #FFFF00; font-size: smaller; font-weight: bold; } --> </style> <!-- 2023-05-25 PKM Added for Google Analytics 4 --> <!-- Google tag (gtag.js) --> <script async src="https://www.googletagmanager.com/gtag/js?id=G-FVXRGR2Q9V"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-FVXRGR2Q9V'); </script> </head> <body style="padding:0; margin:0"> <style> .archive_notice { /* box-shadow: black 0pt 4pt 20px -8px inset; */ display: block; background-color: orange; font-size: 12pt; font-style: normal; font-weight: lighter; line-height: 100%; padding: 5pt; text-align: center; width: auto; } form { display: none } .webform::before { content: "Forms are disabled on this static version of the site."; display: block; width: fit-content; } </style> <div class="archive_notice"> This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. 
Please consult our other sites for more current information: <a href="https://software.sil.org">software.sil.org</a>, <a href="https://scriptsource.org">ScriptSource</a>, <a href="https://silnrsi.github.io/FDBP/">FDBP</a>, and <a href="https://silnrsi.github.io/silfontdev/">silfontdev</a> </div> <table width="100%" height="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td style="background: #0068a6; padding-left:20; padding-top:10; white-space:nowrap;" width="110" valign="top"> <p><a href="http://www.sil.org/"> <!-- <img src="/cms/sites/nrsi/themes/default/_media/SIL_logo_left_column.gif" width="86" height="80" border="0"> --> <img src="/cms/sites/nrsi/themes/default/_media/SIL_Logo_TM_Blue_2014.png" width="85" height="95" border="0" alt=""> </a><br><br></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dhome%26site_id%3Dnrsi.html">Home</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dcontactus%26site_id%3Dnrsi.html">Contact Us</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dgeneral%26site_id%3Dnrsi.html">General</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dbabel%26site_id%3Dnrsi.html">Initiative B@bel</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dwsi_guidelines%26site_id%3Dnrsi.html">WSI Guidelines</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dencoding%26site_id%3Dnrsi.html">Encoding</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dencodingprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dunicode%26site_id%3Dnrsi.html">Unicode</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodetraining%26site_id%3Dnrsi.html">Training</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodetutorials%26site_id%3Dnrsi.html">Tutorials</a></p> <p class="Cat3"><a 
class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodepua%26site_id%3Dnrsi.html">PUA</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dconversion%26site_id%3Dnrsi.html">Conversion</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dencconvres%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dconversionutilities%26site_id%3Dnrsi.html">Utilities</a></p> <p class="Cat4"><a class="Cat4" href="/cms/scripts/page.php%3Fid%3Dteckit%26site_id%3Dnrsi.html">TECkit</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dconversionmaps%26site_id%3Dnrsi.html">Maps</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dencodingresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dinput%26site_id%3Dnrsi.html">Input</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinpututilities%26site_id%3Dnrsi.html">Utilities</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputtutorials%26site_id%3Dnrsi.html">Tutorials</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dtypedesign%26site_id%3Dnrsi.html">Type Design</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dtypedesignprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dfontdesigntools%26site_id%3Dnrsi.html">Design Tools</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dfontformats%26site_id%3Dnrsi.html">Formats</a></p> <p class="Cat2"><a class="Cat2" 
href="/cms/scripts/page.php%3Fid%3Dtypedesignresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloads%26site_id%3Dnrsi.html">Font Downloads</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsgentium%26site_id%3Dnrsi.html">Gentium</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsdoulos%26site_id%3Dnrsi.html">Doulos</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsipa%26site_id%3Dnrsi.html">IPA</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Drendering%26site_id%3Dnrsi.html">Rendering</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingtechnologies%26site_id%3Dnrsi.html">Technologies</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Drenderingopentype%26site_id%3Dnrsi.html">OpenType</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Drenderinggraphite%26site_id%3Dnrsi.html">Graphite</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontfaq%26site_id%3Dnrsi.html">Font FAQ</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dlinks%26site_id%3Dnrsi.html">Links</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dglossary%26site_id%3Dnrsi.html">Glossary</a></p> <br> </td> <td valign="top" style="padding:0" xwidth="650"> <div style="background: #6699CC url(/cms/sites/nrsi/themes/default/_media/home_banner_gradient.gif) no-repeat right; padding:0 0 0 25; height:36px; margin:0; color:#FFFFFF;"> <p style="font-family:Times New Roman; font-size:25px; color:#FFFFFF; 
padding:10 0 0 0; margin:0 0 0 0">Computers & Writing Systems</p> </div> <div style="padding:0 0 0 0; background-color:#000000; color:#FFFFFF"> <table width='100%'> <tr> <td style="padding: 0 0 0 25px"><a class="GlobalNavLink" href="http://www.sil.org/">SIL HOME</a> | <a class="GlobalNavLink" href="https://software.sil.org/products/">SIL SOFTWARE</a> | <a class="GlobalNavLink" href="/support.html">SUPPORT</a> | <a class="GlobalNavLink" href="https://www.givedirect.org/donate/?cid=13536">DONATE</a> | <a class="GlobalNavLink" href="/privacy-policy.html">PRIVACY POLICY</a> </td> <td align='right' width='20%'> <script async src="https://cse.google.com/cse.js?cx=0760bf09a6bff4b0c"></script><style>.gsc-control-cse {padding: 0.6em; min-width: 10em; width: 18em; max-width: 20em} form.gsc-search-box {display: unset;}</style><div class="gcse-search"></div> </td> </tr> </table> </div> <div style="padding:0 25 25 25"> <p class='CategoryPath'>You are here: <a class='CategoryPath' href='/cms/scripts/page.php%3Fid%3Dencoding%26site_id%3Dnrsi.html'>Encoding</a> > <a class='CategoryPath' href='/cms/scripts/page.php%3Fid%3Dunicode%26site_id%3Dnrsi.html'>Unicode</a><br> Short URL: <a href='/utconvertq2.html'>https://scripts.sil.org/UTConvertQ2</a></p> <!-- --> <!-- <div class='Warning' > <p class='Warning_heading' > Site unavailability </p> <p> Due to essential repairs, this website may be unavailable at times during September 6 (Tue) and 7 (Wed). We apologize for the inconvenience. </p> </div> --> <p><span class='item_series'>When to Convert to Unicode</span></p><h1>Is Unicode ready for you? </h1> <p> <span class='author_date_hits'>Albert Bickford, Jim Brase and Lorna Priest, 2007-05-11</span></p><div class='Sidebar'><p><a href='/cms/scripts/page.php%3Fid%3Dutconvert2unicode%26site_id%3Dnrsi.html'>When to Convert to Unicode</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq1%26site_id%3Dnrsi.html'>What is Unicode? 
and Why do I need to use Unicode?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq2%26site_id%3Dnrsi.html'>Is Unicode ready for you?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq3%26site_id%3Dnrsi.html'>Are there fonts available that will work for you?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq4%26site_id%3Dnrsi.html'>Can you type all the characters you need?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq5%26site_id%3Dnrsi.html'>Does available Unicode software meet your needs?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq6%26site_id%3Dnrsi.html'>Is anyone requiring you to use Unicode?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq7%26site_id%3Dnrsi.html'>Is software going to force Unicode on you?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq8%26site_id%3Dnrsi.html'>Is it time to archive your data?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq9%26site_id%3Dnrsi.html'>Is the technical expertise to do the conversion available to you?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq10%26site_id%3Dnrsi.html'>Are you ready to learn about how to use Unicode?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq11%26site_id%3Dnrsi.html'>Do your colleagues use Unicode?</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dutconvertq12%26site_id%3Dnrsi.html'>Are you willing and able to “straddle the fence”?</a></p> </div><p></p> <p>Unicode contains a huge inventory of characters, currently over 100,000. But that doesn’t guarantee that it has every character you need. There has been a major effort over almost two decades to identify all the characters that need to be in Unicode, and most people will find that their data can be represented in Unicode with no problem. 
But if the language you work in happens to have one of those rare characters that hasn’t yet been added to Unicode, or, worse, a whole script that isn’t yet included, then the first order of business is to request that it be added. </p> <p>What do you do if there is no established orthography for the language? That, of course, gives you much greater flexibility. As long as you choose from among the characters already available in Unicode and follow standard conventions in how those characters are used, there should be no problem. For more details on this subject, see <a href='/cms/scripts/page.php%3Fid%3Dorthographydev%26site_id%3Dnrsi.html'>Orthography development in relation to Unicode</a>.</p> <p>But most people reading this article need to work with an established orthography. So you need to check whether Unicode supports it. Most major languages are already fully supported, but minority languages may not be. So, make a list of all the characters that you need to use. You will probably need help from a Unicode expert to know if the characters are in Unicode, but you can get started on your own by listing what you need and then taking your list to the expert. </p> <p>Here are some of the issues you should consider:</p> <ul class='dListUnordered'> <li>Upper and lower case</li> <li>Diacritics</li> <li>Borrowed words</li> <li>Punctuation and other symbols</li> <li>Phonetic transcription</li> <li>Other languages and scripts you may use</li> </ul> <div class='Note'><p class='Note_heading'>More advanced:</p><p>Inventorying the Character Set and Comparing it to Unicode:</p> <ul class='dListUnordered'> <li><span class='KeyTerm'>Include upper-case as well as lower-case.</span> Some people think that if a character never occurs first in a word, they don’t need an upper-case version of it. Then they run into a situation where they need to use all caps. 
If your writing system regularly makes a distinction between upper and lower case (or any analogous difference in a character’s appearance that is not predictable from context), list an upper-case version of each character.</li> <li>On the other hand, if the shape of the character changes based on its immediate context, then also <span class='KeyTerm'>list all the variant shapes</span>. For example, in Arabic-based scripts, letters change shape depending on whether they occur at the beginning, in the middle, or at the end of a word or stand alone. Unicode handles this by treating all the variant shapes as the same character and relying on smart fonts to give the right shape in each context. (See <a href='/cms/scripts/page.php%3Fid%3Dutconvertq3%26site_id%3Dnrsi.html'>Are there fonts available that will work for you?</a> for more information.)</li> <li><span class='KeyTerm'>Include diacritics in your list.</span> List all possible combinations of diacritics with base characters (again, remember to include upper-case) as well as combinations of two or more diacritics on the same letter. Unicode does provide ways to form arbitrary combinations of base characters and diacritics, but the most common combinations are also available pre-assembled as single characters. So, you’ll want to check if these “pre-composed” characters are available for the combinations that you use. If they aren’t, then you need to make sure that the diacritic is included as a separate character that can be combined with other characters.</li> <li><span class='KeyTerm'>Include any characters that only occur in borrowed words</span>.</li> <li><span class='KeyTerm'>Include punctuation characters and any other special symbols</span> other than ordinary word-building characters. Almost certainly they will be included, but if you use anything unusual that isn’t in a major language, you should check this out carefully.</li> <li><span class='KeyTerm'>Consider all languages that you work with</span>, including those that you may only use occasionally. 
Major languages are already included, so that’s not a problem. But if you want to exchange data with people working in related languages, you may need characters for those other languages.</li> <li><span class='KeyTerm'>Consider symbols you need for phonetic transcription.</span> All symbols currently approved by the <a href='http://www.arts.gla.ac.uk/IPA' target='_blank'><img src='/cms/assets/icons/offsite_link.png'> International Phonetic Association (IPA)</a> are already included. However, if you use a different transcription system, such as Americanist phonetic characters or phonetic symbols that are only used in a particular part of the world or a certain language family, you need to check.</li> <li>If any language that you work with has <span class='KeyTerm'>more than one script</span>, consider each script separately.</li> <li><span class='KeyTerm'>Don’t plan to depend on formatting such as underlining, superscripting, or italics in order to represent your characters.</span> For example, if you need to represent a superscript <span usv='02B0' class='USV_sprite_wrapper' style='display: inline-block; overflow: hidden; position: relative; zoom: 1; *display: inline; width: 10px; height: 10px;'><img src='/cms/sites/nrsi/media/usv/0280-02FF.0.12pt.png' border='0' style='position: absolute; left: -2px; top: -122px;'></span>, don’t plan to use an ordinary h and just apply superscripting to it. Besides being clumsy to type, this method is unreliable. If all the formatting ever gets stripped off your text, then the distinction between h and <span usv='02B0' class='USV_sprite_wrapper' style='display: inline-block; overflow: hidden; position: relative; zoom: 1; *display: inline; width: 10px; height: 10px;'><img src='/cms/sites/nrsi/media/usv/0280-02FF.0.12pt.png' border='0' style='position: absolute; left: -2px; top: -122px;'></span> will disappear. You will need to represent these two as separate characters in Unicode. 
(And, yes, Unicode does have a separate character, U+02B0, for <span usv='02B0' class='USV_sprite_wrapper' style='display: inline-block; overflow: hidden; position: relative; zoom: 1; *display: inline; width: 10px; height: 10px;'><img src='/cms/sites/nrsi/media/usv/0280-02FF.0.12pt.png' border='0' style='position: absolute; left: -2px; top: -122px;'></span>.)</li> <li><span class='KeyTerm'>You don’t need to worry at this point about fine details of appearance</span>, as long as the character in Unicode is recognizably the same character as the one you are using. For example, some languages prefer an upper-case eng (<img src='/cms/sites/nrsi/media/1024_0_14x18.png' height='18' width='14'>) that is just a larger version of the lower-case eng ŋ; others prefer one that looks like a regular upper-case N with a tail (<img src='/cms/sites/nrsi/media/1024_2_14x18.png' height='18' width='14'>). These two shapes are considered “glyph variants” of the same character in Unicode and are represented the same way. You control which version you use through the fonts that you use. </li> <li>You <span class='Em'>do</span>, however, have to pay attention to how a character is used. <span class='KeyTerm'>Unicode sometimes includes more than one character with the same appearance.</span> For example, there is an apostrophe that is used for punctuation and a separate apostrophe (U+02BC MODIFIER LETTER APOSTROPHE) that is used to represent glottalization. The two characters look alike, but one is a punctuation mark and the other is a word-building character. You might have thought of them as the “same” character until now, but they function differently in a writing system, and that makes a difference for functions like selecting whole words, breaking lines, searching and sorting. So, you can’t just pick a character out of a Unicode chart because it looks right; you have to read the descriptions of each character to make sure you’ve got the right one. 
If you’ve been representing two characters the same way up until now, then in the process of conversion to Unicode, you will need to figure out some way to distinguish them. Besides the difference between punctuation marks and word-building characters, you also need to distinguish characters that are used as diacritics from ones with the same appearance that are full characters on their own.</li> <li>In general, <span class='KeyTerm'>Unicode itself is only concerned with the computer being able to recognize a character reliably.</span> Matters of appearance, such as variant shapes of letters in different contexts, preferences about letter shapes such as a double-story vs. a single-story “a”, and fine positioning of diacritics are not distinguished in Unicode itself, either because they are predictable from context (and thus should be handled by smart fonts) or because they are a matter of personal preference (and thus should be handled as formatting, i.e., by choosing what font you want to use to display the data or choosing options within the font).</li> </ul> </div><p></p> <p>In listing out the characters you need, you may find it helpful to look at the inventory of characters in the fonts that you are currently using. In general, you should verify that all of them are in Unicode. (If you are using a standard character set, such as the one used by the standard Windows Latin fonts or Big5 for Chinese, then all those characters are in Unicode.) </p> <p>Now, it could be that there are characters in an old custom font that you never use. As long as you can guarantee that they don’t ever occur in your existing data, you don’t need to worry about them being in Unicode. But if there is any doubt (for example, if they might have been typed by mistake), then it is best to plan to convert them to Unicode along with everything else.</p> <p>After doing this inventory and consulting with a Unicode expert to check whether all the characters you need are in Unicode, what happens if some characters are missing? 
In that case, you have three options:</p> <ul class='dListUnordered'> <li>Decide to discontinue using the characters that are missing and use something else that <span class='Em'>is</span> in Unicode. You can’t always do this, of course, but sometimes it is the best option.</li> <li>There may be a font available that includes the character you need in its “<a href='page.php%3Fid%3Dglossary%26site_id%3Dnrsi.html#pua'>private use area</a>”, a section of Unicode that is intended for local customization. Or, you may be able to arrange to have the characters you need added to the <a href='page.php%3Fid%3Dglossary%26site_id%3Dnrsi.html#pua'>private use area</a> of a font. </li> <li>Get the missing character(s) or scripts into Unicode. </li> </ul> <p>The last two options require consulting with a font designer and/or someone who is in regular contact with the Unicode Consortium. If you are in SIL, you may contact the NRSI for help in making Unicode proposals. Those outside SIL may find help through the <a href='http://linguistics.berkeley.edu/sei/' target='_blank'><img src='/cms/assets/icons/offsite_link.png'> Script Encoding Initiative</a>.</p> <a name='3c7f83a4'></a> <h3>Advanced Resources</h3> <ul class='dListUnordered'> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_6_3%26site_id%3Dnrsi.html'>Adding new characters and scripts to Unicode</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dorthographydev%26site_id%3Dnrsi.html'>Orthography development in relation to Unicode</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_3%26site_id%3Dnrsi.html'>Guidelines for Writing System Support: Roles and Actors</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Duttlegacymap%26site_id%3Dnrsi.html'>Creating a Chart of Your Legacy Mapping</a></li> </ul> <p>Back to <a href='/cms/scripts/page.php%3Fid%3Dutconvert2unicode%26site_id%3Dnrsi.html'>When to Convert to Unicode</a>.</p> <hr> <p><small>© 2003-2024 <a href='http://www.sil.org/' target='_blank'>SIL 
International</a>, all rights reserved, unless otherwise noted elsewhere on this page.<br> Provided by SIL's Writing Systems Technology team (formerly known as NRSI). Read our <a href="/privacy-policy.html">Privacy Policy</a>. <a href='/support.html'>Contact us here.</a></small></p> </div> </td> </table> </body> </html>