

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 1</title> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta name="keywords" content="UNESCO,Guidelines,WSI,WSIs,writing system,writing systems,encoding,encodings,Unicode"> <link rel="stylesheet" href="/cms/assets/misc/css/default.css" type="text/css"> <link rel="stylesheet" href="/cms/sites/nrsi/themes/default/_css/default.css" type="text/css"> <style type="text/css"> <!-- A.GlobalNavLink, A.GlobalNavLink:visited { color: #FFFF00; font-size: smaller; font-weight: bold; } --> </style> <!-- 2023-05-25 PKM Added for Google Analytics 4 --> <!-- Google tag (gtag.js) --> <script async src="https://www.googletagmanager.com/gtag/js?id=G-FVXRGR2Q9V"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-FVXRGR2Q9V'); </script> <title>Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 1</title> </head> <body style="padding:0; margin:0"> <style> .archive_notice { /* box-shadow: black 0pt 4pt 20px -8px inset; */ display: block; background-color: orange; font-size: 12pt; font-style: normal; font-weight: lighter; line-height: 100%; padding: 5pt; text-align: center; width: auto; } form { display: none } .webform::before { content: "Forms are disabled on this static version of the site."; display: block; width: fit-content; } </style> <div class="archive_notice"> This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. 
Please consult our other sites for more current information: <a href="https://software.sil.org">software.sil.org</a>, <a href="https://scriptsource.org">ScriptSource</a>, <a href="https://silnrsi.github.io/FDBP/">FDBP</a>, and <a href="https://silnrsi.github.io/silfontdev/">silfontdev</a> </div> <table width="100%" height="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td style="background: #0068a6; padding-left:20; padding-top:10; white-space:nowrap;" width="110" valign="top"> <p><a href="http://www.sil.org/"> <!-- <img src="/cms/sites/nrsi/themes/default/_media/SIL_logo_left_column.gif" width="86" height="80" border="0"> --> <img src="/cms/sites/nrsi/themes/default/_media/SIL_Logo_TM_Blue_2014.png" width="85" height="95" border="0" alt=""> </a><br><br></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dhome%26site_id%3Dnrsi.html">Home</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dcontactus%26site_id%3Dnrsi.html">Contact Us</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dgeneral%26site_id%3Dnrsi.html">General</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dbabel%26site_id%3Dnrsi.html">Initiative B@bel</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dwsi_guidelines%26site_id%3Dnrsi.html">WSI Guidelines</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dencoding%26site_id%3Dnrsi.html">Encoding</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dencodingprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dunicode%26site_id%3Dnrsi.html">Unicode</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodetraining%26site_id%3Dnrsi.html">Training</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodetutorials%26site_id%3Dnrsi.html">Tutorials</a></p> <p class="Cat3"><a 
class="Cat3" href="/cms/scripts/page.php%3Fid%3Dunicodepua%26site_id%3Dnrsi.html">PUA</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dconversion%26site_id%3Dnrsi.html">Conversion</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dencconvres%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dconversionutilities%26site_id%3Dnrsi.html">Utilities</a></p> <p class="Cat4"><a class="Cat4" href="/cms/scripts/page.php%3Fid%3Dteckit%26site_id%3Dnrsi.html">TECkit</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dconversionmaps%26site_id%3Dnrsi.html">Maps</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dencodingresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dinput%26site_id%3Dnrsi.html">Input</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinpututilities%26site_id%3Dnrsi.html">Utilities</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputtutorials%26site_id%3Dnrsi.html">Tutorials</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dinputresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dtypedesign%26site_id%3Dnrsi.html">Type Design</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dtypedesignprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dfontdesigntools%26site_id%3Dnrsi.html">Design Tools</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Dfontformats%26site_id%3Dnrsi.html">Formats</a></p> <p class="Cat2"><a class="Cat2" 
href="/cms/scripts/page.php%3Fid%3Dtypedesignresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloads%26site_id%3Dnrsi.html">Font Downloads</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsgentium%26site_id%3Dnrsi.html">Gentium</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsdoulos%26site_id%3Dnrsi.html">Doulos</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontdownloadsipa%26site_id%3Dnrsi.html">IPA</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Drendering%26site_id%3Dnrsi.html">Rendering</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingprinciples%26site_id%3Dnrsi.html">Principles</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingtechnologies%26site_id%3Dnrsi.html">Technologies</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Drenderingopentype%26site_id%3Dnrsi.html">OpenType</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Drenderinggraphite%26site_id%3Dnrsi.html">Graphite</a></p> <p class="Cat2"><a class="Cat2" href="/cms/scripts/page.php%3Fid%3Drenderingresources%26site_id%3Dnrsi.html">Resources</a></p> <p class="Cat3"><a class="Cat3" href="/cms/scripts/page.php%3Fid%3Dfontfaq%26site_id%3Dnrsi.html">Font FAQ</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dlinks%26site_id%3Dnrsi.html">Links</a></p> <p class="Cat1"><a class="Cat1" href="/cms/scripts/page.php%3Fid%3Dglossary%26site_id%3Dnrsi.html">Glossary</a></p> <br> </td> <td valign="top" style="padding:0" xwidth="650"> <div style="background: #6699CC url(/cms/sites/nrsi/themes/default/_media/home_banner_gradient.gif) no-repeat right; padding:0 0 0 25; height:36px; margin:0; color:#FFFFFF;"> <p style="font-family:Times New Roman; font-size:25px; color:#FFFFFF; 
padding:10 0 0 0; margin:0 0 0 0">Computers & Writing Systems</p> </div> <div style="padding:0 0 0 0; background-color:#000000; color:#FFFFFF"> <table width='100%'> <tr> <td style="padding: 0 0 0 25px"><a class="GlobalNavLink" href="http://www.sil.org/">SIL HOME</a> | <a class="GlobalNavLink" href="https://software.sil.org/products/">SIL SOFTWARE</a> | <a class="GlobalNavLink" href="/support.html">SUPPORT</a> | <a class="GlobalNavLink" href="https://www.givedirect.org/donate/?cid=13536">DONATE</a> | <a class="GlobalNavLink" href="/privacy-policy.html">PRIVACY POLICY</a> </td> <td align='right' width='20%'> <script async src="https://cse.google.com/cse.js?cx=0760bf09a6bff4b0c"></script><style>.gsc-control-cse {padding: 0.6em; min-width: 10em; width: 18em; max-width: 20em} form.gsc-search-box {display: unset;}</style><div class="gcse-search"></div> </td> </tr> </table> </div> <div style="padding:0 25 25 25"> <p class='CategoryPath'>You are here: <a class='CategoryPath' href='/cms/scripts/page.php%3Fid%3Dgeneral%26site_id%3Dnrsi.html'>General</a> &gt; <a class='CategoryPath' href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines%26site_id%3Dnrsi.html'>WSI Guidelines</a><br> Short URL: <a href='/wsi_guidelines_sec_6_1.html'>https://scripts.sil.org/WSI_Guidelines_Sec_6_1</a></p> <!-- --> <!-- <div class='Warning' > <p class='Warning_heading' > Site unavailability </p> <p> Due to essential repairs, this website may be unavailable at times during September 6 (Tue) and 7 (Wed). We apologize for the inconvenience. 
</p> </div> --> <h1>Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 1 </h1> <p> <span class='author_date_hits'>Peter Constable, 2003-09-05</span></p><div class='Sidebar'><p><span class='Runin'>UNESCO project Initiative B@bel</span></p> <p>A complete index of all SIL's contributions to UNESCO‘s project Initiative B@bel can be found <a href='/cms/scripts/page.php%3Fid%3Dbabel%26site_id%3Dnrsi.html'>here</a>.</p> </div><p></p> <div class='Sidebar'><p><span class='Runin'>Guidelines Table of Contents</span></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_1%26site_id%3Dnrsi.html'>Section 1: Components of a Writing System Implementation</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_2%26site_id%3Dnrsi.html'>Section 2: The Process of WSI Development</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_3%26site_id%3Dnrsi.html'>Section 3: Roles and Actors</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_4%26site_id%3Dnrsi.html'>Section 4: Keys to Success</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_5_1%26site_id%3Dnrsi.html'>Section 5: Technical Details: Characters, Codepoints, Glyphs</a></p> <ul class='dListUnordered'> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_5_1%26site_id%3Dnrsi.html'>Part 1: Characters</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_5_2%26site_id%3Dnrsi.html'>Part 2: Codepoints and Glyphs</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_5_3%26site_id%3Dnrsi.html'>Part 3: Keystrokes and Codepoints</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_5_4%26site_id%3Dnrsi.html'>Part 4: Further Reading</a></li> </ul> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_6_1%26site_id%3Dnrsi.html'>Section 6: Technical Details: Encoding and Unicode</a></p> <ul class='dListUnordered'> <li><a 
href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_6_1%26site_id%3Dnrsi.html'>Part 1: An Introduction to Encodings</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_6_2%26site_id%3Dnrsi.html'>Part 2: An Introduction to Unicode</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_6_3%26site_id%3Dnrsi.html'>Part 3: Adding New Characters and Scripts to Unicode</a></li> </ul> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_7%26site_id%3Dnrsi.html'>Section 7: Technical Details: Data Entry and Editing</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_8%26site_id%3Dnrsi.html'>Section 8: Technical Details: Glyph Design</a></p> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_9_1%26site_id%3Dnrsi.html'>Section 9: Technical Details: Smart Rendering</a></p> <ul class='dListUnordered'> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_9_1%26site_id%3Dnrsi.html'>Part 1: The Rendering Process</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_9_2%26site_id%3Dnrsi.html'>Part 2: Glyph Processing &mdash; Dumb Fonts</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_9_3%26site_id%3Dnrsi.html'>Part 3: Glyph Processing &mdash; Smart Fonts</a></li> <li><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_sec_9_4%26site_id%3Dnrsi.html'>Part 4: User Interaction</a></li> </ul> <p><a href='/cms/scripts/page.php%3Fid%3Dwsi_guidelines_glossary%26site_id%3Dnrsi.html'>Glossary</a></p> </div><p></p> <p class='TOCTitle'>Contents</p> <div class='TOC'> <ol> <li class='TOC2'><a href='#4d3ac3f7'>6.1&nbsp;&nbsp; An Introduction to Encodings</a> <ol> <li class='TOC3'><a href='#6eea8d1d'>6.1.1&nbsp;&nbsp; Text as numbers</a></li> <li class='TOC3'><a href='#a4633843'>6.1.2&nbsp;&nbsp; Industry standard legacy encodings</a></li> </ol> </li> </ol> </div> <p></p> <a name='4d3ac3f7'></a> <h2>6.1&nbsp;&nbsp; An Introduction to Encodings</h2> <p>Computer systems employ a wide 
variety of character encodings. The most important of these is Unicode. It is also important, however, to understand other encodings and how they relate to Unicode. This section introduces basic encoding concepts, briefly mentions legacy encodings, and gives an introduction to Unicode. It also describes the process of interacting with the Unicode Consortium to get new characters and scripts accepted into the standard.</p> <a name='6TxtNum'></a> <a name='6eea8d1d'></a> <h3>6.1.1&nbsp;&nbsp; Text as numbers</h3> <p><span class='KeyTerm'>Encoding</span> refers to the process of representing information in some form. In computer systems, we encode written language by representing the <span class='KeyTerm'>graphemes</span> or other <span class='KeyTerm'>text elements</span> of the writing system in terms of sequences of <span class='KeyTerm'>characters</span>, units of textual information within some system for representing written texts. These characters are in turn represented within a computer in terms of the only means of representation the computer knows how to work with: binary numbers. For example, <span class='SmallCaps'>ASCII</span> represents the capital letter A as the number 65 (binary 1000001). </p> <p>A <span class='KeyTerm'>character set encoding</span> (or <span class='KeyTerm'>character encoding</span>) is a system for doing this. Any character set encoding involves at least two components: a set of characters, and some system for representing them in terms of the processing units used within the computer. </p> <a name='6IndLeg'></a> <a name='a4633843'></a> <h3>6.1.2&nbsp;&nbsp; Industry standard legacy encodings</h3> <p>Encoding standards are important for at least two reasons. First, they provide a basis for software developers to create software that provides appropriate text behaviors. Second, they make it possible for data to be exchanged between users. </p> <p>The <span class='SmallCaps'>ASCII</span> standard was among the earliest encoding standards, and was minimally adequate for US English text. 
It was not even minimally adequate for British English, however, let alone fully adequate for English-language publishing or for almost any other language. Not surprisingly, it did not take long for new standards to proliferate. These have come from two sources: standards bodies and independent software vendors. </p> <p>Software vendors have often developed encoding standards to meet the needs of a particular product in relation to a particular market. Among personal computer vendors, Apple created various standards that differed from IBM and Microsoft standards in order to suit the distinctive graphical nature of the Macintosh product line. Similarly, as Microsoft began development of Windows, the needs of the graphical environment led them to develop new codepages (ways of encoding character sets). These are the familiar Windows codepages, such as codepage 1252, variously known as “Western”, “Latin 1” or “<span class='SmallCaps'>ANSI</span>”.</p> <p>The other main source of encoding standards is national or international standards bodies. A national standards body may take a vendor’s standard and promote it to the level of a national standard, or it may create new encoding standards apart from existing vendor standards. In some cases, a national standard may be adopted as an international standard, as was the case with <span class='SmallCaps'>ASCII</span>.</p> <a name='6IndCust h2: Industry standards versus custom encodings'></a> <p>It is important to understand the relationship between industry standard encodings and individual software products. Any commercial software product is explicitly designed to support a specific collection of character set encoding standards. </p> <p>Of course, the problem with software based solely on standards is that, if you need to work with a character set that your software does not understand, then you are stuck. 
This happens because software vendors have designed their products with specific markets in mind, and those markets have essentially never included people groups that are economically undeveloped or are not accessible to the vendor. This is not unfair on the part of software vendors; they can only support something they know about and that is predictable, implying a standard.</p> <p>When the available software does not support the writing systems they need to work with, linguists and others create their own solutions. They define their own character set and encoding, they “hack” out fonts that support that character set using that encoding so that they can view data, they create input methods (keyboards) to support that character set and encoding so that they can create data, and then go to work.</p> <p>Such practice is quite a reasonable thing to do from the perspective of doing what it takes to get work done. People who have needed to resort to this have been quite resourceful in creating their own solutions. There is a dark side to this, however. Although the user has defined a custom codepage, the software they are using is generally still assuming that some industry standard encoding is being used.</p> <p>The most serious problem with custom codepages, which affects data archiving and interchange, is that the data is useless apart from a custom font. Dependence on a particular font creates lots of hassles when exchanging data: you always have to send someone a font whenever you send them a document, or make sure they already have it. </p> <p>One context in which this is a particular problem is the Web. People often work around the problem by using Adobe Acrobat (PDF) format, but for some situations, including the Web, this is a definite limitation. If data is being sent to a publisher, who will need to edit and typeset the document, Acrobat is not a good option<span class='footnote_ref'><a href='#footnote_1' name='_ftnref_1'>1</a></span>. 
Custom codepages are especially a problem for a publisher who receives content from multiple sources, since the publisher is forced to juggle and maintain a number of proprietary fonts. Furthermore, if they are forced to use these fonts, they are hindered in their ability to control the design aspects of the publication.</p> <p>The right way to avoid all of these problems is to follow a standard encoding that includes the characters you need. This is precisely the type of solution that is made possible by Unicode, which is being developed to have a universal character set that covers all of the scripts in the world.</p> <div class='Note'><p class='Note_heading'>Copyright notice</p><p>(c) Copyright 2003 UNESCO and SIL International Inc.</p> </div> <br><hr clear='all'> <hr align='left' class='footnote_rule'><table> <tr> <td class='footnote_number' valign='top'><a href='#_ftnref_1' name='footnote_1'>1</a></td> <td class='footnote_text'>Once a document is typeset and ready for press, however, Acrobat format is generally a good option.</td> </tr> </table> <hr> <p><small>© 2003-2024 <a href='http://www.sil.org/' target='_blank'>SIL International</a>, all rights reserved, unless otherwise noted elsewhere on this page.<br> Provided by SIL's Writing Systems Technology team (formerly known as NRSI). Read our <a href="/privacy-policy.html">Privacy Policy</a>. <a href='/support.html'>Contact us here.</a></small></p> </div> </td> </table> </body> </html>
