<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Character Sets and Encodings</title>
<meta name="description" content="Orientation for newcomers to Web internationalization about character sets and encodings on the Web." />
<script>
var f = { }

// AUTHORS should fill in these assignments:
f.directory = 'getting-started'+'/' // the path to this file, not including /International or the file name
f.filename = 'characters' // the file name WITHOUT extensions
f.authors = 'Richard Ishida, W3C' // author(s) and affiliations
f.previousauthors = '' // as above
f.modifiers = '' // people making substantive changes, and their affiliation
f.searchString = 'gs-characters' // blog search string - usually the filename without extensions
f.firstPubDate = '2006-01-16' // date of the first publication of the document (after review)
f.lastSubstUpdate = { date:'2009-05-01', time:'09:44'} // date and time of latest substantive changes to this document
f.status = 'notreviewed' // should be one of draft, review, published, notreviewed or obsolete
f.path = '../' // what you need to prepend to a URL to get to the /International directory

// AUTHORS AND TRANSLATORS should fill in these assignments:
f.thisVersion = { date:'2023-03-29', time:'16:15'} // date and time of latest edits to this document/translation
f.contributors = '' // people providing useful contributions or feedback during review or at other times
// also make sure that the lang attribute on the html tag is correct!
f.sources = '' // describes sources of information

// TRANSLATORS should fill in these assignments:
f.translators = 'xxxNAME, ORG' // translator(s) and their affiliation - a elements allowed, but use double quotes for attributes

f.breadcrumb = 'gettingstarted'
f.additionalLinks = ''
</script>
<script src="characters-data/translations.js"> </script>
<script src="../javascript/doc-structure/article-dt.js"> </script>
<script src="../javascript/boilerplate-text/boilerplate-en.js"> </script>
<!--TRANSLATORS must change -en in the line just above to the subtag for their language! -->
<script src="../javascript/doc-structure/article-2022.js"> </script>
<script src="../javascript/articletoc-2022.js"></script>
<link rel="stylesheet" href="../style/article-2022.css" />
<link rel="copyright" href="#copyright"/>
<link rel="stylesheet" href="characters-data/local.css" />
</head>
<body>
<header>
<nav id="mainNavigation"></nav><script>document.getElementById('mainNavigation').innerHTML = mainNavigation</script>
<h1>Character Sets and Encodings</h1>
</header>
<div id="audience">
<div id="updateInfo"></div><script>document.getElementById('updateInfo').innerHTML = g.updated</script>
</div>
<p>This page provides some orientation for newcomers to Web internationalization who don't really know where to start. The aim is to ease you gently into some of the material on the site.</p>
<p>You can find a selection of more detailed articles using the links to the right. Once you get some ideas from this page, you will probably just use <a href="/International/i18n-drafts/nav/learn">Learn to internationalize</a>, or the <a href="../resource-index">site search</a>.</p>
<section id="what">
<h2>What's it about?</h2>
<p>A character set is a collection of letters and symbols used in a writing system. 
For example, the ASCII character set covers letters and symbols for English text, ISO-8859-6 covers letters and symbols needed for many languages based on the Arabic script, and the Unicode character set contains characters for most of the living languages and scripts in the world.</p> <p>Characters in a character set are stored as one or more bytes in a computer. Each byte or sequence of bytes represents a given character. A character encoding is the key that maps a particular byte or sequence of bytes to particular characters that the font renders as text.</p> <p>There are many different character encodings. If the wrong encoding is applied to the bytes in memory, the result will be unintelligible text. It is therefore important, if people are to read your content, that you correctly label the character encoding used.</p> <p><strong>Learn more...</strong></p> <p><a href="../questions/qa-what-is-encoding">Character encodings for beginners</a> explains some of the basic concepts about character encodings, and why you should care. </p> <p><a href="../articles/definitions-characters/">Character encodings: Essential concepts</a> provides explanations of terminology such as Unicode, character sets, coded character sets, character encodings, the document character set, and character escapes.</p> </section> <section id="choosing"> <h2>Choosing an encoding</h2> <div class="sidenoteGroup"> <p><strong>Everyone developing content</strong>, whether content authors or programmers, should use the UTF-8 character encoding, unless there are very special reasons for using something else. 
(If you decide not to use UTF-8, you must choose one of the few encodings that are interoperably implemented across all browsers.)</p>
<p> </p>
<div class="sidenote">
<strong>Learn more...</strong>
<p>HTML & CSS authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/authoring-html.en?open=charset&open=choosing#choosing">Choosing and applying a character encoding</a></li>
</ul>
<p>Spec authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/developing-specs.en?open=characters&open=char_choosing#char_choosing">Choosing character encodings</a></li>
</ul>
<p>Server setup</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/server-setup#choosing">Choosing a character encoding</a></li>
</ul>
</div>
</div></section>
<section id="using" style="clear:both;">
<h2>Declaring and applying an encoding</h2>
<div class="sidenoteGroup">
<p><strong>Content developers and programmers</strong> must ensure that the character encoding used for a document or page is declared in the right way.</p>
<p>You must also ensure that your data is actually saved in the encoding you have chosen; it is not sufficient to just label it.</p>
<p>(Note that with XHTML, encoding declarations are not always straightforward; they require an understanding of <a href="/International/articles/serving-xhtml/">'standards' vs. 'quirks' modes</a>, and the impact of the XML declaration.) 
</p>
<p><strong>Content developers and webmasters</strong> may also need to ensure that the <em>server</em> delivers content with the correct character encoding declarations, since server settings can override in-document declarations.</p>
<div class="sidenote">
<strong>Learn more...</strong>
<p>HTML & CSS authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/authoring-html.en?open=charset&open=indoc#indoc">Declaring the character encoding for HTML</a></li>
<li><a href="https://www.w3.org/International/techniques/authoring-html.en?open=charset&open=css#css">Declaring the character encoding for a CSS style sheet</a></li>
</ul>
<p>Spec developers</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/developing-specs.en?open=characters&open=char_identifying#char_identifying">Identifying character encodings</a></li>
</ul>
<p>Server setup</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/server-setup#setting">Setting the HTTP charset parameter</a></li>
<li><a href="https://www.w3.org/International/techniques/server-setup#htaccess">Setting character encoding information using .htaccess</a></li>
</ul>
</div>
</div>
</section>
<section id="escapes">
<h2>Escapes</h2>
<div class="sidenoteGroup">
<p>Escapes are a way of representing a character using only ASCII text. They provide a way of representing characters that are not available in the character encoding you are using, or a way of avoiding the use of a character for other reasons (such as when it may conflict with syntax). 
You should be clear on when and how these escapes should be used.</p>
<p> </p>
<p> </p>
<div class="sidenote">
<strong>Learn more...</strong>
<p>HTML & CSS authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/authoring-html.en?open=charset&open=escapes#escapes">Using escapes to represent characters</a></li>
</ul>
<p>SVG authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/authoring-svg#escapes">Using escapes to represent characters</a></li>
</ul>
<p>XML authors</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/authoring-xml#escapes">Using escapes to represent characters</a></li>
</ul>
<p>Spec developers</p>
<ul>
<li><a href="https://www.w3.org/International/techniques/developing-specs.en?open=characters&open=char_escapes#char_escapes">Designing character escapes</a></li>
</ul>
</div>
</div></section>
<section id="address" style="clear: both;">
<h2>Web addresses</h2>
<div class="sidenoteGroup">
<p>Web addresses can also include non-ASCII characters. The user does little other than click on the appropriate link or enter the text as they see it; the heavy lifting is done by the user agent. Even so, you may be interested to know how this works.</p>
<p>Specification developers should design their specifications so that non-ASCII web addresses can be used.</p>
<div class="sidenote">
<strong>Learn more...</strong>
<p>HTML & CSS authors</p>
<ul>
<li><a href="https://www.w3.org/International/articles/idn-and-iri/">An Introduction to Multilingual Web Addresses</a></li>
</ul>
</div>
</div>
</section>
<footer id="thefooter"></footer><script>document.getElementById('thefooter').innerHTML = g.bottomOfPage</script>
<script>completePage()</script>
</body>
</html>