<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Character encodings</title>
<meta name="description" content="How to declare the character encoding of a document in XML or HTML, and various useful links to related information" />
<script type="application/javascript">
var f = { }
// AUTHORS should fill in these assignments:
f.directory = ''+'/'; // the name of the directory this file is in
f.filename = 'O-charset'; // the file name WITHOUT extensions
f.authors = 'Bert Bos, W3C'; // author(s) and affiliations
f.previousauthors = ''; // as above
f.modifiers = 'Martin J. Dürst, W3C; Richard Ishida, W3C'; // people making substantive changes, and their affiliation
f.searchString = 'article-O-charset'; // blog search string - usually the filename without extensions
f.firstPubDate = '1996-05-31'; // date of the first publication of the document (after review)
f.lastSubstUpdate = { date:'2006-07-20', time:'09:00'} // date and time of latest substantive changes to this document
f.status = 'obsolete'; // should be one of draft, review, published, notreviewed or obsolete
f.path = './' // what you need to prepend to a URL to get to the /International directory
// AUTHORS AND TRANSLATORS should fill in these assignments:
f.thisVersion = { date:'2016-01-25', time:'09:46'} // date and time of latest edits to this document/translation
f.contributors = ''; // people providing useful contributions or feedback during review or at other times
// also make sure that the lang attribute on the html tag is correct!
// TRANSLATORS should fill in these assignments:
f.translators = 'xxxNAME, ORG'; // translator(s) and their affiliation - a elements allowed, but use double quotes for attributes
f.breadcrumb = 'characters';
f.additionalLinks = ''
</script>
<script type="text/javascript" src="O-charset-data/translations.js"> </script>
<script type="text/javascript" src="javascript/doc-structure/article-dt.js"> </script>
<script type="text/javascript" src="javascript/boilerplate-text/boilerplate-en.js"> </script><!-- TRANSLATORS must change -en to the subtag for their language! -->
<script type="text/javascript" src="javascript/doc-structure/article.js"> </script>
<script type="text/javascript" src="javascript/articletoc-html5.js"></script>
<link rel="stylesheet" href="style/article-2016.css" type="text/css" />
<link rel="copyright" href="#copyright"/>
<!--[if lt IE 9]><script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<style>section { text-decoration: line-through; }</style>
</head>
<body>
<header>
<nav id="mainNavigation"></nav><script>document.getElementById('mainNavigation').innerHTML = mainNavigation</script>
<h1>Character encodings</h1>
</header>
<section>
<div id="updateInfo"></div><script>document.getElementById('updateInfo').innerHTML = g.updated</script>
</section>
<p class="unlinked"><strong>Warning:</strong> This page has been discontinued, and the information it contains is out of date and incorrect!</p>
<p>You may want to try one of the following pages instead:</p>
<ul>
<li><a href="/International/tutorials/tutorial-char-enc/">Handling character encodings in HTML and CSS (tutorial)</a></li>
<li><a href="/International/questions/qa-html-encoding-declarations">Declaring character encodings in HTML</a></li>
</ul>
<div class="obsolete">
<section>
<h2 id="doccharset"><a href="#doccharset">The Document Character Set</a></h2>
<p>The document character set for XML and HTML 4.0 is <a
href="/International/tutorials/tutorial-char-enc/#Slide0040">Unicode</a> (also known as ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. It does not mean that documents have to be transmitted in Unicode: as long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode. Read more about the <a href="https://www.w3.org/International/questions/qa-doc-charset">document character set</a>.</p>
</section>
<section>
<h2 id="declaring"><a href="#declaring">Declaring encodings</a></h2>
<p>It is very important that the character encoding of any XML or (X)HTML document is clearly labeled, so that clients can easily map the encoded text to Unicode. This can be done in the following ways:</p>
<ul>
<li>
<p>Send the 'charset' parameter in the <a href="/International/tutorials/tutorial-char-enc/#Slide0270">Content-Type header of HTTP</a>. Example:</p>
<div class="example"><code>Content-Type: text/html; charset=utf-8</code></div>
<p>To do this you will need access to the server settings, or you will need to serve your document via scripting (see <a href="/International/O-HTTP-charset">Setting the HTTP charset parameter</a> for more information).</p>
</li>
<li>
<p>For XML (including XHTML), use the encoding pseudo-attribute in the XML declaration at the start of a document or the text declaration at the start of an entity. Example:</p>
<div class="example"><code>&lt;?xml version="1.0" encoding="utf-8" ?&gt;</code></div>
<p>There are <a href="/International/articles/serving-xhtml/#declaration">potential issues</a> you should be aware of when using this with XHTML 1.0 served as HTML.</p>
</li>
<li>
<p>For HTML or XHTML served as HTML, you should always use the <code>&lt;meta&gt;</code> tag inside <code>&lt;head&gt;</code>.
Example:</p>
<div class="example"><code>&lt;meta http-equiv="Content-Type" content="text/html;charset=utf-8"&gt;</code></div>
<p>For XHTML, you need a slash at the end:</p>
<div class="example"><code>&lt;meta http-equiv="Content-Type" content="text/html;charset=utf-8" /&gt;</code></div>
</li>
</ul>
<p>For a discussion of which approach is best for which type of (X)HTML document, see the tutorial <a href="/International/tutorials/tutorial-char-enc/">Character sets &amp; encodings in XHTML, HTML and CSS</a>.</p>
<p>The examples above show declarations for <code>UTF-8</code> encoded content. This is likely to be the best choice of encoding for most purposes, but it is not the only possibility.</p>
<p>If you are not using UTF-8, replace the <code>utf-8</code> text in the examples above with the name of the encoding you have chosen. You can see the full (and long) list of <a href="http://www.iana.org/assignments/character-sets">character encoding names registered by IANA</a>. In practice, only a few encodings are likely to be used: <code>ISO-8859-1</code> (Latin-1), <code>US-ASCII</code>, <code>UTF-16</code>, the other encodings in the ISO-8859 series, <code>iso-2022-jp</code>, <code>euc-kr</code>, and so on.</p>
</section>
<section>
<h2 id="ensuring"><a href="#ensuring">Ensuring the declaration works</a></h2>
<p>Using the encoding declarations above, whether in HTTP or in the content, is not enough on its own. You also need to:</p>
<ul>
<li>
<p>Save your data in the appropriate encoding from your editing environment.</p>
</li>
<li>
<p>Ensure that there is no conflict between what you declare in the document and what the server automatically applies, since server settings override in-document declarations.</p>
</li>
</ul>
<p>For more information on these topics, follow the links in <a href="https://www.w3.org/International/questions/qa-changing-encoding">Changing (X)HTML page encoding to UTF-8</a>.
Although it is written from a UTF-8 perspective, it applies to whatever encoding you use.</p>
</section>
<section>
<h2 id="bytheway"><a href="#bytheway">By the way</a></h2>
<p>Values for the encoding attribute can be found in the <a href="http://www.iana.org/assignments/character-sets">IANA registry</a>. Note that these are called <em>charset</em> names, although in reality they refer to the encodings, not the character sets.</p>
<p>For in-depth information on the term 'charset', see an article by Dan Connolly (<cite><a href="../MarkUp/html-spec/charset-harmful.html">"Character Set" Considered Harmful</a></cite>) and a response by Glenn Adams (<a href="http://ksi.cpsc.ucalgary.ca/archives/HTML-WG/html-wg-95q2.messages/0078.html">Character Set Terminology, SC2 vs. SC18 vs. Internet Standards</a>).</p>
<p>Historic note: Rick Jelliffe proposed to use the <a href="http://lists.w3.org/Archives/Public/w3c-sgml-wg/1996Dec/0104.html">SPREAD entities</a> from ERCS.</p>
</section>
<section>
<h2 id="endlinks"><a href="#endlinks">Further reading</a></h2>
<p>Helpful introductions:</p>
<ul id="full-links">
<li>
<p><a href="/International/getting-started/characters">Introducing Character Sets and Encodings</a></p>
</li>
<li>
<p><a href="/International/tutorials/tutorial-char-enc/">Tutorial: Character sets &amp; encodings in XHTML, HTML and CSS</a></p>
</li>
<li>
<p><a href="/International/questions/qa-doc-charset">FAQ: Document character set</a></p>
</li>
</ul>
<p>References in specifications:</p>
<ul>
<li>
<p><a href="https://www.w3.org/Protocols/rfc2068/rfc2068.txt">charset parameter</a></p>
</li>
<li>
<p><a href="/TR/REC-xml#charencoding">encoding pseudo-attribute</a></p>
</li>
<li>
<p><a href="/TR/REC-xml#sec-prolog-dtd">XML declaration</a></p>
</li>
<li>
<p><a href="/TR/REC-xml#sec-TextDecl">text declaration</a></p>
</li>
<li>
<p><a href="http://www.ietf.org/rfc/rfc3629.txt" title="RFC 3629">UTF-8</a></p>
</li>
<li>
<p><a
href="/TR/REC-html40/charset.html#h-5.2.2">&lt;meta&gt;</a></p>
</li>
</ul>
<p>Other links:</p>
<ul>
<li>
<p><a href="/International/questions/qa-escapes">Using character entities and NCRs</a></p>
</li>
<li>
<p><a href="/International/questions/qa-headers-charset">Checking HTTP Headers</a></p>
</li>
<li>
<p><a href="/International/questions/qa-htaccess-charset">Setting 'charset' information in .htaccess</a></p>
</li>
<li>
<p>Interesting test pages: <a href="http://www.unicode.org/iuc/iuc10/languages.html">The 10th Unicode Conference</a></p>
</li>
<li>
<p>Characters and encodings in the <a href="/International/resource-index#charset">Topic index</a></p>
</li>
<li>
<p>Characters and encodings in the <a href="/International/technique-index#charset">Techniques index</a></p>
</li>
</ul>
</section>
<footer id="thefooter"></footer><script type="text/javascript">document.getElementById('thefooter').innerHTML = g.bottomOfPage</script>
<script type="text/javascript">completePage()</script>
</div>
</body>
</html>