<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta http-equiv="Content-Language" content="en-us"> <meta name="keywords" content="Unicode Standard, characters"> <title>Where is my Character?</title> <link rel="stylesheet" type="text/css" href="https://www.unicode.org/webscripts/standard_styles.css"> </head> <body> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <!-- BEGIN HEADER BAR --> <tr> <td colspan="2"> <table width="100%" border="0" cellpadding="0" cellspacing="0"> <tr> <td class="icon" style="width:38px; height:35px"> <a href="https://www.unicode.org/"> <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" alt="[Unicode]" width="34" height="33"></a> </td> <td class="icon" style="vertical-align:middle"> <a class="bar"> </a> <a class="bar" href="https://www.unicode.org/consortium/newcomer.html"><font size="3">General Information</font></a> </td> <td class="bar"> <a href="https://www.unicode.org/main.html" class="bar">Tech Site</a> | <a href="https://www.unicode.org/sitemap/" class="bar">Site Map</a> | <a href="https://www.unicode.org/search" class="bar">Search </a> </td> </tr> </table> </td> </tr> <tr> <td colspan="2" class="gray">&nbsp;</td> </tr> <!-- END HEADER BAR --> <tr> <td valign="top" width="25%" class="navCol"> <table class="navColTable" border="0" width="100%" cellspacing="4" cellpadding="0"> <tr> <td class="navColTitle">Contents</td> </tr> <tr> <td valign="top" class="navColCell"><a href="#Location">Location</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="#Variant_Shapes">Variant Shapes</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="#Duplicates">Duplicates</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="#Submissions">Submissions</a></td> </tr> </table> <table class="navColTable" border="0" width="100%" cellspacing="4" cellpadding="0"> <tr> <td 
class="navColTitle">Related Links</td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/faq/char_combmark.html">FAQ on Combining Marks</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/faq/han_cjk.html">FAQ on Chinese, Japanese &amp; Korean</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/faq/indic.html">FAQ on Indic Scripts &amp; Languages</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/roadmaps">Roadmaps</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/standard/standard.html">About the Unicode Standard</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/charts/">Code Charts</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/ucd/">Unicode Character Database</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/charts/charindex.html">Unicode Character Name Index</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://www.unicode.org/charts/unihan.html">Unihan Database</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="../unsupported.html">As Yet Unsupported Scripts</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="../supported.html"> Supported Scripts</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="../../alloc/Pipeline.html">Proposed New Characters (Pipeline Table)</a></td> </tr> <tr> <td valign="top" class="navColCell"><a href="/pending/proposals.html"> Submitting New Characters or Scripts</a></td> </tr> <tr> <td valign="top" class="navColCell"> <a href="https://github.com/unicode-org/last-resort-font"> Last Resort Font</a></td> </tr> </table> </td> <!-- BEGIN CONTENTS --> <td> <table> <tr> <td class="contents" valign="top"> <blockquote> <h1 align="left">Where is my Character?</h1> <p>If you are 
trying to find a specific character in the <a href="https://www.unicode.org/versions/latest/">Unicode Standard</a>, the first place to go is the <a href="https://www.unicode.org/charts/index.html">code charts</a>. The code charts are organized into blocks, which are groupings of related characters.</p> <p>For each character defined in Unicode you will find an assigned <i>code point: </i>a hexadecimal number that is used to represent that character in computer data.</p> <p>The very term <i>character</i> is rather ambiguous, and may be interpreted broadly or narrowly. In this document, we&#39;ll use a very broad sense. For more details, see <a href="https://www.unicode.org/reports/tr17/#CharactersVsGlyphs">UTR #17: Character Encoding Model</a>.</p> <h2><a name="Location">Location</a></h2> <p>You may not find the character in what you think is the obvious spot. While the characters in Unicode are grouped into blocks, this is only a rough grouping because characters can be categorized many different ways. In particular, punctuation and symbols are applicable across a very wide range of usages and scripts (writing systems). Even the notion of a <i>script</i> itself is not well-defined; text in a given language may make use of characters from multiple scripts. For example, the digits 0-9 are in widespread use; the Devanagari <i>danda</i> is used across many Indic scripts.</p> <p>Thus you may need to look in several locations to find your character. If you are using the book, you may find the printed character index in the back of the standard helpful. The same data is available online as a plain text file, <a href="https://www.unicode.org/Public/UCD/latest/ucd/Index.txt">Index</a>. Or you can use the web version of the <a href="https://www.unicode.org/charts/charindex.html">Unicode Character Name Index</a>. You can also do a text search in the online Unicode names list. For example, suppose you were searching for a &quot;Japanese kome&quot;, the character &#x203B;. 
By opening <a href="https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt">NamesList.txt</a> in your browser and searching for &quot;Japanese kome&quot;, you would find it under the entry:</p> <div> <blockquote> <p><tt>203B REFERENCE MARK<br> = Japanese kome<br> = Urdu paragraph separator<br> x (tibetan ku ru kha bzhi mig can - 0FBF)</tt></p> </blockquote> </div> <p>Documentation regarding the syntax conventions of the online Unicode names list can be found in <a href="https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html">Names List File Format</a>.</p> <p>For Han characters (Chinese, Japanese, and Korean) you can find the character you are looking for by using the printed Han Radical-Stroke Index in the book or by using the online <a href="https://www.unicode.org/charts/unihan.html">Unihan Database</a>.</p> <p>There are auxiliary charts which contain the Unicode characters organized in different ways. These can sometimes help you find your character. For example, see <a href="https://www.unicode.org/charts/collation/">Collation charts</a>, <a href="https://www.unicode.org/charts/script/index.html"> Script charts</a>, <a href="https://www.unicode.org/charts/case/">Case Mapping charts</a>, or <a href="https://www.unicode.org/charts/normalization/"> Normalization charts</a>. If you know what legacy character encoding your character is in, you might be able to find it in the <a href="https://icu.unicode.org/charts/charset">ICU Character Set Mapping Tables</a>.</p> <h2><a name="Variant_Shapes">Variant Shapes</a></h2> <p>You may not find a character simply because the charts do not specify the exact shape; they only provide a representative shape for identification. For example, a lowercase Cyrillic <i>p</i> could appear with any of the following character shapes (also called glyphs). 
The second is customary for italic in Russia, and the third is customary for italic in Serbia:</p> <div align="center"> <center> <table class="simple"> <tr> <th width="33%"><div align="center">Cyrillic <i>p</i></div></th> <th width="33%"><div align="center">Russian Italic</div></th> <th width="34%"><div align="center">Serbian Italic</div></th> </tr> <tr> <td style="text-align:center; width:33%"> <img border="0" src="cyrillic-p.gif" alt="cyrillic-p" width="42" height="49"></td> <td style="text-align:center; width:33%"> <img border="0" src="cyrillic-italic-p.gif" alt="cyrillic-italic-p" width="42" height="49"></td> <td style="text-align:center; width:34%"> <img border="0" src="cyrillic-italic-serbian-p.gif" alt="cyrillic-italic-serbian-p" width="39" height="38"></td> </tr> </table> </center> </div> <p>Characters may also take on different shapes in different contexts. For example, the Arabic character <i>hah</i> may have four different basic shapes.</p> <div align="center"> <center> <table class="simple"> <tr> <th width="50%"><div align="center">Representative shape in code chart</div></th> <th colspan="4" width="50%"><div align="center">Possible shapes in context</div></th> </tr> <tr> <td align="center"> <div align="center"><img border="0" src="heh-initial.gif" alt="heh-initial" width="39" height="38"></div></td> <td align="center"> <div align="center"><img border="0" src="heh-independent.gif" alt="heh-independent" width="39" height="38"></div></td> <td align="center"> <div align="center"><img border="0" src="heh-final.gif" alt="heh-final" width="39" height="38"></div></td> <td align="center"> <div align="center"><img border="0" src="heh-medial.gif" alt="heh-medial" width="39" height="38"></div></td> <td align="center"> <p align="center"> <img border="0" src="heh-initial.gif" alt="heh-initial" width="39" height="38"></td> </tr> </table> </center> </div> <p>The character you are looking for may be represented as a <i>sequence</i> of code points in Unicode. 
Here are examples of such characters, and their representation as a sequence of code points.</p> <div align="center"> <center> <table class="simple"> <tr> <th align="left"><div align="center">Character</div></th> <th align="left"><div align="center">Code Points</div></th> <th align="left"><div align="center">Linguistic Usage</div></th> </tr> <tr> <th><div align="center"><img border="0" src="ch.gif" alt="ch" width="42" height="49"></div></th> <td>0063&nbsp;0068</td> <td>Slovak, traditional Spanish</td> </tr> <tr> <th><div align="center"><img border="0" src="t-sup-h.gif" alt="t-sup-h" width="42" height="49"></div></th> <td>0074 02B0</td> <td rowspan="3">Native American languages</td> </tr> <tr> <th> <div align="center"><img border="0" src="where_x_dot_below.gif" alt="x dot below" width="19" height="35"></div></th> <td>0078 0323</td> </tr> <tr> <th> <div align="center"><img border="0" src="where_lambda_stroke_comma.gif" alt="lambda stroke comma" width="20" height="41"></div></th> <td>019B 0313</td> </tr> <tr> <th> <div align="center"><img border="0" src="where_a_acute_ogonek.gif" alt="a acute ogonek" width="22" height="47"></div></th> <td>00E1 0328</td> <td rowspan="2">Lithuanian</td> </tr> <tr> <th> <div align="center"><img border="0" src="where_i_dot_acute.gif" alt="i dot acute" width="14" height="38"></div></th> <td>0069 0307 0301</td> </tr> <tr> <th><div align="center"><img border="0" src="where_to_semi.gif" alt="semi" width="23" height="33"></div></th> <td>30C8 309A</td> <td>Ainu in kana transcription</td> </tr> </table> </center> </div> <p>Similarly, you won&#39;t find the Indic <i>half-forms</i> in the code charts, since they are formed with a consonant + halant (virama). 
For example:</p> <div align="center"> <center> <table class="simple"> <tr> <th colSpan="2"><div align="center">Representative shapes in code chart</div></th> <th><div align="center">Display appearance</div></th> </tr> <tr> <td align="center" width="33%"><div align="center"><img src="deltaF1.gif" border="0" width="57" height="40" alt="indic half forms"></div></td> <td align="center" width="33%"><div align="center"><img src="deltaF2.gif" border="0" width="38" height="55" alt="indic half forms"></div></td> <td align="center" width="33%"><div align="center"><img src="deltaF3.gif" border="0" width="46" height="40" alt="indic half forms"></div></td> </tr> </table> </center> </div> <p>Other Devanagari ligatures such as <i>ksha</i> are coded with sequences, as shown in <a href="https://www.unicode.org/versions/Unicode15.0.0/ch12.pdf#G63004"><i>Table 12-4: Sample Devanagari Half-Forms</i></a> of the core specification. For example:</p> <div align="center"> <center> <table class="simple"> <tr> <th colSpan="3"><div align="center">Representative shapes in code chart</div></th> <th><div align="center">Display appearance</div></th> </tr> <tr> <td align="center" width="25%"><div align="center"><img src="deltaF1.gif" border="0" width="57" height="40" alt="devanagari ligature"></div></td> <td align="center" width="25%"><div align="center"><img src="deltaF2.gif" border="0" width="38" height="55" alt="devanagari ligature"></div></td> <td align="center" width="25%"><div align="center"><img src="deltaF4.gif" border="0" width="40" height="39" alt="devanagari ligature"></div></td> <td align="center" width="25%"><div align="center"><img src="deltaF5.gif" border="0" width="42" height="42" alt="devanagari ligature"></div></td> </tr> </table> </center> </div> <p>In addition, the joining control characters can be used to request specific appearances, as in <a href="https://www.unicode.org/versions/Unicode15.0.0/ch12.pdf#G59257"><i>Figure 12-8</i></a> of the core specification. 
For example:</p> <div align="center"> <table class="simple"> <tr> <th colSpan="4"><div align="center">Representative shapes in code chart</div></th> <th><div align="center">Display appearance</div></th> </tr> <tr> <td align="center" width="20%"><div align="center"><img src="deltaF1.gif" border="0" width="57" height="40" alt="joining control character"></div></td> <td align="center" width="20%"><div align="center"><img src="deltaF2.gif" border="0" width="38" height="55" alt="joining control character"></div></td> <td align="center" width="20%"><div align="center"><img src="deltaF6.gif" border="0" width="55" height="75" alt="joining control character"></div></td> <td align="center" width="20%"><div align="center"><img src="deltaF4.gif" border="0" width="40" height="39" alt="joining control character"></div></td> <td align="center" width="20%"><div align="center"><img src="deltaF7.gif" border="0" width="69" height="38" alt="joining control character"></div></td> </tr> </table> </div> <p>Unfortunately there are not yet such detailed block descriptions for all Indic scripts, so it may not be clear exactly which sequences to use. These should be forthcoming in the future. In the meantime, sometimes you may get an answer if you ask on the general <a href="https://www.unicode.org/consortium/distlist.html">Unicode public e-mail list</a>.</p> <h2><a name="Duplicates">Duplicates</a></h2> <p>In some rare instances, you will find apparently identical characters. In most cases, if not all, this is to maintain compatibility with the original source standards for Unicode: vendor, national, and international character standards in wide usage in 1990. 
For example, there are duplicate encodings in the following case:</p> <div align="center"> <table class="simple"> <tr> <td width="1"><img border="0" src="A-ring.gif" alt="a ring" width="42" height="49"></td> <td>Capital letter A with ring</td> </tr> <tr> <td width="1"> <img border="0" src="A-ring.gif" alt="angstrom sign" width="42" height="49"></td> <td>Angstrom sign</td> </tr> </table> </div> <p>There are also particular shapes of characters that are given separate code points in Unicode, such as the shapes of the Arabic character <i>hah</i> listed above. These were also added to Unicode because of pre-existing standards.</p> <p>For compatibility with pre-existing standards, there are characters that are equivalently represented either as sequences of code points or as a single code point called a <i>composite character.</i> For example, the <i>i</i> with two dots in <i>naïve</i> could be represented either as the sequence <i>i</i> + <i>diaeresis</i> (0069 0308) or as the composite character <i>i with diaeresis</i> (00EF).</p> <p>There are other cases where the order of two combining characters does not matter. For example, the pair of combining characters <i> acute</i> and <i>dot-below</i> can occur with either one first; both alternate orders are equivalent. The rules for when order is significant are precisely spelled out by the Unicode Standard.</p> <p>Due to the requirements for uniqueness, especially on the Internet, Unicode provides for a unique format, called <i>Form C.</i> This format always picks one of the equivalent code points (or sequences of code points) and not the other. It also picks a specific order where there are alternatives. For more information, see <a href="https://www.unicode.org/reports/tr15/">UTR #15: Unicode Normalization Forms</a>.</p> <p>In a very few cases, Unicode separates glyphs as distinct characters on the basis of whether they are treated as letters or not. 
For example, the following characters are distinguished on this basis, even though the range of possible shapes is the same.</p> <div align="center"> <table class="simple"> <tr> <td width="1"><img border="0" src="prime.gif" alt="prime" width="39" height="38"></td> <td><i>Modifier letter prime.</i> Treated as a letter. Used to transcribe the &quot;soft&quot; sign in Cyrillic.</td> </tr> <tr> <td width="1"><img border="0" src="prime.gif" alt="prime" width="39" height="38"></td> <td><i>Prime. </i>Treated as a punctuation mark or symbol. Used in mathematics, and as a symbol for minutes (fractions of degrees).</td> </tr> </table> </div> <p>In those rare cases where this occurs, consult the text of the Unicode Standard to decide which character to use.</p> <h2><a name="Submissions">Submissions</a></h2> <p>Simply because a character or sequence of characters may have a different sorting order does <i>not</i> qualify it to be given a separate code point in Unicode. For more information, see <a href="https://www.unicode.org/reports/tr10/">UTR #10: Unicode Collation Algorithm</a>.</p> <p>Finally, your character may not yet be encoded in Unicode. There is a well-defined <a href="/pending/proposals.html"> submission process for new characters or scripts</a>. This process verifies that the proposed character is in fact a candidate for encoding. In some cases, this process may not be straightforward.</p> <p>Because the Unicode Standard and ISO 10646 are synchronized in character codes, both organizations need to agree to the encoding of new characters. 
This process can require some time before a new character is accepted into the standard, and some time beyond that before it is fully supported in products.</p> <hr width="50%"> <div align="center"> <center> <table cellspacing="0" cellpadding="0" border="0"> <tr> <td><a href="https://www.unicode.org/copyright.html"> <img src="https://www.unicode.org/img/hb_notice.gif" border="0" alt="Access to Copyright and terms of use" width="216" height="50"></a></td> </tr> </table> </center> </div> </blockquote> </td> </tr> </table> </td> </tr> </table> </body> </html>
