FAQ - Character Properties, Case Mappings and Names

<!doctype html> <html lang="en-us"> <head> <meta charset="utf-8"> <meta content="initial-scale=1.0" name="viewport"> <meta name="keywords" content="Character Properties, Case Mappings"> <meta name="description" content="Character Properties, Case Mappings and Names"> <title>FAQ - Character Properties, Case Mappings and Names</title> <link rel="stylesheet" href="https://www.unicode.org/webscripts/standard_styles.css"> <link rel="stylesheet" href="faq_styles_5.css"> </head> <body> <!-- BEGIN HEADER BAR --> <header> <nav> <a href="https://www.unicode.org/main.html">Tech Site</a> | <a href="https://www.unicode.org/sitemap/">Site Map</a> | <a href="https://www.unicode.org/search">Search</a> </nav> <div id="headercore"> <a href="https://www.unicode.org/"><img width="34" height="33" src="images/logo34x33.svg" alt="Unicode"></a> <a href="index.html">Frequently Asked Questions</a> </div> </header> <!-- END HEADER BAR --> <!-- BEGIN CONTENTS --> <main> <h1>Character Properties, Case Mappings &amp; Names FAQ</h1> <nav class="faqtoc"> </nav> <section> <h2><a id="charprop"></a>Character Properties</h2> <p class="q" id="30">Q: Where are Unicode character properties defined?</p> <p>The short answer is: in the <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_character_database">Unicode Character Database</a> (<a class="glossarylink" href="https://www.unicode.org/glossary/#ucd">UCD</a>).</p> <p>Several <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_technical_standard">Unicode Technical Standards</a> and <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_technical_report">Unicode Technical Reports</a> also define their own <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a>, which are listed separately. 
There is also the large collection of data specifically for Unified <a class="glossarylink" href="https://www.unicode.org/glossary/#ideograph">ideographs</a>, called the “Unihan” Database, which forms a separate subset of the Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#character_properties">character properties</a>. Its structure and contents differ significantly from the rest of the data, so it generally isn't included when talking about the “UCD”. <a href="https://www.unicode.org/faq/attribution.html#AF">[AF]</a></p> <p class="q" id="31">Q: What is the Unicode Character Database?</p> <p>The <a href="https://www.unicode.org/ucd/">Unicode Character Database</a> (<a class="glossarylink" href="https://www.unicode.org/glossary/#ucd">UCD</a>) is a collection of <a class="glossarylink" href="https://www.unicode.org/glossary/#plain_text">plain text</a> files, updated for every release of the Unicode Standard. Those plain text files contain information about the <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a> of every Unicode character. All files for the most up-to-date version of the UCD are always located at <a href="https://www.unicode.org/Public/UCD/latest/">https://www.unicode.org/Public/UCD/latest/</a> on the Unicode website. 
This location also includes the <a class="glossarylink" href="https://www.unicode.org/glossary/#unihan">Unihan</a> database.</p> <p class="q" id="32">Q: Where can I find documentation for the Unicode Character Database?</p> <p>The details can be found in <a href="https://www.unicode.org/reports/tr44/">UAX #44, Unicode Character Database.</a> (The <a class="glossarylink" href="https://www.unicode.org/glossary/#unihan">Unihan</a> database is documented in <a href="https://www.unicode.org/reports/tr38/">UAX #38, The Unicode Han Database (Unihan).</a>) See also the <a href="ucd.html">FAQ page</a>.</p> <p class="q" id="33">Q: Is there a database query interface for the UCD on the Unicode website?</p> <p>No. The <a class="glossarylink" href="https://www.unicode.org/glossary/#ucd">UCD</a> consists of multiple <a class="glossarylink" href="https://www.unicode.org/glossary/#plain_text">plain text</a> files containing raw <a class="glossarylink" href="https://www.unicode.org/glossary/#property">property</a> data. Those files are suitable for conversion to database formats, import into spreadsheets, conversion to tables, or whatever format may be appropriate for a particular implementer's needs. They are not stored in a relational database management system (RDBMS), and the Unicode website does not support a front end for an arbitrary database query.</p> <p>Other websites do present simple front ends for database queries for some Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#character_properties">character properties</a>. See, for example: <a href="https://www.fileformat.info/info/unicode/index.htm">https://www.fileformat.info<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>.</p> <p>The subset of character properties related to Chinese characters (<a class="glossarylink" href="https://www.unicode.org/glossary/#CJK">CJK</a>) is a special case. The Unicode website <i>does</i> have a web database query interface for those character properties. 
See the <a href="https://www.unicode.org/charts/unihan.html">Unihan Database</a>.</p> <p class="q" id="34">Q: Are there other Unicode tools for working with character properties?</p> <p>Yes. The <a href="https://util.unicode.org/">Unicode Utilities subsite</a> also implements a front end with a number of useful utilities for querying characters. One tool allows the input of arbitrary sets of characters using the UnicodeSet format, and shows the resulting explicit list as output. See, for example: <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp">https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp</a>.</p> <p class="q" id="17">Q: Are any unassigned characters or reserved characters given default properties?</p> <p>Default values are defined for all <a class="glossarylink" href="https://www.unicode.org/glossary/#character_properties">character properties</a>. For a discussion of how this works and details about particular default values for <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a>, see <a href="https://www.unicode.org/reports/tr44/#Default_Values">UAX #44, Unicode Character Database</a>.</p> <p class="q" id="18">Q: Unicode now treats the <span class="name">SOFT HYPHEN</span> as format control (Cf) character when formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1 compatibility?</p> <p>No. The ISO 8859-1 standard defines the <span class="name">SOFT HYPHEN</span> as "[a] <a class="glossarylink" href="https://www.unicode.org/glossary/#graphic_character">graphic character</a> that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. 
The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.</p> <p>Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.</p> <p>In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the <span class="name">SOFT HYPHEN</span> as a hyphen in all circumstances. The change in semantics of the Unicode character does not require that implementations of terminal emulators in other environments, such as ISO 8859-1, make any change in their current behavior.</p> <p class="q" id="19">Q: Where can I find the numerical values of characters with the hexadecimal digit (Hex_Digit) property?</p> <p>The Unicode Standard provides the Hex_Digit <a class="glossarylink" href="https://www.unicode.org/glossary/#property">property</a>, which specifies which characters are hexadecimal <a class="glossarylink" href="https://www.unicode.org/glossary/#digits">digits</a>: 0-9, A-F, a-f, and their <a class="glossarylink" href="https://www.unicode.org/glossary/#fullwidth">fullwidth</a> equivalents. (The ASCII_Hex_Digit property specifies the intersection of the Hex_Digit property and the Basic Latin <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a>.) There is no table in the <a class="glossarylink" href="https://www.unicode.org/glossary/#ucd">UCD</a> mapping the hexadecimal digit characters to their values, analogous to the Numeric_Value property. <a href="https://www.unicode.org/faq/hex-digit-values.txt">The table linked here</a> removes this real, if trivial, gap. 
<a href="https://www.unicode.org/faq/attribution.html#JC">[JC]</a></p> <p class="q" id="20">Q: How does Unicode cope with hexadecimal digits?</p> <p>The hexadecimal number system, used in computing, is not that special: you can base a number system on any natural number greater than 1. The most widely used base is 10, but 2, 8, and 12 have also seen extensive use as number bases, whether in computing or archaic mathematics. Hence, it is not wise to define a particular set of <a class="glossarylink" href="https://www.unicode.org/glossary/#digits">digits</a> for every number system somebody might wish to apply.</p> <p>Rather, the Unicode character encoding, much like its predecessors, assumes that hexadecimal numbers are written with the ordinary (decimal) digits (representing zero through nine) and the letters A through F (representing ten through fifteen). Only context makes clear whether a string of digits is meant as a number and, if so, in which number system.</p> <p>Most applications have defined particular syntax rules to help distinguish decimal, octal, or hexadecimal numbers from other input tokens. For example, in some programming languages, “2010” is a decimal number, “0x7DA” is a hexadecimal number, and “thisYear” is an identifier. 
In the absence of such syntactic hints, you could use the Hex_Digit <a class="glossarylink" href="https://www.unicode.org/glossary/#property">property</a> from the <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_character_database">Unicode Character Database</a> to identify hexadecimal numbers; however, a string of Hex_Digit characters, such as “bed”, is not necessarily meant to be read as a hexadecimal number.</p> <p>Whenever it is important that hexadecimal numbers in a table align vertically, you should choose a fixed-pitch <a class="glossarylink" href="https://www.unicode.org/glossary/#font">font</a> for them by means of a <a class="glossarylink" href="https://www.unicode.org/glossary/#higher_level_protocol">higher-level protocol</a>. Some fonts will also show the <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> hexadecimal digits at the same height as the digits. Such a font is used in the Unicode code charts to give 4- and 5-digit hexadecimal numbers a nice rectangular appearance.</p> <p class="q" id="14">Q: Where are private-use characters used, and how should they be handled?</p> <p>This is the topic of the <a href="https://www.unicode.org/faq/private_use.html#privateuse">Private-Use Characters FAQ</a>, which answers many questions about the handling of <a class="glossarylink" href="https://www.unicode.org/glossary/#private_use_character">private-use characters</a>.</p> </section> <section> <h2><a id="casemap"></a>Case Mapping</h2> <p class="q" id="1">Q: Where can I find the Unicode case mapping information?</p> <p>The <a href="https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"> UnicodeData.txt</a> file includes all of the one-to-one <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a>. 
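</p> <p>As a rough illustration of the hexadecimal digit values discussed above, the following Python sketch builds a value table for the Hex_Digit characters (the ASCII digits and letters plus their fullwidth counterparts) and uses it to read a digit string. The names <code>build_hex_digit_values</code>, <code>HEX_DIGIT_VALUES</code>, and <code>parse_hex</code> are hypothetical and not part of the UCD or of any library; the code point ranges are written out by hand rather than read from the UCD files.</p>

```python
def build_hex_digit_values():
    """Map each Hex_Digit character to its numeric value (0 through 15)."""
    values = {}
    # DIGIT ZERO and FULLWIDTH DIGIT ZERO start the two runs of ten digits.
    for base in (0x0030, 0xFF10):
        for i in range(10):
            values[chr(base + i)] = i
    # A, a, FULLWIDTH A, and FULLWIDTH a start the four runs of six letters.
    for base in (0x0041, 0x0061, 0xFF21, 0xFF41):
        for i in range(6):
            values[chr(base + i)] = 10 + i
    return values


HEX_DIGIT_VALUES = build_hex_digit_values()


def parse_hex(text):
    """Return the integer value of a non-empty string of Hex_Digit characters."""
    if not text:
        raise ValueError("empty string")
    result = 0
    for ch in text:
        if ch not in HEX_DIGIT_VALUES:
            raise ValueError(f"not a hexadecimal digit: {ch!r}")
        result = result * 16 + HEX_DIGIT_VALUES[ch]
    return result


print(parse_hex("7DA"))                 # 2010
print(parse_hex("\uFF17\uFF24\uFF21"))  # fullwidth 7DA, also 2010
print(parse_hex("bed"))                 # 3053, whether or not a number was meant
```

<p>Note that the sketch deliberately rejects a prefix such as “0x”: recognizing that kind of syntax is up to the application, not the character properties.</p> <p>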
Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file <a href="https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt">SpecialCasing.txt</a> was added to provide information on exceptional one-to-many mappings, such as the one needed for uppercasing ß (U+00DF <span class="name">LATIN SMALL LETTER SHARP S</span>). In addition, <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a> contains additional mappings used in <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a> and caseless matching. For more information, see <a href="https://www.unicode.org/versions/latest/core-spec/#G21180">Section 5.18, Case Mappings</a> in <em>The Unicode Standard</em>.</p> <p class="q" id="2">Q: What is the difference between case mapping and case folding?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">Case mapping</a> or case conversion is a process whereby strings are converted to a particular form &mdash; <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a>, <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a>, or <a class="glossarylink" href="https://www.unicode.org/glossary/#titlecase">titlecase</a> &mdash; possibly for display to the user. <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">Case folding</a> is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. 
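</p> <p>The difference can be seen in a small Python sketch, which assumes that Python's built-in <code>str.lower()</code> and <code>str.casefold()</code> stand in for the default lowercase mapping and the default case folding. The helper name <code>caseless_equal</code> is hypothetical, and a complete caseless match would usually also involve normalization, which is omitted here.</p>

```python
# Three spellings that a caseless comparison should treat as the same word.
words = ["Stra\u00DFe", "STRASSE", "strasse"]  # \u00DF is LATIN SMALL LETTER SHARP S

# Lowercasing keeps the sharp s, so two distinct results remain.
lowered = {w.lower() for w in words}

# Case folding maps the sharp s to "ss", so all three collapse to one form.
folded = {w.casefold() for w in words}


def caseless_equal(a, b):
    """Compare two strings by their case-folded forms (no normalization)."""
    return a.casefold() == b.casefold()


print(sorted(lowered))
print(sorted(folded))
print(caseless_equal("Stra\u00DFe", "STRASSE"))
```

<p>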
As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.</p> <p class="q" id="3">Q: Which scripts have an uppercase and a lowercase?</p> <p>The most widely used modern scripts with case are Latin, Greek, Armenian and Cyrillic. In addition there are a few historic or archaic scripts that have case. The vast majority of scripts, modern or archaic, do not have case distinctions.</p> <p class="q" id="4">Q: What is titlecase? How is it different from uppercase?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#titlecase">Titlecase</a> takes its name from the case format used when forming a title, in which the initial letter in a word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping applied to the initial character in a word.</p> <p>The titlecase mapping in Unicode differs from the <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> mapping in that a number of characters require special handling. These are chiefly <a class="glossarylink" href="https://www.unicode.org/glossary/#ligature">ligatures</a> and <a class="glossarylink" href="https://www.unicode.org/glossary/#digraph">digraphs</a> such as 'fl', 'dz', and 'lj', plus a number of <a class="glossarylink" href="https://www.unicode.org/glossary/#polytonic">polytonic</a> Greek characters. For example, U+01C7 (LJ) maps to U+01C8 (Lj) rather than to U+01C9 (lj).</p> <p class="q" id="5">Q: Does the default case mapping work for every language? What about the default case folding?</p> <p>The Unicode Standard defines the default <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mapping</a> for each individual character, with each character considered in isolation. 
This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.</p> <p>By contrast, <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a>, which is primarily based on the <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> mapping, is intended to be language-neutral. Because the case folding rules do not vary by language or context, they are unsuitable as the basis for displaying or transforming text for human consumption.</p> <p>To make case mapping language-sensitive, the Unicode Standard specifically allows implementations to tailor the mappings for each language, but does not provide the necessary data. The file <a href="https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt">SpecialCasing.txt</a> is included in the Standard as a guide to a few of the more important individual character mappings needed for specific languages, notably the Greek script and the Turkic languages. However, for most language-specific mappings and tailoring, users should refer to <a href="http://cldr.unicode.org">CLDR</a> and other resources.</p> <p class="q" id="6">Q: What is 'tailoring' and how might it affect case mapping?</p> <p>Tailoring is the modification of the <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mapping</a> rules to meet the specific needs of a given language, culture, or orthography. For example, while the default <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> mapping of “a” is “A” and the default mapping of “à” is “À”, the uppercase conversion of “je vais à Paris” in some forms of French might be “JE VAIS A PARIS”. 
Notice how the “à” is uppercased as "A" in this case.</p> <p>Similarly, in English, one of Proust's novels is rendered in <a class="glossarylink" href="https://www.unicode.org/glossary/#titlecase">titlecase</a> as “<em>In Search of Lost Time</em>”. Notice that the 'o' in 'of' is not capitalized, although the remainder of the words follow the Unicode Standard's definition of titlecase: this is an English-specific tailoring of titlecase. The original French title of this work is rendered in titlecase as “<em>À la recherche du temps perdu</em>”. Here, only the first word is in the default titlecase; the others follow rules specific to a particular French convention.</p> <p class="q" id="6a">Q: Why isn't there an “Ij” character encoded to serve as the titlecase for U+0132 <span class="name">LATIN CAPITAL LETTER IJ</span> and U+0133 <span class="name">LATIN SMALL LETTER IJ</span>?</p> <p>The Unicode Standard encodes these two <a class="glossarylink" href="https://www.unicode.org/glossary/#compatibility_character">compatibility characters</a> to provide support for roundtrip conversion of the Dutch letter 'ij' in certain very rare legacy (non-Unicode) character encodings. It is strongly preferred (and far more common) to use the two character <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a> sequence 'ij' to represent this letter instead.</p> <p>In Dutch, the letter 'ij' behaves like the other single letters, so the correct <a class="glossarylink" href="https://www.unicode.org/glossary/#titlecase">titlecase</a> mapping of U+0133 (ij) is U+0132 (so a word such as “ijsje” titlecases as “IJsje”). That is, the titlecase mapping for both of these characters is U+0132 and no additional character is needed.</p> <p class="q" id="7">Q: Are case mappings for words or text runs reversible?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">Case mapping</a> loses information and thus does not allow for a round trip. 
For example, when the string “Mark” is lowercased, the original form cannot be recovered; it might have been “mark” or “MARK” instead. Some strings contain contextual case distinctions that are not preserved by case mapping. Consider the English word “anglo-American”, the Italian word “vederLa”, or the German words “haben” and “Haben”. Once you <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a>, <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> or <a class="glossarylink" href="https://www.unicode.org/glossary/#titlecase">titlecase</a> these strings, you can&#39;t recover the original just by performing the reverse operation.</p> <p class="q" id="7a">Q: Are case mappings for individual characters reversible?</p> <p>Many of the individual character <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a> cannot be reversed. For example:</p> <ul> <li>Some characters have multiple characters that map to them. For example, in the Greek script, capital sigma (U+03A3) is the <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> form of both the regular (U+03C3) and final (U+03C2) <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> sigma.</li> <li>Some character mappings result in a <a class="glossarylink" href="https://www.unicode.org/glossary/#decomposition">decomposition</a>. For example, the 'fl' <a class="glossarylink" href="https://www.unicode.org/glossary/#ligature">ligature</a> (U+FB02 <span class="name">LATIN SMALL LIGATURE FL</span>) uppercases to 'F' followed by 'L'.</li> <li>Some case mappings depend on language or locale. 
For example, in Turkish, the lowercase letter 'i' maps to an uppercase dotted I (U+0130 <span class="name">LATIN CAPITAL LETTER I WITH DOT ABOVE</span>), while the uppercase letter 'I' maps to the dotless lowercase i (U+0131).</li> </ul> <p class="q" id="8">Q: Does uppercasing of a string eliminate all of the lowercase letters in it?</p> <p>No. Some letters (notably those in the <a class="glossarylink" href="https://www.unicode.org/glossary/#IPA">IPA</a> <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a>) have no matching case equivalent. As a result, uppercasing a string may not eliminate all of the <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> letters in it.</p> <p class="q" id="10">Q: Why is there no unique uppercase character for ſ — U+017F <span class="name">LATIN SMALL LETTER LONG S</span> (and about one hundred other characters)?</p> <p>There are over 100 <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> letters in the Unicode Standard that have no direct <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> equivalent. For example, the uppercase form for <em>long s</em> is an ordinary capital S. Another example would be U+0237 <span class="name">LATIN SMALL LETTER DOTLESS J</span>: the capital J is already dotless, so an extra letter isn't needed as an uppercase mapping. Some of the other characters with no uppercase equivalent are <a class="glossarylink" href="https://www.unicode.org/glossary/#compatibility_character">compatibility characters</a>. Many of these, such as 'fl' (U+FB02 <span class="name">LATIN SMALL LIGATURE FL</span>), decompose to two or more characters when casing is applied. Finally, others are characters that are only used in lowercase, such as many characters used for <a class="glossarylink" href="https://www.unicode.org/glossary/#IPA">IPA</a> and other phonetic systems. 
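</p> <p>A few of the cases mentioned above can be checked with a short Python sketch. It assumes that Python's built-in <code>str.upper()</code> applies the default, untailored uppercase mappings; the helper name <code>remaining_lowercase</code> is hypothetical.</p>

```python
LONG_S = "\u017F"       # LATIN SMALL LETTER LONG S
FL_LIGATURE = "\uFB02"  # LATIN SMALL LIGATURE FL
DOTLESS_J = "\u0237"    # LATIN SMALL LETTER DOTLESS J

# Long s shares its uppercase form with the ordinary letter s.
print(LONG_S.upper() == "S")

# The ligature expands to two characters when uppercased.
print(FL_LIGATURE.upper() == "FL")


def remaining_lowercase(text):
    """Return the lowercase letters that survive uppercasing of text."""
    return [ch for ch in text.upper() if ch.islower()]


# Dotless j has no uppercase mapping, so it stays lowercase.
print(remaining_lowercase("a" + DOTLESS_J + "b") == [DOTLESS_J])
```

<p>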
Text in IPA, like that in many other phonetic systems, should never be case converted, even for those IPA characters that do have an uppercase equivalent.</p> <p class="q" id="9">Q: Why aren&#39;t there extra characters encoded to support locale-independent casing for Turkish?</p> <p>The Turkish language, like other Turkic languages, distinguishes a dotted letter 'i' from a dotless letter 'ı' (U+0131 <span class="name">LATIN SMALL LETTER DOTLESS I</span>). In these languages, each has an equivalent <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> mapping: U+0131 maps to the ordinary letter 'I', while 'i' maps to U+0130 (<span class="name">LATIN CAPITAL LETTER I WITH DOT ABOVE</span>).</p> <p>Historically, users generally did not distinguish between the <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a> letters and their Turkish equivalents, so legacy character encodings, such as ISO 8859-9, which support the Turkic languages, did not separately encode characters to serve as the basis for locale-independent casing. These character encodings are often used for both Turkish and non-Turkish text. <a class="glossarylink" href="https://www.unicode.org/glossary/#transcoding">Transcoding</a> this data to Unicode would be intolerably difficult if users had to somehow identify which 0x49 characters (for example) were ordinary “I” and which were <span class="name">LATIN CAPITAL LETTER DOTLESS I</span>. In addition, because users are not used to making the distinction, it is unlikely that they would input the “correct” additional letters, even if they existed.</p> <p class="q" id="11">Q: Why does ß (U+00DF <span class="name">LATIN SMALL LETTER SHARP S</span>) not uppercase to U+1E9E <span class="name">LATIN CAPITAL LETTER SHARP S</span> by default?</p> <p>In standard German orthography, the <em>sharp s</em> (“ß”) used to be exclusively uppercased to a sequence of two capital S characters. 
This longstanding practice is reflected in the default <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a> in Unicode. A capital form of ß is sometimes preferred for typographic reasons or to avoid ambiguity, such as in <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> names as found in passports. It is encoded in the Unicode Standard as U+1E9E. While this character is not widely used, it is now recognized in the official orthography as an optional uppercase form of ß in addition to “SS”. Because it is only an optional alternative, the original mapping to “SS” is retained in the Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#character_properties">character properties</a>.</p> <p class="q" id="12">Q: Why does the Greek letter sigma require special handling?</p> <p>Near the end of SpecialCasing.txt, there are two commented-out lines pertaining to the Greek letter sigma. At first glance, they may look a bit odd:</p> <pre>
# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA
</pre> <p>Both of these lines refer to conditional <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a> (column 5). In normal Greek text, a U+03C3 (non-final sigma) should be written as U+03C2 (final sigma) if it is at the end of a word, and a U+03C2 (final sigma) should be written as a U+03C3 (non-final sigma) if it is not at the end of a word. This is what these two lines would mean if they were uncommented. 
The reason that they <em>are</em> commented out is that the SpecialCasing file is not intended to normalize the appearance of a <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> sigma.</p> <p class="q" id="13">Q: Is case folding stable between Unicode versions?</p> <p>Any string that is case-folded according to the rules in Version 5.0 or later is guaranteed to still be case-folded according to the rules for any subsequent version of the Unicode Standard. For the formal statement of that stability guarantee, see the <a href="https://www.unicode.org/policies/stability_policy.html#Case_Folding">Case Folding Stability Policy</a>.</p> <p class="q" id="13a">Q: Does case folding stability prevent the encoding of new case pairs?</p> <p>For a newly encoded <a class="glossarylink" href="https://www.unicode.org/glossary/#bicameral">bicameral</a> (cased) script or for completely new case pairs, there are no restrictions that result from <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a> stability. Because such scripts or characters had not yet been encoded in earlier versions of the standard, they also had no case folding yet defined for them. New scripts or completely new case pairs can be added freely in future versions.</p> <p class="q" id="13b">Q: Can a case pair be added if one of the pair is already encoded?</p> <p>The usual situation is to add a new <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> letter intended to have a <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mapping</a> to an existing <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> letter that had no case pair before. 
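</p> <p>To see how sigma behaves in one implementation, here is a small Python sketch. It relies on CPython's built-in string methods, which choose the final form when lowercasing a capital sigma at the end of a word but leave existing lowercase sigmas untouched; other implementations may behave differently. The variable names are purely illustrative.</p>

```python
CAPITAL_SIGMA = "\u03A3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # GREEK SMALL LETTER FINAL SIGMA

# Both lowercase forms share one uppercase form.
print(SMALL_SIGMA.upper() == CAPITAL_SIGMA, FINAL_SIGMA.upper() == CAPITAL_SIGMA)

# Lowercasing a word that ends in capital sigma yields the final form.
word_upper = "\u039F\u0394\u039F" + CAPITAL_SIGMA
word_lower = word_upper.lower()
print(word_lower.endswith(FINAL_SIGMA))

# A non-final sigma already at the end of a word is not rewritten by lower().
odd_spelling = "\u03BF\u03B4\u03BF" + SMALL_SIGMA
print(odd_spelling.lower() == odd_spelling)

# Case folding maps the final form to the regular one for caseless matching.
print(FINAL_SIGMA.casefold() == SMALL_SIGMA)
```

<p>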
Because <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a> is primarily based on the lowercase mapping, adding a new uppercase letter like this is fine&#x2014;the case folding will be specified as mapping to the existing lowercase letter, and case folding stays stable.</p> <p class="q" id="13c">Q: What happens if the <i>uppercase</i> letter is the one that is already encoded?</p> <p>That situation is more complicated. When the existing encoded letter is an <i><a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a></i> letter and the proposal is to encode a new <i><a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a></i> letter case pair for it, that is normally disallowed. The <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a> for the existing uppercase letter would change, and that is blocked by the requirement for case folding stability. In exceptional situations, if a lowercase letter must be added, it would need to be case-folded to the existing <i>uppercase</i> letter, rather than changing the case folding for that existing letter. Such an exceptional situation did, in fact, apply for the addition of Cherokee lowercase <a class="glossarylink" href="https://www.unicode.org/glossary/#syllable">syllables</a> in Version 8.0. 
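</p> <p>The Cherokee case can be observed with a short Python sketch. It assumes a Python build whose Unicode data is at Version 8.0 or later (reported by <code>unicodedata.unidata_version</code>) and that <code>str.casefold()</code> follows the default case folding; the variable names are illustrative only.</p>

```python
import unicodedata

UPPER_A = "\u13A0"  # CHEROKEE LETTER A (the originally encoded, uppercase form)
LOWER_A = "\uAB70"  # CHEROKEE SMALL LETTER A (added in Version 8.0)

print("Unicode data version:", unicodedata.unidata_version)

# The ordinary case mappings pair the two letters with each other.
print(UPPER_A.lower() == LOWER_A, LOWER_A.upper() == UPPER_A)

# Case folding goes to the uppercase letter, so the fold of the
# older letter is the same as it was before the small letters existed.
print(LOWER_A.casefold() == UPPER_A, UPPER_A.casefold() == UPPER_A)
```

<p>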
Cherokee case folding rules were specified to map to the old <i>uppercase</i> syllables, to preserve case folding stability for them.</p> <p class="q" id="13d">Q: What about the situation where both characters are already encoded, but should be case-folded together?</p> <p>Changing an existing character to case-fold to a different character is prohibited, for stability, so this cannot be done.</p> <p class="q" id="13e">Q: Why are U+2126 <span class="name">OHM SIGN</span> and U+00B5 <span class="name">MICRO SIGN</span> cased like omega and mu?</p> <p>When the text "Resistance is 950μΩ" is subjected to some of the CSS text transforms, it is displayed as:</p> <ul> <li>text-transform: <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a>: RESISTANCE IS 950ΜΩ</li> <li>text-transform: <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a>: resistance is 950μω</li> <li>text-transform: capitalize: Resistance Is 950μΩ</li> </ul> <p>On the face of it, that seems undesirable, since it changes the meaning of the text. This raises the question why U+2126 <span class="name">OHM SIGN</span> and U+00B5 <span class="name">MICRO SIGN</span> were not classified as "Symbol, Other", and not assigned case equivalents.</p> <p>The Unicode Standard does not guarantee that transforming text (with the exception of <a class="glossarylink" href="https://www.unicode.org/glossary/#normalization">normalization</a>) will not affect its meaning. <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a> letters used for <a class="glossarylink" href="https://www.unicode.org/glossary/#SI_unit">SI units</a> are not exempt from casing, and also change meaning with case: 1ms = 1 millisecond, whereas 1MS = 1 megasiemens. 
More generally, applying <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a> to technical text rather than “normal language” is a mistake, and cannot be fixed in the encoding nor via <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a>. Further, case mappings are lossy even on normal text (lowercasing iPod or McGowan; or any noun in German; uppercasing Irish).</p> <p>Not all letters for SI units have duplicates, which is why the few that were introduced separately have been made canonically equivalent to standard letters. In particular, μ and Ω normalize to their standard Greek counterparts, which means that treating them differently is not possible. This way, all SI units are treated the same.</p> </section> <section> <h2><a id="nameprop"></a>Character Names</h2> <p class="q" id="22">Q: What are character names for and why are some characters named in unusual ways?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">Character names</a> are defined so that a mnemonic string can be used to uniquely identify a character, rather than representing it with just a hexadecimal code. Characters can have multiple uses or multiple common names, so a single identifier cannot provide a natural name for all users and all purposes. Sometimes, names are deliberately chosen to describe the appearance of a character, rather than its meaning or function, because the character is used in many competing contexts. Such use of descriptive names is particularly common for symbols.</p> <p>Because character names are identifiers, there are some additional restrictions and conventions, which govern the way they are assigned and provide some uniformity in naming. 
In many instances, descriptive comments and <a class="glossarylink" href="https://www.unicode.org/glossary/#informative">informative</a> aliases are added to the listing of the character names in the code charts to make it easier for users to select the right character for the right purpose.</p> <p class="q" id="23">Q: Can the name of a character be changed to better reflect the way it is used?</p> <p>Once a <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character name</a> has been given, it cannot be changed. Because names are identifiers, for which stability is very important, the <a href="https://www.unicode.org/policies/stability_policy.html">Unicode Character Encoding Stability Policy</a> explicitly prevents character names from being changed. Character names, however, can be annotated in the code charts. For example, U+0674 <span class="name">ARABIC LETTER HIGH HAMZA</span> is annotated as being used for Kazakh, not Arabic. For outright erroneous names, a formal alias may be provided (in <a href="https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt">NameAliases.txt</a>), that gives a corrected alias for the character. This alias can be used anywhere a character name can be used, but it does not replace the actual name. In limited cases, a widely used alternate name or a common abbreviation may likewise be given as an alias.</p> <p class="q" id="24">Q: Should I be concerned if the name of a script, block or character doesn't reflect the way it is used?</p> <p>Script, <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a>, and <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character names</a> are used by Unicode solely as identifiers; that is, their purpose is to <i>distinguish</i> entities and not to <i>describe</i> them. 
Changes to these names create extensive interoperability and backward <a class="glossarylink" href="https://www.unicode.org/glossary/#compatibility">compatibility</a> issues. There is usually a relationship between the name of a block, the name of the script that uses characters in that block, and the names of the characters themselves in order to ease identification.</p> <p>The use of a particular name as an identifier for a script in the Unicode Standard does not imply an endorsement of that name as the best alternative for general use. The <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_consortium">Unicode Consortium</a> does not make recommendations on how to refer to scripts in other contexts.</p> <p class="q" id="25">Q: How are script and block names related to character names?</p> <p>Many <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character names</a> contain a <a class="glossarylink" href="https://www.unicode.org/glossary/#script_designator">script designator</a>. For example, many characters in the Telugu script contain the word "<span class="name">TELUGU</span>" in the first part of their names. This script designator is based on the name of the script, in this case "Telugu". For consistency, the script name is also reflected in <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a> names, whenever blocks contain characters primarily of one script.</p> <p class="q" id="26">Q: What are the script names in the Unicode Standard based on?</p> <p>In nearly all cases, the script names are based on common English usage. When there are important alternative names for scripts, they are often provided as <a class="glossarylink" href="https://www.unicode.org/glossary/#annotation">annotations</a> in the code charts and documentation. 
For example, the New Tai Lue script is referred to in China as Xishuang Banna Dai, which is listed as an alternative name in the <a href="https://www.unicode.org/charts/">code charts</a>. The local name for a script may differ from English usage. Translated versions of the <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character names</a> list would use translations of the script names and designators and follow local usage.</p> <p class="q" id="27">Q: Can I determine the script of a character by the character or block name?</p> <p>No, not at all. The <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character names</a> and <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a> names are not reliable indicators of the script of a character. The <a class="glossarylink" href="https://www.unicode.org/glossary/#script_property">Script property</a> should be used instead to determine the script of any particular character. 
For example, as of Unicode 6.0 there were the following mismatches between Script property value and character or block names for Latin and Greek:</p> <ul> <li>149 characters have the Latin Script property value, but do not have “<span class="name">LATIN</span>” in their character names.</li> <li>280 characters that have “<span class="name">LATIN</span>” in their character names do not have the Latin Script property value.</li> <li>17 characters have the Greek Script property value, but are not in blocks that have “Greek” in the block name.</li> <li>66 characters that are in blocks that have “Greek” in the block name do not have the Greek Script property value.</li> </ul> <p>For more information, see <a href="https://www.unicode.org/reports/tr24/">UAX #24: Unicode Script Property</a>.</p> <p class="q" id="28">Q: Are there any tools available to convert character values to character names, or to tell me the script of a character?</p> <p>Yes, there are several such tools listed on the <a href="https://www.unicode.org/resources/online-tools.html">Online Tools</a> page of the Unicode site. Here are a few you might like to try.</p> <p>Web-based lookup:</p> <ul> <li>Mark Davis&#39; <a href="https://util.unicode.org/UnicodeJsps/">CLDR tools</a> handle many kinds of conversions, etc.</li> <li>Richard Ishida&#39;s <a href="https://r12a.github.io/app-analysestring/">String Analyser<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> tells you the name and <a class="glossarylink" href="https://www.unicode.org/glossary/#block">block</a> of one or more characters. 
If you are starting with character codes, use the “View Names” button above the “Characters” text box in his <a href="https://r12a.github.io/app-conversion/">Unicode Code Converter<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> or use <a href="https://r12a.github.io/uniview/">UniView.<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></li> <li>Andrew West’s <a href="http://www.babelstone.co.uk/Unicode/whatisit.html">What Unicode character is this?<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> JavaScript tool converts Unicode characters to Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#character_name">character names</a>.</li> <li>Johannes Bergerhausen&#39;s <a href="http://www.decodeunicode.org/">DecodeUnicode</a> allows you to look up information about characters.</li> </ul> <p>Downloadable application code:</p> <ul> <li>The uniname utility in Bill Poser’s <a href="http://billposer.org/Software/unidesc.html">unidesc<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> package tells you the name of a character.</li> <li>Tom Christiansen&#39;s <a href="http://search.cpan.org/~bdfoy/Unicode-Tussle-1.03/lib/Unicode/Tussle.pm">Unicode::Tussle<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> (<a href="https://github.com/briandfoy/Unicode-Tussle">download<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>) is a distribution of curious and various Unicode utilities.</li> <li>The orphaned <a href="https://bitbucket.org/jsbien/unihistext">program<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> of a student of Janusz S. 
Bien also handles named sequences.</li> </ul> <p>A simple standard Perl program may be what you want, for example to view the name of character U+1234:</p> <pre> <code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$ perl -e &#39;use charnames (); print charnames::viacode(0x1234), &quot;\n&quot;&#39;</code></pre> <p>See the <a href="https://www.unicode.org/resources/online-tools.html">Online Tools</a> page for more links.</p> <p class="q" id="15">Q: The character name alias for the control character U+0082 is <span class="name">BREAK PERMITTED HERE</span>. Does that mean I have to interpret that control character in that way?</p> <p>The Unicode Standard does not define U+0082 to mean “<span class="name">BREAK PERMITTED HERE</span>”. Formally this character is simply one of 65 <a class="glossarylink" href="https://www.unicode.org/glossary/#control_codes">control codes</a>, one that in ISO 6429 has the name and meaning of “<span class="name">BREAK PERMITTED HERE</span>”. Implementers of the Unicode Standard are not required to interpret the character U+0082 in accordance with ISO 6429 (or to interpret it at all).</p> <p>The standard does assign particular <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a> and semantics for certain controls commonly used in text files, including tab, carriage return, line feed, form feed, and next line. However, it does not give the majority of control codes any semantics at all; that is left to a <a class="glossarylink" href="https://www.unicode.org/glossary/#higher_level_protocol">higher-level protocol</a>.</p> <p>The character <i>names</i> for control characters are actually undefined; however, name aliases, such as “<span class="name">BREAK PERMITTED HERE</span>”, have been defined. These aliases are based on ISO 6429, and can be used to identify specific controls, for example in regular expressions. 
For other control characters see <a href="https://www.unicode.org/charts/PDF/U0080.pdf">https://www.unicode.org/charts/PDF/U0080.pdf</a>.</p> <p class="q" id="16">Q: Where can I find formal definitions of the terms used in character names? In particular designations like “turned”, “inverse”, “inverted”, “reversed”, “rotated”.</p> <p>These terms are basically typographical rather than Unicode-specific.</p> <p>A <em>turned character</em> is one that has been rotated 180 degrees around its center. A turned “e” winds up with the opening in the upper left portion. U+0259 <span class="name">LATIN SMALL LETTER SCHWA</span> is a turned “e”.</p> <p>An <em>inverted character</em> has been flipped along the horizontal axis. An inverted “e” winds up with the opening in the upper right portion. There is no Unicode character representing an inverted “e”.</p> <p>A <em>reversed character</em> has been flipped along the vertical axis. A reversed “e” winds up with the opening in the lower left portion. U+0258 <span class="name">LATIN SMALL LETTER REVERSED E</span> is a reversed “e”.</p> <p>A <em>rotated character</em> has been rotated 90 degrees, but one cannot tell which way without looking at the <a class="glossarylink" href="https://www.unicode.org/glossary/#glyph">glyph</a>. U+213A <span class="name">ROTATED CAPITAL Q</span> is a “Q” that has been rotated counterclockwise.</p> <p><em>Inverse</em> means that the white parts of the glyph are made black, and vice versa. An inverse “e” looks like a normal “e” but is white on a black background. There is no Unicode character representing an inverse “e”. 
<a href="https://www.unicode.org/faq/attribution.html#JC">[JC]</a></p> <p class="q" id="21">Q: Why is the hacek accent called “caron” in Unicode?</p> <p>Nobody knows.</p> <p>Legend has it that the term was first spotted in one of the &#39;giant books&#39; from the 1930s at Mergenthaler Linotype Company in Brooklyn, NY, but no one has been able to confirm that.</p> <p>More accurate reports trace the term back to the mid 1980s where we do have documented sightings of “caron” in publications such as:</p> <ul> <li>The TypEncyclopedia by Frank Romano, ISBN: 0-835-21925-9, Libraries Unlimited; 1984<br>p. 6 shows the mark with the notation “caron/hacek/clicka”</li> <li>IBM&#39;s Green Book which has an original copyright date of 1986.<br>“Caron Accent” appears on p. K-432, in a table entitled “Diacritic Mark Special <a class="glossarylink" href="https://www.unicode.org/glossary/#graphic_character">Graphic Characters</a>.”<br>National Language Support Reference Manual. 4th ed. 1994. (National Language Design Guide, 2)</li> <li><a class="glossarylink" href="https://www.unicode.org/glossary/#SGML">SGML</a> documentation from ISO 8879:1986, see <a href="http://www.w3.org/2003/entities/iso8879doc/isodia.html">isodia.html<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></li> </ul> <p>Unicode and the ISO 8859 series of standards just carried the tradition along.</p> <p>In an article published in 2001: “<a href="http://www.phon.ucl.ac.uk/home/wells/dia/diacritics-revised.htm">Orthographic diacritics and multilingual computing<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>”, J.C. Wells — a linguist at the University College in London — writes:</p> <blockquote> <p>“The term ‘caron’, however, is wrapped in mystery. 
Incredibly, it seems to appear in no current dictionary of English, not even the OED.”</p> </blockquote> <p>Whoever the originators were, we suspect that they have probably taken their secrets to the grave by now.</p> </section> </main> <!-- END CONTENTS --> <footer> <hr> <div><a href="https://www.unicode.org/copyright.html"> <img src="https://www.unicode.org/img/hb_notice.gif" alt="Access to Copyright and terms of use"></a></div> </footer> <script src="toc.js"></script> </body> </html>
