<!doctype html> <html lang="en-us"> <head> <meta charset="utf-8"> <meta content="initial-scale=1.0" name="viewport"> <meta name="keywords" content="IDN, Domain Name"> <meta name="description" content="International Domain Names (IDN)"> <title>FAQ - International Domain Names (IDN)</title> <link rel="stylesheet" href="https://www.unicode.org/webscripts/standard_styles.css"> <link rel="stylesheet" href="faq_styles_5.css"> <style> .linkstyle { font-family: "Courier New", Courier, monospace; text-decoration: underline; font-size: 90%; vertical-align:top; } .footnote { font-size: 90%; line-height: 100%; margin-bottom: 0; } .footnote+.footnote { margin-top: 0; padding-top: 0; } </style> </head> <body> <!-- BEGIN HEADER BAR --> <header> <nav> <a href="https://www.unicode.org/main.html">Tech Site</a> | <a href="https://www.unicode.org/sitemap/">Site Map</a> | <a href="https://www.unicode.org/search">Search</a> </nav> <div id="headercore"> <a href="https://www.unicode.org/"><img width="34" height="33" src="images/logo34x33.svg" alt="Unicode"></a> <a href="index.html">Frequently Asked Questions</a> </div> </header> <!-- END HEADER BAR --> <!-- BEGIN CONTENTS --> <main> <h1>Internationalized Domain Names (IDN) FAQ</h1> <nav class="faqtoc"> </nav> <section> <h2><a id="IDN"></a>Internationalized Domain Names</h2> <p class="q" id="2">Q: What is an Internationalized Domain Name (IDN)?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">Domain names</a>, such as &quot;macchiati.blogspot.com&quot;, were originally designed only to support <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a> characters. In 2003, the first specification was released that allows most Unicode characters to be used in domain names. This specification was replaced a few years later by <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>, which differs in some points. 
<a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> are supported by all modern browsers and email programs, so people can use links in their native languages, such as <a href="http://Bücher.de">http://Bücher.de<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>.</p> <p class="q" id="3">Q: Do IDNs change the Domain Name System (DNS)?</p> <p>No. Internally, the non-ASCII Unicode characters are transformed into a special sequence of <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a> characters. So as far as the DNS system is concerned, all <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a> are just ASCII.</p> <p class="q" id="4">Q: When will IDNs be available?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> have been defined and in use since 2003, initially under a system called &quot;<a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>&quot;. In 2010 a revised protocol was released as &quot;<a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>&quot;. At that time, ICANN also began to make <a class="glossarylink" href="https://www.unicode.org/glossary/#internationalized_domain_name">Internationalized Domain Names</a> available for top level domains, like the &quot;org&quot; in &quot;unicode.org&quot;. Since then these can also use non-ASCII characters, for example, &quot;.世界&quot; (.world). Across all domains, over 8 million IDNs had been registered worldwide by <a href="https://idnworldreport.eu/blog/21-aug-06-idn-trends-to-2021/">2020<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>, most of them in Chinese, Latin or Cyrillic.</p> <p class="q" id="5">Q: What is IDNA2008?</p> <p>It is a revision of <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>, approved in 2010. 
For most Unicode characters it produces the same results as IDNA2003, but there are important classes of characters for which it is not backwards compatible with IDNA2003. ICANN requires the use of <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> in the Root Zone and any <a class="glossarylink" href="https://www.unicode.org/glossary/#top_level_domain">top-level domain</a> which is under contract with ICANN. These <a href="https://www.icann.org/resources/pages/idn-guidelines-2011-09-02-en">guidelines<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> discuss some of the issues for a transition from IDNA2003 to <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> from the perspective of a <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain name</a> registry.</p> <p class="q" id="6">Q: How does IDNA2008 differ from IDNA2003?</p> <p>It disallows about eight thousand characters that used to be valid, including all <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.</p> <p class="q" id="7">Q: Which four characters are interpreted differently in IDNA2008?</p> <p>Four characters can cause an <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> implementation to go to a different web page than an <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> implementation, given the same source, such as <span class="linkstyle">href=&quot;http://faß.de&quot;</span>. 
These four characters include some that are quite common in languages such as German, Greek, Farsi, and Sinhala:</p> <blockquote> <p>U+00DF ( ß ) <span class="name">LATIN SMALL LETTER SHARP S</span><br> U+03C2 ( ς ) <span class="name">GREEK SMALL LETTER FINAL SIGMA</span><br> U+200C ( ) <span class="name">ZERO WIDTH NON-JOINER</span><br> U+200D ( ) <span class="name">ZERO WIDTH JOINER</span></p> </blockquote> <p>For the purposes of discussion of differences between <a class="glossarylink" href="https://www.unicode.org/glossary/#idna">IDNA</a> versions, these characters are called &quot;deviations&quot;.</p> <p class="q" id="7a">Q: What characters are valid in IDNA2008?</p> <p>The validity and status of characters under <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> is determined algorithmically from Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#character_properties">character properties</a> (with certain exceptions applied). Unicode provides the <a href="https://www.unicode.org/Public/idna/idna2008derived/">result of this computation </a>for every applicable version of the standard including the currently defined exceptions. If IETF publishes additional exceptions in the future, these will be reflected going forward.</p> </section> <section> <h2><a id="Migration"></a>Migration from IDNA2003</h2> <p class="q" id="8">Q: What is UTS #46?</p> <p><a href="https://www.unicode.org/reports/tr46/">UTS #46: Unicode IDNA Compatibility Processing</a>, also sometimes referred to as "TR46", is a Unicode specification that allows implementations to handle <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a> compatibly during the transition from <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> to <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>. 
The title is "Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#idna_compatibility_processing">IDNA Compatibility Processing</a>".</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#uts">UTS</a> #46 also provides a preprocessing specification for mapping that can be used with a standard <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> implementation.</p> <p class="q" id="9">Q: Is UTS #46 an IETF publication?</p> <p>No. <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> is an IETF specification, while <a class="glossarylink" href="https://www.unicode.org/glossary/#uts">UTS</a> #46 is a specification of the <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_consortium">Unicode Consortium</a>.</p> <p class="q" id="10">Q: Why is UTS #46 necessary?</p> <p>Browsers and other client software need to support existing pages, which were constructed under the <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> interpretation of internationalized <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a>. They also need to continue to meet their users' expectations, such as being able to type in <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> with <a class="glossarylink" href="https://www.unicode.org/glossary/#capital_letter">capital letters</a>, or to use the ideographic period in Japanese or Chinese domain names. 
In particular, the 4 &quot;deviation&quot; characters have the opportunity to cause significant security and usability problems; they and symbols can be phased out over time, but need some transitional support.</p> <p><a href="https://www.unicode.org/reports/tr46/">UTS #46</a> provides a <a class="glossarylink" href="https://www.unicode.org/glossary/#compatibility">compatibility</a> bridge that allows implementations to handle both IDNA2003 and <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> domain names. For the specification and more background information, see <a href="https://www.unicode.org/reports/tr46/">UTS #46</a>.</p> <p class="q" id="11">Q: What are examples of IRIs where characters behave differently under IDNA2008?</p> <p>Here is a table showing <a class="glossarylink" href="https://www.unicode.org/glossary/#internationalized_domain_name">internationalized domain names</a> in the context of IRIs, illustrating the differences in characters:</p> <table class="faq"> <tr> <th>URL</th> <th>IDNA2003</th> <th>UTS #46</th> <th>IDNA2008</th> <th>Comments </th> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://öbb.at">http://öbb.at<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="works">Valid</td> <td class="works">Valid</td> <td class="works">Valid</td> <td>Simple characters</td> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://ÖBB.at">http://ÖBB.at<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="works">Valid†</td> <td class="works">Valid†</td> <td class="fails">Disallowed</td> <td>Case mapping is not part of IDNA2008</td> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://√.com">http://√.com<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="works">Valid</td> <td class="works">Valid</td> <td class="fails">Disallowed</td> <td>Symbols are disallowed in 
IDNA2008</td> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://faß.de">http://faß.de<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="works">Valid†</td> <td class="works">Valid†</td> <td class="deviates">Valid</td> <td>Deviation (different resulting IP address in IDNA2008)</td> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://ԛәлп.com">http://ԛәлп.com<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="fails">Valid‡</td> <td class="works">Valid</td> <td class="works">Valid</td> <td>IDNA2003 only allows Unicode 3.2 characters, excluding U+051B (ԛ) <em>cyrillic qa</em></td> </tr> <tr> <td class="linkstyle">href=&quot;<a href="http://Ⱥbby.com">http://Ⱥbby.com<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>&quot;</td> <td class="fails">Valid‡</td> <td class="works">Valid†</td> <td class="fails">Disallowed</td> <td>IDNA2003 only allows Unicode 3.2 characters, excluding U+023A ( Ⱥ ) <em>latin A with stroke</em>; Case mapping is not part of IDNA2008</td> </tr> </table> <blockquote> <p class="footnote"><span class="works">†</span> Mapped to different characters, eg lowercase.</p> <p class="footnote"><span class="fails">‡</span> Note that the Unicode characters after 3.2 were valid on lookup, but not for registration.</p> </blockquote> <p>For a more detailed account of the similarities and differences, with character counts, see <em><a href="https://www.unicode.org/reports/tr46/#IDNAComparison">Section 7</a>, <a class="glossarylink" href="https://www.unicode.org/glossary/#idna">IDNA</a> Comparison</em> in <a href="https://www.unicode.org/reports/tr46/">UTS #46</a>. 
For a demonstration of differences between <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>, <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>, and the Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#idna_compatibility_processing">IDNA Compatibility Processing</a>, see the <a href="https://www.unicode.org/cldr/utility/idna.jsp">IDNA demo</a>.</p> <p class="q" id="13">Q: What are the main advantages of IDNA2008?</p> <p>The main advantages are:</p> <ul> <li>Updates the <a class="glossarylink" href="https://www.unicode.org/glossary/#repertoire">repertoire</a> of allowed characters from Unicode 3.2 to later versions</li> <li>Makes the process of updating to future Unicode versions (mostly) automatic</li> <li>Allows needed sequences (<a class="glossarylink" href="https://www.unicode.org/glossary/#combining_mark">combining marks</a> at the end of a <a class="glossarylink" href="https://www.unicode.org/glossary/#BIDI">bidi</a> label)</li> <li>Improves bidi restrictions (Arabic/Hebrew)</li> <li>Clarifies that it is the unmapped form of a <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain name</a> that is registered</li> <li>Makes it clear exactly what strings can be registered</li> </ul> <p>The classification of characters under <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> is based on a combination of Unicode <a class="glossarylink" href="https://www.unicode.org/glossary/#property">properties</a>, so implementations can compute it for all Unicode characters. The original <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> specification published tables based on Unicode 5.2, but also included some exceptional classifications. A series of additional RFCs publishes reviews of, and computed tables for, later Unicode versions. 
See <a href="https://www.rfc-editor.org/info/rfc8753">RFC8753<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> as well as the FAQ <a href="#7a">Q: What characters are valid in IDNA2008?</a>.</p> <p class="q" id="14">Q: What are the issues in migrating to IDNA2008?</p> <p>If <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> had not existed, there would be no migration issues for <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>. Given that IDNA2003 does exist, and is still widely deployed, it should be noted that IDNA2008:</p> <ul> <li>Changes the interpretation of the 4 characters known as <a href="#7">Deviations</a></li> <li>Discontinues IDNA2003 <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mappings</a> and mappings for other variants</li> <li>Excludes symbols and punctuation</li> <li>Allows arbitrary 'local' mappings, which may result in the same IRI resolving to different IP addresses, depending on the mapping used</li> </ul> </section> <section> <h2><a id="Security"></a>Mitigating Security Concerns for IDNs</h2> <p class="q" id="15aa">Q: What are typical security concerns introduced by IDN?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> allow a much wider range of character shapes than <a class="glossarylink" href="https://www.unicode.org/glossary/#ASCII">ASCII</a>, as well as scripts that have complex and nonlinear <a class="glossarylink" href="https://www.unicode.org/glossary/#rendering">rendering</a>. 
Here are some of the concerns that are specific to, or more prominent in, IDNs:</p> <ul> <li>IDNs include a wide range of character shapes, including characters that may be: <ul> <li>identical in appearance, even inside the same script</li> <li>from an unfamiliar language or script, yet look (exactly) like familiar characters</li> <li>historic, obsolete, rarely used or limited to a special domain of usage</li> </ul> </li> <li>IDNs include complex scripts, which add <a class="glossarylink" href="https://www.unicode.org/glossary/#character_sequence">character sequences</a> that: <ul> <li>users and <a class="glossarylink" href="https://www.unicode.org/glossary/#font">fonts</a> expect in a certain order</li> <li>may not render the same everywhere</li> <li>may not be in modern use and may not be supported</li> <li>may look (exactly) like some other character or sequence</li> <li>may require invisible <a class="glossarylink" href="https://www.unicode.org/glossary/#joiner">joiners</a> (<a class="glossarylink" href="https://www.unicode.org/glossary/#zwj">ZWJ</a> or <a class="glossarylink" href="https://www.unicode.org/glossary/#zwnj">ZWNJ</a>) for proper spelling of common words</li> </ul> </li> <li>IDNs include scripts written right to left which: <ul> <li>may lead to re-ordering of the <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain name</a> in display</li> </ul> </li> <li>IDNs include languages that have issues not found in English: <ul> <li>Different countries use different characters for the same letter in the same language</li> <li>Different countries use different characters with equivalent meaning in the same language</li> <li>Some languages have alternate representations for the same letter or <a class="glossarylink" href="https://www.unicode.org/glossary/#syllable">syllable</a>, where both are equally acceptable</li> <li>Users expect access to the same resource under either choice of character</li> </ul> </li> </ul> <p class="q" 
id="15bb">Q: What are typical ways to mitigate security concerns introduced by IDN?</p> <ul> <li>Limit the <a class="glossarylink" href="https://www.unicode.org/glossary/#repertoire">repertoire</a> to characters in widespread modern use</li> <li>Only support recommended scripts (see <a href="https://www.unicode.org/reports/tr31">UAX #31</a>)</li> <li>Allow only those <a class="glossarylink" href="https://www.unicode.org/glossary/#combining_character_sequence">combining character sequences</a> in actual use</li> <li>Add context constraints to ensure <a class="glossarylink" href="https://www.unicode.org/glossary/#character_sequence">character sequences</a> follow the structure of the script</li> <li>Prevent mixing of scripts in the same <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDN</a> label</li> <li>Prevent mixing of regional or language variants of some characters in the same IDN label</li> <li>Prevent registration of in-script or cross-script confusable labels</li> <li>Strictly limit the use of invisible <a class="glossarylink" href="https://www.unicode.org/glossary/#joiner">joiners</a> (<a class="glossarylink" href="https://www.unicode.org/glossary/#zwj">ZWJ</a> or <a class="glossarylink" href="https://www.unicode.org/glossary/#zwnj">ZWNJ</a>) to where they are absolutely needed</li> </ul> <p><a href="https://www.rfc-editor.org/info/rfc7940">RFC7940<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> provides the framework for a machine-readable implementation of these mitigation steps. For an example of policies intended to reduce the risk in a multi-script zone, see <a href="https://www.icann.org/sites/default/files/lgr/rz-lgr-5-overview-26may22-en.pdf">Root Zone Label Generation Rules<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>.</p> <p class="q" id="15c">Q: What is a confusable IDN label?</p> <p>Two labels are confusable if users may mistake one for the other. 
Confusable or look-alike appearance may exist between characters, combinations of characters, or both. Confusable characters may exist inside the same script, or across scripts. Disallowing mixed-script labels cuts down on possible combinations, but it is still possible to create single-script labels that are confusable across scripts.</p> <table class="simple center"> <tr><th colspan="2" style="text-align:center">Examples of confusable characters</th> </tr> <tr> <td style="text-align:center"><span style="font-size:200%">è</span><br> U+00E8<br> Latin<br> </td> <td style="text-align:center"><span style="font-size:200%">ѐ</span><br> U+0450<br> Cyrillic</td> </tr> <tr> <td style="text-align:center"><span style="font-size:200%">ə</span><br> U+0259<br> Latin</td> <td style="text-align:center"><span style="font-size:200%">ǝ</span><br> U+01DD<br> Latin</td> </tr> </table> <p>For more information, see <a class="glossarylink" href="https://www.unicode.org/glossary/#uts">UTS</a> #39, &quot;<a href="https://www.unicode.org/reports/tr39">Unicode Security Mechanisms</a>&quot;.</p> <p class="q" id="15">Q: How did the treatment of symbols like √ change from IDNA2003 to IDNA2008?</p> <p>While <span class="linkstyle">http://√.com</span> is valid in an <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> implementation, it would fail on an <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> implementation. At the time <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> was introduced, this affected 3,254 characters, most of which are rarely used. A small percentage of those are security risks because of confusability. The vast majority are unproblematic: for example, having <span class="linkstyle">http://I♥NY.com</span> doesn&#39;t cause security problems. 
IDNA2008 has additional tests that are based on the context in which characters are found, but they apply to few characters, and don't provide any appreciable increase in security.</p> <p>The issue may be different for newer symbols like <a class="glossarylink" href="https://www.unicode.org/glossary/#emoji">emoji</a>, particularly because the appearance of an emoji is not specified and users cannot know whether some image that looks a bit different is a different emoji or just a difference in rendition.</p> <p class="q" id="15d">Q: What makes emoji particularly unsuited for IDNs?</p> <p>In 2017, the ICANN Security and Stability Advisory Committee (SSAC) released an <a href="https://features.icann.org/ssac-advisory-use-emoji-domain-names">Advisory on the Use of Emoji in Domain Names<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>. In it, SSAC members conclude that <a class="glossarylink" href="https://www.unicode.org/glossary/#emoji">emoji</a> are fundamentally at odds with the way the DNS is designed to be an &quot;exact match lookup&quot; system. As key findings, they cite that different emoji may look highly similar to users because emoji lack a conventional, agreed-upon appearance such as that exhibited by letters in an <a class="glossarylink" href="https://www.unicode.org/glossary/#alphabet">alphabet</a>, and that their appearance is not prescribed or regulated by the emoji specification. Additionally, they note that the modifiers, such as for skin tone or color, as well as the combinations, make the set somewhat open-ended and subject to inconsistent implementation across devices. 
Taken together, these findings lead them to reject emoji in identifiers outright for <a class="glossarylink" href="https://www.unicode.org/glossary/#tld">TLDs</a> and to strongly warn against their use anywhere.</p> <p class="q" id="16">Q: What are typical security exploits?</p> <p>The vast majority of security exploits are of the form &quot;security-wellsfargo.com&quot;, where no special characters are involved. For more information, see Stéphane Bortzmeyer's blog entry, <a href="http://www.bortzmeyer.org/idn-et-phishing.html">idn-et-phishing<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> (in French). The most interesting studies cited there (originally from Mike Beltzner of&nbsp;Mozilla) are:</p> <ul> <li><em><a href="http://cups.cs.cmu.edu/soups/2006/proceedings/p79_downs.pdf" hreflang="en">Decision Strategies and Susceptibility to Phishing<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></em> by Downs, Holbrook &amp; Cranor</li> <li><em><a href="http://people.ischool.berkeley.edu/~hearst/papers/why_phishing_works.pdf" hreflang="en">Why Phishing Works<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></em> by Dhamija, Tygar &amp; Hearst</li> <li><em><a href="http://www.simson.net/ref/2006/CHI-security-toolbar-final.pdf" hreflang="en">Do Security Toolbars Actually Prevent Phishing Attacks<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></em> by Wu, Miller &amp; Garfinkel</li> <li><em><a href="http://www.cs.auckland.ac.nz/~pgut001/pubs/phishing.pdf" hreflang="en">Phishing Tips and Techniques<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a></em> by Gutmann.</li> </ul> <p>Even among the fraction of exploits that do involve confusable characters, merely limiting the allowed characters to letters and <a class="glossarylink" href="https://www.unicode.org/glossary/#digits">digits</a> doesn't by itself do anything 
about the most frequent sources of character-based spoofing: look-alike characters that are both letters, like &quot;<span class="linkstyle">http://paypal.com</span>&quot; with a Cyrillic &quot;a&quot;.</p> <p>According to data from Google, the removal of symbols and punctuation in <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> reduces opportunities for spoofing by only about 0.000016%, weighted by frequency. In another study at Google of a billion web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 <span class="name">MULTIPLICATION SIGN</span> - by far the most common confusable); but that page could be even better spoofed with х (U+0445 <span class="name">CYRILLIC SMALL LETTER HA</span>), which normally has precisely the same displayed shape as &quot;x&quot;.</p> <p>For a demo of confusable characters, and the effects of various restrictions, see the <a href="https://www.unicode.org/cldr/utility/confusables.jsp">confusables demo</a>.</p> <p>This points to the need to carefully consider spoofing issues within the <a class="glossarylink" href="https://www.unicode.org/glossary/#repertoire">repertoire</a> of letters allowed for registration for a given zone. Removing rarely used, and therefore unfamiliar, letter forms is one strategy. Another is to use the mechanisms of <a href="https://www.rfc-editor.org/info/rfc7940">RFC7940<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> to create a machine-readable specification for limiting sequences of letters that are structurally invalid for a script and liable to be rendered like a different, valid sequence. A third strategy is to define characters that are variants of each other. 
This information can be used to enforce mutual exclusion of labels containing look-alike characters, whether in-script or cross-script.</p> <p>Programmers also need to be aware of the issues detailed in <a href="https://www.unicode.org/reports/tr36/">UTR #36: Unicode Security Considerations</a>; the mechanisms for detecting potentially visually confusable characters are found in the associated <a href="https://www.unicode.org/reports/tr39/">UTS #39: Unicode Security Mechanisms</a>.</p> <p class="q" id="20">Q: Are the local mappings in IDNA2008 just a UI issue?</p> <p>No, not if that means they are only involved in interactions with the address bar.</p> <p><i>Example:</i></p> <ul> <li>Alice sees that a URL works in her browser (say <span class="linkstyle">http://faß.de</span> or <span class="linkstyle">http://TÜRKIYE.com</span>). She sends it to Bob in an email<span >. Bob clicks on the link in his email, and doesn't find a site or goes to a wrong (and potentially malicious) site</span>, because his browser maps to <span class="linkstyle">http://fass.de</span> or <span class="linkstyle">http://türkiye.com</span> while Alice&#39;s maps to <span class="linkstyle">http://faß.de</span> or <span class="linkstyle">http://türkıye.com</span>.</li> </ul> <p>There are parallel examples with web pages, IM chats, Word documents, etc.</p> <ul> <li>Alice creates a web page, using &lt;a href=&quot;<span class="linkstyle">http://faß.de</span>&quot;&gt; (or <span class="linkstyle">http://TÜRKIYE.com</span>). <span >Bob clicks on the link in the web page, and doesn't find a site or goes to a wrong (and potentially malicious) site</span>.</li> <li>Alice is in an IM chat with Bob. She copies in <span class="linkstyle">http://faß.de</span> (or <span class="linkstyle">http://TÜRKIYE.com</span>) and hits return. Bob clicks on the link he sees in his chat window. 
<span >He doesn't find a site or goes to a wrong (and potentially malicious) site</span>.</li> <li>Alice sends a Word document to Bob with a link in it...</li> <li>Alice creates a PDF document...</li> </ul> <p class="q" id="21">Q: Do the local-mapping exploits require unscrupulous registries?</p> <p>No. The exploits do not require unscrupulous registries; they only require that registries fail to police every URL that they register for possible spoofing behavior.</p> <p>The local mappings matter to security, because entering the same URL on two different browsers may go to two different IP addresses when the two browsers have different local mappings. The same thing could happen within an email program that is parsing for URLs, and then opening a browser. Moreover, they are even more problematic if they affect the interpretation of web pages, in cases such as href=&quot;<span class="linkstyle">http://TÜRKIYE.com</span>&quot;.</p> </section> <section> <h2><a id="script"></a>Script, language and IDNs</h2> <p class="q" id="22a">Q: What is the issue with German sharp s (ß) versus &quot;ss&quot;?</p> <p>In German, the standard <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> of ß is &quot;SS&quot;, the same as the uppercase of &quot;ss&quot;. Note, for example, that on the German language page for <a href="http://www.uni-giessen.de">http://www.uni-giessen.de<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>, &quot;Gießen&quot; is spelled with ß, but the logo for the university (see the top left corner of the page) is spelled with GIESSEN. The situation is even more complicated:</p> <ul> <li>In Switzerland, &quot;ss&quot; is uniformly used instead of ß.</li> <li>The 1996 spelling reform in Germany and Austria changed whether ß or &quot;ss&quot; is used in many words. 
For example, <span class="linkstyle">http://Schloß.de</span> was the spelling before 1996, and <span class="linkstyle">http://Schloss.de</span> is the spelling after.</li> <li>There are a number of word pairs in German where the distinction between ß and &quot;ss&quot; is the only difference; there are also words that have both in the same word. Examples: Masse, Maße, Massenzusammenstoß.</li> <li>In Unicode 5.1, an uppercase version of ß was added (ẞ). Although it has since been officially recognized as an alternate uppercase, it is not currently the preferred uppercase of ß in German standards, nor is it known whether it will ever become the preferred uppercase. Unicode now treats all of these as a single <a class="glossarylink" href="https://www.unicode.org/glossary/#equivalence">equivalence</a> class for case-insensitive matching: {ss, ß, SS, ẞ}. See also the <a href="https://www.unicode.org/faq/casemap_charprop.html#11">Unicode FAQ</a>.</li> <li>The <a href="http://www.nic.de">German NIC<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> (responsible for .de) has supported separate registration of domains with ß and with &quot;ss&quot; since 2010.</li> <li>The <a href="http://www.nic.at">Austrian<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> <a href="http://en.wikipedia.org/wiki/Network_Information_Centre">NIC<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> (responsible for .at) favored keeping the mapping from ß to &quot;ss&quot; and <a href="https://www.nic.at/media/files/pdf/IDN_Charset.pdf">does not allow<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> ß.</li> <li>Some of the new gTLD registries are treating ß and &quot;ss&quot; as variants to ensure proper resolution of names that would otherwise be mapped to &quot;ss&quot;.</li> <li>The <a 
href="https://www.icann.org/sites/default/files/lgr/rz-lgr-5-latin-script-26may22-en.html">Root Zone Label Generation Rules for the Latin Script<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>, published by ICANN in 2022, allow either ß or &quot;ss&quot; at a given position, but with a provision for bundling a label having all &quot;ss&quot; with any label containing ß.</li> </ul> <p class="q" id="22b">Q: What is the issue with Greek final sigma (ς)?</p> <p>The Greek sigma (σ) takes a final form (ς) at the end of a word. In <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a>, where there are no spaces, a final form may show up in the middle of a label. Because both have the same <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> form (Σ), labels with and without final sigma cannot be distinguished when presented to the user as uppercase (and <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> maps them together). The <a href="https://www.icann.org/sites/default/files/lgr/rz-lgr-5-greek-script-26may22-en.html">Root Zone Label Generation Rules for the Greek Script<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> allow either (σ) or (ς) at a given position, but with a provision for bundling a label having all (σ) with any label containing a final sigma (ς). These are also considered variants in the Greek ccTLD.</p> <p class="q" id="25">Q: Aren't the problems with eszett and final sigma just the same as with l, I, and 1?</p> <p>The eszett and sigma are fundamentally different from I (capital i), l (<a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> L), and 1 (<a class="glossarylink" href="https://www.unicode.org/glossary/#digits">digit</a> one). 
With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:</p> <blockquote> <p class="linkstyle">http://goog1e.com</p> </blockquote> <p>In the following hypothetical example using a <a class="glossarylink" href="https://www.unicode.org/glossary/#top_level_domain">top-level domain</a> "xx", browsers that use <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> will go to a different location than browsers that use a strict version of <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a>, <em>unless</em> the registry for xx puts into place a <a href="https://www.unicode.org/reports/tr46/#Registries">bundling</a> strategy.</p> <blockquote> <p class="linkstyle">http://gießen.xx</p> </blockquote> <p>The same goes for Greek <em>sigma</em>, which is a more common character in Greek than the <em>eszett</em> is in German.</p> <p class="q" id="22">Q: Why does IDNA2003 map final sigma (ς) to sigma (σ) and (ß) to &quot;ss&quot; and (i) to (ı)?</p> <p>This decision about the mapping of these characters followed recommendations for case-insensitive matching in the Unicode Standard. These characters are anomalous: the <a class="glossarylink" href="https://www.unicode.org/glossary/#uppercase">uppercase</a> of ς is Σ, the same as the uppercase of σ. Note that the text &quot;ΒόλοΣ.com&quot;, which appears on <span class="linkstyle">http://Βόλος.com</span>, illustrates this: the normal <a class="glossarylink" href="https://www.unicode.org/glossary/#case_mapping">case mapping</a> of Σ is to σ. If σ and ς were not treated as case variants in Unicode, there wouldn&#39;t be a match between ΒόλοΣ and Βόλος.</p> <p>For full case insensitivity (with transitivity), {σ, ς, Σ}, {i, ı, I and İ} and {ss, ß, SS} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. 
That is what is done in the Unicode Standard, which was followed by <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>. While IDNA2003 did not have to have full case transitivity, that is water under the bridge.</p> <p class="q" id="17">Q: How does IDNA2008 improve handling of Arabic and Hebrew (BIDI)?</p> <p>Arabic and Hebrew <a class="glossarylink" href="https://www.unicode.org/glossary/#writing_system">writing systems</a> are known as <a class="glossarylink" href="https://www.unicode.org/glossary/#BIDI">bidi</a> (bidirectional) because text runs from right-to-left and numbers (or embedded Latin characters) from left-to-right. <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> does a better job of restricting labels that lead to &quot;bidi label hopping&quot;. This is where bidi reordering causes characters from one label to appear to be part of another label. For example, &quot;B1.2d&quot; in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter) would display as &quot;1.2dB&quot;. For more information, see the <a href="https://www.unicode.org/cldr/utility/bidi.jsp">Unicode bidi demo</a>.</p> <p>While these new bidi rules go a long way towards reducing this problem, they do not completely eliminate it because they do not check for inter-label problems.</p> <p class="q" id="23">Q: Why allow ZWJ/ZWNJ at all?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#zwj">ZWJ</a> and <a class="glossarylink" href="https://www.unicode.org/glossary/#zwnj">ZWNJ</a> are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>, but are deleted so that they do not allow spoofs. 
During the development of Unicode, the ZWJ and ZWNJ were intended only for presentation; that is, they would make no difference in the semantics of a word.</p> <p>However, in some cases, what used to be presentational alternatives became semantically distinct. For example, there are words such as the Sinhala name of the country of Sri Lanka (ශ්‍රී ලංකාව), which require preservation of these <a class="glossarylink" href="https://www.unicode.org/glossary/#joiner">joiners</a> (in this case, ZWJ) to achieve the correct spelling. The Root Zone excludes the ZWJ because of the heightened security sensitivity for the Root Zone. However, the <a href="https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-sinhala-script-22apr21-en.html">Reference Label Generation Rules for the Sinhala Script<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> published by ICANN for use on the second level allow the use of ZWJ, but only in the context of a few dozen explicitly enumerated combinations.</p> <p class="q" id="24">Q: But aren't the deviation characters needed for the orthographies of some languages?</p> <p>While these are full parts of the orthographies of the languages in question, neither <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a> nor <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a> ever claimed that all parts of every language's orthographies are representable in <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a>. 
There are trivial examples even in English, like the word <strong><em>can't</em></strong> (vs <em><strong>cant</strong></em>) or <strong><em>Wendy's/Arby's Group</em></strong>, which use standard English orthography but cannot be represented faithfully in a domain name.</p> <p class="q" id="26b">Q: Are there registries that restrict domain names on the basis of language?</p> <p>While it may be difficult to find a clear cutoff for restricting <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> on the basis of language, there are many registries that have <a href="https://www.iana.org/domains/idn-tables">language-specific registration policies<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>, and ICANN publishes a set of language-specific <a href="https://www.icann.org/resources/pages/second-level-lgr-2015-06-21-en">Reference LGRs for the Second Level<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a>.</p> <p>The main concern is that the set of letters used in a particular language is not well defined. The &quot;core&quot; letters typically are, but many additional ones may be accepted in loan words, and have perfectly legitimate commercial and social use. Sometimes the same language used in different regions may use different letters; other times, the interest may be more in supporting a particular country than a specific language. The latter applies to many ccTLDs. In all of these cases, the allowed <a class="glossarylink" href="https://www.unicode.org/glossary/#repertoire">repertoire</a> may not be strictly language-based but will be a subset of a full script's repertoire.</p> <p class="q" id="26a">Q: Are there registries that restrict domain names on the basis of script?</p> <p>It is a bit easier to maintain a clear distinction based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). 
However, some languages, such as Japanese, require multiple scripts, and in such cases mixtures of scripts may be appropriate. One can have <span class="linkstyle">http://SONY日本.com</span> with no problems at all, while there are many cases of &quot;homographs&quot; (visually confusable characters) within the same script that a restriction based on script doesn&#39;t deal with.</p> <p>As one prominent example, the DNS Root Zone supports <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a> on the basis of script: with few exceptions for inherently multi-script <a class="glossarylink" href="https://www.unicode.org/glossary/#writing_system">writing systems</a>, each label must be in a single script. However, labels from different scripts share the Root Zone. The issue of true homographs within and across scripts is addressed not by <a class="glossarylink" href="https://www.unicode.org/glossary/#repertoire">repertoire</a> restriction but by mutual exclusion via definition of variants.</p> <p class="q" id="15a">Q: What is a recommended script?</p> <p><a class="glossarylink" href="https://www.unicode.org/glossary/#uax">UAX</a> #31 defines a number of scripts as &quot;Recommended&quot; for use in identifiers. All of these scripts are in &quot;widespread common everyday use&quot; by large communities for writing modern languages, and are actively being used by the respective user communities to conduct their ordinary and daily online business. Where there is some, but not a sufficient, level of use, a script may be designated &quot;Limited_Use&quot;. This classification does not absolutely prevent the use of these scripts for <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> in any zone other than the DNS Root Zone, nor does it affect the use of the script in creating online content. 
Scripts limited to specialized use only, like archaic scripts, are classified as &quot;Excluded&quot;.</p> <p class="q" id="15b">Q: Can additional scripts become recommended?</p> <p>Which scripts are used to write a given language may change over time. Whether a script is recommended or not is not frozen in time, so Unicode is able to track such changes in usage. A number of scripts that are classified as Limited_Use have the potential to become recommended, if at some point in the future their observed and documented level of usage rises to the level of &quot;widespread common everyday use&quot;. Any suggestion to make such a change would need to be accompanied by thorough documentation of pervasive online use of the script in daily life. Unlike the use of a script to publish dictionaries or otherwise digitally preserve a written culture, the use of <a class="glossarylink" href="https://www.unicode.org/glossary/#idn">IDNs</a> is to facilitate day-to-day online interactions by users of the script. Therefore, the degree to which such a language community engages in online transactions using that script is the most important data point.</p> <p class="q" id="26">Q: Should the IDNA protocol restrict allowed domain names on the basis of language or script?</p> <p>The rough consensus among the IETF <a class="glossarylink" href="https://www.unicode.org/glossary/#idna">IDNA</a> working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a> is no different than <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>. IDNA doesn&#39;t try to attack the homograph problem, because it is too difficult to maintain a clear distinction. 
Effective solutions depend on information or capabilities outside of the protocol&#39;s control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.</p> <p>Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country&#39;s languages. Application software can also take certain precautions: Microsoft Edge, Safari, Firefox, and Chrome all display <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain names</a> in Unicode only if the user&#39;s language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see <a href="https://www.unicode.org/reports/tr36/">UTR #36: Unicode Security Considerations</a> on the Unicode web site.</p> </section> <section> <h2><a id="Implementation"></a>Implementation Issues and Strategies for IDN</h2> <p class="q" id="27">Q: Are there differences in mapping between UTS #46 and IDNA2003?</p> <p>No. There are, however, 56 characters that are valid or mapped under <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>, but are disallowed by <a class="glossarylink" href="https://www.unicode.org/glossary/#uts">UTS</a> #46. 
For a detailed table of differences between <em>UTS #46</em> and <em><a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a></em>, see <em><a href="https://www.unicode.org/reports/tr46/#IDNAComparison">Section 7</a>, <a class="glossarylink" href="https://www.unicode.org/glossary/#idna">IDNA</a> Comparison</em> in <a href="https://www.unicode.org/reports/tr46/">UTS #46</a>.</p> <p>In particular, there are collections of characters that would have changed mapping according to NFKC_Casefold after Unicode 3.2, unless they were specifically excluded. All of these characters are extremely rare, and do not require any special handling.</p> <p><strong>Case Pairs.</strong> These are characters that did not have corresponding <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> characters in Unicode 3.2, but had lowercase characters added later.</p> <blockquote> <p>U+04C0 ( Ӏ ) <span class="name">CYRILLIC LETTER PALOCHKA</span><br> U+10A0 ( Ⴀ ) <span class="name">GEORGIAN CAPITAL LETTER AN</span>…U+10C5 ( Ⴥ ) <span class="name">GEORGIAN CAPITAL LETTER HOE</span><br> U+2132 ( Ⅎ ) <span class="name">TURNED CAPITAL F</span><br> U+2183 ( Ↄ ) <span class="name">ROMAN NUMERAL REVERSED ONE HUNDRED</span></p> </blockquote> <p>After Unicode 3.2, the <a class="glossarylink" href="https://www.unicode.org/glossary/#unicode_consortium">Unicode Consortium</a> has stabilized <a class="glossarylink" href="https://www.unicode.org/glossary/#case_folding">case folding</a>, so that further examples will not occur in the future. 
That is, case pairs will be assigned in the same version of Unicode—so any newly <a class="glossarylink" href="https://www.unicode.org/glossary/#assigned_character">assigned character</a> will either have a case folding in that version of Unicode, or it will never have a case folding in the future.</p> <p><strong><a class="glossarylink" href="https://www.unicode.org/glossary/#normalization">Normalization</a> Mappings.</strong> These are five characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0: see <a href="https://www.unicode.org/versions/corrigendum4.html" rel="nofollow" target="_blank">Corrigendum #4: Five Unihan Canonical Mapping Errors</a>). As of Unicode 5.1, normalization is completely stabilized, so these are the only such characters.</p> <p class="q" id="29">Q: What are possible strategies for preparing IDNs in a display form preferred by target sites?</p> <p>Labels presented to a browser may or may not be in the display form preferred by a target site. For example, a site may have a preferred display form of “HumanEvents.com”, but an href tag in another site may display “HumaneVents.com”. Similarly, a user may type “Floß.com” in the browser’s address bar, and that would resolve to the site “floss.com”, though it is unclear whether the display form preferred by owners of that site is “Floss.com”, “floss.com”, “Floß.com”, or “floß.com”. There is no way currently for the browser to know whether the labels are in a preferred form or not.</p> <p>It may be useful to develop mechanisms to allow browsers to determine the display form preferred by a target site, and then for browsers to display that form. One could foresee something being developed along the lines of the <a href="http://en.wikipedia.org/wiki/Favicon">favicon<img src="https://www.unicode.org/img/external_link.png" alt="external link"></a> approach. The mechanisms would need to have restrictions put into place to address misrepresentations. 
For example, the browser should verify that the site's preferred display form has the same lookup form: if the href is &quot;<span class="linkstyle">http://βόλοσ.com</span>&quot;, and the site's preferred display form is &quot;<span class="linkstyle">http://Βόλος.com</span>&quot;, then the preferred display form could be used; if the site's preferred display form is &quot;<span class="linkstyle">http://Βόλλος.com</span>&quot;, then it would not be used, because it doesn't have the same lookup form as the href. Other security checks should be made, such as to prevent display forms like &quot;appIe.com&quot; (with a capital I) for &quot;appie.com&quot; (with a <a class="glossarylink" href="https://www.unicode.org/glossary/#lowercase">lowercase</a> i).</p> <p class="q" id="32">Q: How are label delimiters handled in implementations of IDNA?</p> <p>The processing of <a href="https://www.unicode.org/reports/tr46/">UTS #46</a> matches what is commonly done with label delimiters by browsers, whereby characters containing periods are transformed into the <a class="glossarylink" href="https://www.unicode.org/glossary/#nfkc">NFKC</a> format <i>before</i> labels are separated. This allows the <a class="glossarylink" href="https://www.unicode.org/glossary/#domain_name">domain name</a> to be mapped in a single pass, rather than label by label. However, except for the four label separators provided by <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2003">IDNA2003</a>, all input characters that would map to a period are disallowed. For example, <code><a href="https://www.unicode.org/cldr/utility/character.jsp?a=2488">U+2488</a></code> ( ⒈ ) <span class="name">DIGIT ONE FULL STOP</span> has a <a class="glossarylink" href="https://www.unicode.org/glossary/#decomposition">decomposition</a> that maps to a period, and is thus disallowed. 
The exact list of characters can be seen with the Unicode utilities using a regular expression:</p> <blockquote> <p><a title="https://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toNFKC=/\./:]" href="https://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:toNFKC=/%5C./:%5D">https://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{toNFKC=/\./}</a></p> </blockquote> <p>The question also arises as to how to handle escaped periods (such as %2E). While escaping of periods is outside of the scope of this document, it is useful to see how both of these cases are handled in current browsers:</p> <table class="faq center"> <tr> <td>Input:</td> <td>http://à%2Ecom</td> <td>%2E</td> <td>http://à⒈com</td> <td style="text-align:center">⒈</td> </tr> <tr> <td>Microsoft Edge</td> <td class="works">http://xn--0ca.com/</td> <td class="works">= &quot;.&quot;</td> <td class="works">http://xn--1-rfa.com/</td> <td class="works">= &quot;1.&quot; </td> </tr> <tr> <td>Firefox</td> <td class="fails">http://www.xn--.com-hta.com/</td> <td class="fails">≠ &quot;.&quot;</td> <td class="works">http://xn--1-rfa.com/</td> <td class="works">= &quot;1.&quot;</td> </tr> <tr> <td>Safari / Chrome</td> <td class="works">http://xn--0ca.com/</td> <td class="works">= &quot;.&quot;</td> <td class="fails">http://xn--1.com-qqa/</td> <td class="fails">≠ &quot;1.&quot;</td> </tr> </table> <p>There are three possible behaviors for characters such as <code><a href="https://www.unicode.org/cldr/utility/character.jsp?a=2488">U+2488</a></code> ( ⒈ ) <span class="name">DIGIT ONE FULL STOP</span>:</p> <ol> <li>The dot behaves like a label separator.</li> <li>The character is rejected.</li> <li>The dot is included in the label, as shown in the garbled <a class="glossarylink" href="https://www.unicode.org/glossary/#punycode">punycode</a> seen above in the ≠ cases.</li> </ol> <p>The conclusion of the Unicode Technical Committee was that the best behavior for <a 
href="https://www.unicode.org/reports/tr46/">UTS #46</a> was #2, to forbid all characters (other than the 4 label separators) that contained a <span class="name">FULL STOP</span> in their <a class="glossarylink" href="https://www.unicode.org/glossary/#compatibility_decomposition">compatibility decompositions</a>. This is the same behavior as IDNA2003. Although this policy is not the current policy of the majority of browser implementations, the browser vendors agreed that the change is desirable.</p> <p class="q" id="33">Q: For IDNA2008, what is the derivation of valid characters in terms of Unicode properties?</p> <p>Using formal set notation, the following describes the set of allowed characters defined by <a href="https://www.unicode.org/reports/tr41/#IDNA2008">IDNA2008</a>. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters defined by the Tables document of <a class="glossarylink" href="https://www.unicode.org/glossary/#idna2008">IDNA2008</a>. Unicode provides the <a href="https://www.unicode.org/Public/idna/idna2008derived/">result of this derivation</a> for every applicable version of the standard, including the currently defined exceptions.</p> <table class="faq"> <tr> <th style="width:50%">Formal Sets</th> <th>Descriptions</th> </tr> <tr> <td><code>[ \P{Changes_When_NFKC_Casefolded}</code></td> <td>Start with characters that are NFKC Case folded (as in IDNA2003)</td> </tr> <tr> <td><code>\- \p{c} - \p{z}</code></td> <td>Remove Control Characters and Whitespace (as in IDNA2003)</td> </tr> <tr> <td><code>\- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}</code></td> <td>Remove Symbols, Punctuation, non-decimal Numbers, and Enclosing Marks</td> </tr> <tr> <td><code>\- \p{HST=L} - \p{HST=V} - \p{HST=T}</code></td> <td>Remove characters used for archaic Hangul (Korean)</td> </tr> <tr> <td><code>\- \p{block=Combining_Diacritical_Marks_For_Symbols}<br> - \p{block=Musical_Symbols}<br> - \p{block=Ancient_Greek_Musical_Notation}</code></td> 
<td>Remove three blocks of technical or archaic symbols.</td> </tr> <tr> <td><code>\- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]</code></td> <td>Remove certain exceptions:<br> U+0640 ( ‎ـ‎ ) <span class="name">ARABIC TATWEEL</span><br> U+07FA ( ‎ߺ‎ ) <span class="name">NKO LAJANYALAN</span><br> U+302E (&nbsp;〮 ) <span class="name">HANGUL SINGLE DOT TONE MARK</span><br> U+302F (&nbsp;〯 ) <span class="name">HANGUL DOUBLE DOT TONE MARK</span><br> U+3031 ( 〱 ) <span class="name">VERTICAL KANA REPEAT MARK</span><br> U+3032 ( 〲 ) <span class="name">VERTICAL KANA REPEAT WITH VOICED SOUND MARK</span><br> ..<br> U+3035 ( 〵 ) <span class="name">VERTICAL KANA REPEAT MARK LOWER HALF</span><br> U+303B ( 〻 ) <span class="name">VERTICAL IDEOGRAPHIC ITERATION MARK</span></td> </tr> <tr> <td><code>\+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]<br> + [\u002D \u06FD \u06FE \u0F0B \u3007] </code></td> <td>Add certain exceptions:<br> U+00B7 ( · ) <span class="name">MIDDLE DOT</span><br> U+0375 ( ͵ ) <span class="name">GREEK LOWER NUMERAL SIGN</span><br> U+05F3 ( ‎׳‎ ) <span class="name">HEBREW PUNCTUATION GERESH</span><br> U+05F4 ( ‎״‎ ) <span class="name">HEBREW PUNCTUATION GERSHAYIM</span><br> U+30FB ( ・ ) <span class="name">KATAKANA MIDDLE DOT</span><br> <em>plus</em><br> U+002D ( - ) <span class="name">HYPHEN-MINUS</span><br> U+06FD ( ‎۽‎ ) <span class="name">ARABIC SIGN SINDHI AMPERSAND</span><br> U+06FE ( ‎۾‎ ) <span class="name">ARABIC SIGN SINDHI POSTPOSITION MEN</span><br> U+0F0B ( ་ ) <span class="name">TIBETAN MARK INTERSYLLABIC TSHEG</span><br> U+3007 ( 〇 ) <span class="name">IDEOGRAPHIC NUMBER ZERO</span> </td> </tr> <tr> <td><code>\+ [\u00DF \u03C2]</code><br> <code>\+ \p{JoinControl}]</code></td> <td>Add special exceptions (Deviations):<br> U+00DF ( ß ) <span class="name">LATIN SMALL LETTER SHARP S</span><br> U+03C2 ( ς ) <span class="name">GREEK SMALL LETTER FINAL SIGMA</span><br> U+200C ( ) <span class="name">ZERO WIDTH NON-JOINER</span><br> U+200D ( ) <span 
class="name">ZERO WIDTH JOINER</span></td> </tr> </table> </section> </main> <!-- END CONTENTS --> <footer> <hr> <div><a href="https://www.unicode.org/copyright.html"> <img src="https://www.unicode.org/img/hb_notice.gif" alt="Access to Copyright and terms of use"></a></div> </footer> <script src="toc.js"></script> </body> </html>
