<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="color-scheme" content="light dark"> <title>PEP 672 – Unicode-related Security Considerations for Python | peps.python.org</title> <link rel="shortcut icon" href="../_static/py.png"> <link rel="canonical" href="https://peps.python.org/pep-0672/"> <link rel="stylesheet" href="../_static/style.css" type="text/css"> <link rel="stylesheet" href="../_static/mq.css" type="text/css"> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" media="(prefers-color-scheme: light)" id="pyg-light"> <link rel="stylesheet" href="../_static/pygments_dark.css" type="text/css" media="(prefers-color-scheme: dark)" id="pyg-dark"> <link rel="alternate" type="application/rss+xml" title="Latest PEPs" href="https://peps.python.org/peps.rss"> <meta property="og:title" content='PEP 672 – Unicode-related Security Considerations for Python | peps.python.org'> <meta property="og:description" content="This document explains possible ways to misuse Unicode to write Python programs that appear to do something else than they actually do."> <meta property="og:type" content="website"> <meta property="og:url" content="https://peps.python.org/pep-0672/"> <meta property="og:site_name" content="Python Enhancement Proposals (PEPs)"> <meta property="og:image" content="https://peps.python.org/_static/og-image.png"> <meta property="og:image:alt" content="Python PEPs"> <meta property="og:image:width" content="200"> <meta property="og:image:height" content="200"> <meta name="description" content="This document explains possible ways to misuse Unicode to write Python programs that appear to do something else than they actually do."> <meta name="theme-color" content="#3776ab"> </head> <body> <svg xmlns="http://www.w3.org/2000/svg" style="display: none;"> <symbol id="svg-sun-half" viewBox="0 0 24 24" pointer-events="all"> <title>Following system colour 
scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"> <circle cx="12" cy="12" r="9"></circle> <path d="M12 3v18m0-12l4.65-4.65M12 14.3l7.37-7.37M12 19.6l8.85-8.85"></path> </svg> </symbol> <symbol id="svg-moon" viewBox="0 0 24 24" pointer-events="all"> <title>Selected dark colour scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"> <path stroke="none" d="M0 0h24v24H0z" fill="none"></path> <path d="M12 3c.132 0 .263 0 .393 0a7.5 7.5 0 0 0 7.92 12.446a9 9 0 1 1 -8.313 -12.454z"></path> </svg> </symbol> <symbol id="svg-sun" viewBox="0 0 24 24" pointer-events="all"> <title>Selected light colour scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"> <circle cx="12" cy="12" r="5"></circle> <line x1="12" y1="1" x2="12" y2="3"></line> <line x1="12" y1="21" x2="12" y2="23"></line> <line x1="4.22" y1="4.22" x2="5.64" y2="5.64"></line> <line x1="18.36" y1="18.36" x2="19.78" y2="19.78"></line> <line x1="1" y1="12" x2="3" y2="12"></line> <line x1="21" y1="12" x2="23" y2="12"></line> <line x1="4.22" y1="19.78" x2="5.64" y2="18.36"></line> <line x1="18.36" y1="5.64" x2="19.78" y2="4.22"></line> </svg> </symbol> </svg> <script> document.documentElement.dataset.colour_scheme = localStorage.getItem("colour_scheme") || "auto" </script> <section id="pep-page-section"> <header> <h1>Python Enhancement Proposals</h1> <ul class="breadcrumbs"> <li><a href="https://www.python.org/" title="The Python Programming Language">Python</a> &raquo; </li> <li><a href="../pep-0000/">PEP Index</a> &raquo; </li> <li>PEP 672</li> </ul> <button id="colour-scheme-cycler" onClick="setColourScheme(nextColourScheme())"> <svg aria-hidden="true" 
class="colour-scheme-icon-when-auto"><use href="#svg-sun-half"></use></svg> <svg aria-hidden="true" class="colour-scheme-icon-when-dark"><use href="#svg-moon"></use></svg> <svg aria-hidden="true" class="colour-scheme-icon-when-light"><use href="#svg-sun"></use></svg> <span class="visually-hidden">Toggle light / dark / auto colour theme</span> </button> </header> <article> <section id="pep-content"> <h1 class="page-title">PEP 672 – Unicode-related Security Considerations for Python</h1> <dl class="rfc2822 field-list simple"> <dt class="field-odd">Author<span class="colon">:</span></dt> <dd class="field-odd">Petr Viktorin &lt;encukou&#32;&#97;t&#32;gmail.com&gt;</dd> <dt class="field-even">Status<span class="colon">:</span></dt> <dd class="field-even"><abbr title="Currently valid informational guidance, or an in-use process">Active</abbr></dd> <dt class="field-odd">Type<span class="colon">:</span></dt> <dd class="field-odd"><abbr title="Non-normative PEP containing background, guidelines or other information relevant to the Python ecosystem">Informational</abbr></dd> <dt class="field-even">Created<span class="colon">:</span></dt> <dd class="field-even">01-Nov-2021</dd> <dt class="field-odd">Post-History<span class="colon">:</span></dt> <dd class="field-odd">01-Nov-2021</dd> </dl> <hr class="docutils" /> <section id="contents"> <details><summary>Table of Contents</summary><ul class="simple"> <li><a class="reference internal" href="#abstract">Abstract</a></li> <li><a class="reference internal" href="#introduction">Introduction</a></li> <li><a class="reference internal" href="#acknowledgement">Acknowledgement</a></li> <li><a class="reference internal" href="#confusing-features">Confusing Features</a><ul> <li><a class="reference internal" href="#ascii-only-considerations">ASCII-only Considerations</a><ul> <li><a class="reference internal" href="#confusables-and-typos">Confusables and Typos</a></li> <li><a class="reference internal" href="#control-characters">Control 
Characters</a></li> </ul> </li> <li><a class="reference internal" href="#confusable-characters-in-identifiers">Confusable Characters in Identifiers</a></li> <li><a class="reference internal" href="#confusable-digits">Confusable Digits</a></li> <li><a class="reference internal" href="#bidirectional-text">Bidirectional Text</a></li> <li><a class="reference internal" href="#bidirectional-marks-embeddings-overrides-and-isolates">Bidirectional Marks, Embeddings, Overrides and Isolates</a></li> <li><a class="reference internal" href="#normalizing-identifiers">Normalizing identifiers</a></li> <li><a class="reference internal" href="#source-encoding">Source Encoding</a></li> </ul> </li> <li><a class="reference internal" href="#open-issues">Open Issues</a></li> <li><a class="reference internal" href="#references">References</a></li> <li><a class="reference internal" href="#copyright">Copyright</a></li> </ul> </details></section> <section id="abstract"> <h2><a class="toc-backref" href="#abstract" role="doc-backlink">Abstract</a></h2> <p>This document explains possible ways to misuse Unicode to write Python programs that appear to do something else than they actually do.</p> <p>This document does not give any recommendations and solutions.</p> </section> <section id="introduction"> <h2><a class="toc-backref" href="#introduction" role="doc-backlink">Introduction</a></h2> <p><a class="reference external" href="https://home.unicode.org/">Unicode</a> is a system for handling all kinds of written language. It aims to allow any character from any human language to be used. Python code may consist of almost all valid Unicode characters. While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.</p> <p>It is possible to misuse Python’s Unicode-related features to write code that <em>appears</em> to do something else than what it does. 
Evildoers could take advantage of this to trick code reviewers into accepting malicious code.</p> <p>The possible issues generally can’t be solved in Python itself without excessive restrictions of the language. They should be solved in code editors and review tools (such as <em>diff</em> displays), by enforcing project-specific policies, and by raising awareness of individual programmers.</p> <p>This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind.</p> <p>This document is specific to Python. For general security considerations in Unicode text and source code, see Unicode technical reports <a class="reference internal" href="#tr36" id="id1"><span>[tr36]</span></a>, <a class="reference internal" href="#tr39" id="id2"><span>[tr39]</span></a>, and <a class="reference internal" href="#tr55" id="id3"><span>[tr55]</span></a>.</p> </section> <section id="acknowledgement"> <h2><a class="toc-backref" href="#acknowledgement" role="doc-backlink">Acknowledgement</a></h2> <p>Investigation for this document was prompted by <a class="reference external" href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574">CVE-2021-42574</a>, <em>Trojan Source Attacks</em>, reported by Nicholas Boucher and Ross Anderson, which focuses on Bidirectional override characters and homoglyphs in a variety of programming languages.</p> </section> <section id="confusing-features"> <h2><a class="toc-backref" href="#confusing-features" role="doc-backlink">Confusing Features</a></h2> <p>This section lists some Unicode-related features that can be surprising or misusable.</p> <section id="ascii-only-considerations"> <h3><a class="toc-backref" href="#ascii-only-considerations" role="doc-backlink">ASCII-only Considerations</a></h3> <p>ASCII is a subset of Unicode, consisting of the most common symbols, numbers, Latin letters and control characters.</p> <p>While issues with the ASCII character set are generally well understood, 
they’re presented here to aid understanding of the non-ASCII cases.</p> <section id="confusables-and-typos"> <h4><a class="toc-backref" href="#confusables-and-typos" role="doc-backlink">Confusables and Typos</a></h4> <p>Some characters look alike. Before the age of computers, many mechanical typewriters lacked the keys for the digits <code class="docutils literal notranslate"><span class="pre">0</span></code> and <code class="docutils literal notranslate"><span class="pre">1</span></code>: users typed <code class="docutils literal notranslate"><span class="pre">O</span></code> (capital o) and <code class="docutils literal notranslate"><span class="pre">l</span></code> (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, the distinction between digits and letters is critical – and most fonts designed for programmers make it easy to tell them apart.</p> <p>Similarly, in fonts designed for human languages, the uppercase “I” and lowercase “l” can look similar. Likewise, the letters “rn” may be virtually indistinguishable from the single letter “m”. Again, programmers’ fonts make these pairs of <em>confusables</em> noticeably different.</p> <p>However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name <code class="docutils literal notranslate"><span class="pre">accessibi1ity_options</span></code> can still look indistinguishable from <code class="docutils literal notranslate"><span class="pre">accessibility_options</span></code>, even though they are distinct to the compiler. 
The same can be said for plain typos: most humans will not notice the typo in <code class="docutils literal notranslate"><span class="pre">responsbility_chain_delegate</span></code>.</p> </section> <section id="control-characters"> <h4><a class="toc-backref" href="#control-characters" role="doc-backlink">Control Characters</a></h4> <p>Python generally treats <code class="docutils literal notranslate"><span class="pre">CR</span></code> (<code class="docutils literal notranslate"><span class="pre">\r</span></code>), <code class="docutils literal notranslate"><span class="pre">LF</span></code> (<code class="docutils literal notranslate"><span class="pre">\n</span></code>), and <code class="docutils literal notranslate"><span class="pre">CR-LF</span></code> pairs (<code class="docutils literal notranslate"><span class="pre">\r\n</span></code>) as end-of-line sequences. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or nothing at all), rather than ending the line, displaying this example:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># Don&#39;t call this function:</span>
<span class="n">fire_the_missiles</span><span class="p">()</span>
</pre></div> </div> <p>as a harmless comment like:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># Don&#39;t call this function:⬛fire_the_missiles()</span>
</pre></div> </div> <p>CPython may treat the control character NUL (<code class="docutils literal notranslate"><span class="pre">\0</span></code>) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.</p> <p>Some characters can be used to hide/overwrite other characters when source is listed in common terminals. 
For example:</p> <ul class="simple"> <li>BS (<code class="docutils literal notranslate"><span class="pre">\b</span></code>, Backspace) moves the cursor back, so the character after it will overwrite the character before.</li> <li>CR (<code class="docutils literal notranslate"><span class="pre">\r</span></code>, carriage return) moves the cursor to the start of line, subsequent characters overwrite the start of the line.</li> <li>SUB (<code class="docutils literal notranslate"><span class="pre">\x1A</span></code>, Ctrl+Z) means “End of text” on Windows. Some programs (such as <code class="docutils literal notranslate"><span class="pre">type</span></code>) ignore the rest of the file after it.</li> <li>ESC (<code class="docutils literal notranslate"><span class="pre">\x1B</span></code>) commonly initiates escape codes which allow arbitrary control of the terminal.</li> </ul> </section> </section> <section id="confusable-characters-in-identifiers"> <h3><a class="toc-backref" href="#confusable-characters-in-identifiers" role="doc-backlink">Confusable Characters in Identifiers</a></h3> <p>Python is not limited to ASCII. It allows characters of all scripts – Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as variable names). See <a class="pep reference internal" href="../pep-3131/" title="PEP 3131 – Supporting Non-ASCII Identifiers">PEP 3131</a> for details and rationale. Only “letters and numbers” are allowed, so while <code class="docutils literal notranslate"><span class="pre">γάτα</span></code> is a valid Python identifier, <code class="docutils literal notranslate"><span class="pre">🐱</span></code> is not. (See <a class="reference external" href="https://docs.python.org/3/reference/lexical_analysis.html#identifiers">Identifiers and keywords</a> for details.)</p> <p>Non-printing control characters are also not allowed in identifiers.</p> <p>However, within the allowed set there is a large number of “confusables”. 
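</p>
<p>One way to tell such lookalikes apart is to print the Unicode name of each character. The following is a minimal sketch rather than a recommended tool: the helper name <code class="docutils literal notranslate"><span class="pre">describe</span></code> is hypothetical, and the Cyrillic letter is written as an escape so that the difference stays visible in the source:</p>

```python
import unicodedata


def describe(identifier):
    """Return the Unicode name of every character in *identifier*."""
    return [unicodedata.name(ch) for ch in identifier]


latin = "a"                                # LATIN SMALL LETTER A
cyrillic = "\N{CYRILLIC SMALL LETTER A}"   # usually drawn like the Latin letter

# Both strings are valid identifiers, yet they are different names to Python.
print(latin.isidentifier(), cyrillic.isidentifier())
print(latin == cyrillic)
print(describe(latin), describe(cyrillic))
```

<p>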
For example, the uppercase versions of the Latin <code class="docutils literal notranslate"><span class="pre">b</span></code>, Greek <code class="docutils literal notranslate"><span class="pre">β</span></code> (Beta), and Cyrillic <code class="docutils literal notranslate"><span class="pre">в</span></code> (Ve) often look identical: <code class="docutils literal notranslate"><span class="pre">B</span></code>, <code class="docutils literal notranslate"><span class="pre">Β</span></code> and <code class="docutils literal notranslate"><span class="pre">В</span></code>, respectively.</p> <p>This allows identifiers that look the same to humans, but not to Python. For example, all of the following are distinct identifiers:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">scope</span></code> (Latin, ASCII-only)</li> <li><code class="docutils literal notranslate"><span class="pre">scоpe</span></code> (with a Cyrillic <code class="docutils literal notranslate"><span class="pre">о</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">scοpe</span></code> (with a Greek <code class="docutils literal notranslate"><span class="pre">ο</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">ѕсоре</span></code> (all Cyrillic letters)</li> </ul> <p>Additionally, some letters can look like non-letters:</p> <ul class="simple"> <li>The letter for the Hawaiian <em>ʻokina</em> looks like an apostrophe; <code class="docutils literal notranslate"><span class="pre">ʻHelloʻ</span></code> is a Python identifier, not a string.</li> <li>The East Asian word for <em>ten</em> looks like a plus sign, so <code class="docutils literal notranslate"><span class="pre">十=</span> <span class="pre">10</span></code> is a complete Python statement. 
(The “十” is a word: “ten” rather than “10”.)</li> </ul> <div class="admonition note"> <p class="admonition-title">Note</p> <p>The converse also applies – some symbols look like letters – but since Python does not allow arbitrary symbols in identifiers, this is not an issue.</p> </div> </section> <section id="confusable-digits"> <h3><a class="toc-backref" href="#confusable-digits" role="doc-backlink">Confusable Digits</a></h3> <p>Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such as <code class="docutils literal notranslate"><span class="pre">.</span></code> or <code class="docutils literal notranslate"><span class="pre">e</span></code>).</p> <p>However, when numbers are converted from strings, such as in the <code class="docutils literal notranslate"><span class="pre">int</span></code> and <code class="docutils literal notranslate"><span class="pre">float</span></code> constructors or by the <code class="docutils literal notranslate"><span class="pre">str.format</span></code> method, any decimal digit can be used. For example <code class="docutils literal notranslate"><span class="pre">߅</span></code> (<code class="docutils literal notranslate"><span class="pre">NKO</span> <span class="pre">DIGIT</span> <span class="pre">FIVE</span></code>) or <code class="docutils literal notranslate"><span class="pre">௫</span></code> (<code class="docutils literal notranslate"><span class="pre">TAMIL</span> <span class="pre">DIGIT</span> <span class="pre">FIVE</span></code>) work as the digit <code class="docutils literal notranslate"><span class="pre">5</span></code>.</p> <p>Some scripts include digits that look similar to ASCII ones, but have a different value. 
For example:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">int</span><span class="p">(</span><span class="s1">&#39;৪୨&#39;</span><span class="p">)</span>
<span class="go">42</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;</span><span class="si">{٥}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;zero&#39;</span><span class="p">,</span> <span class="s1">&#39;one&#39;</span><span class="p">,</span> <span class="s1">&#39;two&#39;</span><span class="p">,</span> <span class="s1">&#39;three&#39;</span><span class="p">,</span> <span class="s1">&#39;four&#39;</span><span class="p">,</span> <span class="s1">&#39;five&#39;</span><span class="p">)</span>
<span class="go">five</span>
</pre></div> </div> </section> <section id="bidirectional-text"> <h3><a class="toc-backref" href="#bidirectional-text" role="doc-backlink">Bidirectional Text</a></h3> <p>Some scripts, such as Hebrew or Arabic, are written right-to-left. 
Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren’t familiar with these writing systems and their computer representation.</p> <p>The exact process is complicated, and explained in Unicode Standard Annex #9, <a class="reference external" href="http://www.unicode.org/reports/tr9/">Unicode Bidirectional Algorithm</a>.</p> <p>Consider the following code, which assigns a 100-character string to the variable <code class="docutils literal notranslate"><span class="pre">s</span></code>:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">s</span> <span class="o">=</span> <span class="s2">&quot;X&quot;</span> <span class="o">*</span> <span class="mi">100</span> <span class="c1"># &quot;X&quot; is assigned</span> </pre></div> </div> <p>When the <code class="docutils literal notranslate"><span class="pre">X</span></code> is replaced by the Hebrew letter <code class="docutils literal notranslate"><span class="pre">א</span></code>, the line becomes:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">s</span> <span class="o">=</span> <span class="s2">&quot;א&quot;</span> <span class="o">*</span> <span class="mi">100</span> <span class="c1"># &quot;א&quot; is assigned</span> </pre></div> </div> <p>This command still assigns a 100-character string to <code class="docutils literal notranslate"><span class="pre">s</span></code>, but when displayed as general text following the Bidirectional Algorithm (e.g. 
in a browser), it appears as <code class="docutils literal notranslate"><span class="pre">s</span> <span class="pre">=</span> <span class="pre">&quot;א&quot;</span></code> followed by a comment.</p> <p>Other surprising examples include:</p> <ul class="simple"> <li>In the statement <code class="docutils literal notranslate"><span class="pre">ערך</span> <span class="pre">=</span> <span class="pre">23</span></code>, the variable <code class="docutils literal notranslate"><span class="pre">ערך</span></code> is set to the integer 23.</li> <li>In the statement <code class="docutils literal notranslate"><span class="pre">قيمة</span> <span class="pre">=</span> <span class="pre">ערך</span></code>, the variable <code class="docutils literal notranslate"><span class="pre">قيمة</span></code> is set to the value of <code class="docutils literal notranslate"><span class="pre">ערך</span></code>.</li> <li>In the statement <code class="docutils literal notranslate"><span class="pre">قيمة</span> <span class="pre">-</span> <span class="pre">(ערך</span> <span class="pre">**</span> <span class="pre">2)</span></code>, the value of <code class="docutils literal notranslate"><span class="pre">ערך</span></code> is squared and then subtracted from <code class="docutils literal notranslate"><span class="pre">قيمة</span></code>. The <em>opening</em> parenthesis is displayed as <code class="docutils literal notranslate"><span class="pre">)</span></code>.</li> </ul> </section> <section id="bidirectional-marks-embeddings-overrides-and-isolates"> <h3><a class="toc-backref" href="#bidirectional-marks-embeddings-overrides-and-isolates" role="doc-backlink">Bidirectional Marks, Embeddings, Overrides and Isolates</a></h3> <p>Default reordering rules do not always yield the intended direction of text, so Unicode provides several ways to alter it.</p> <p>The most basic are <strong>directional marks</strong>, which are invisible but affect text as a left-to-right (or right-to-left) character would. 
Continuing with the <code class="docutils literal notranslate"><span class="pre">s</span> <span class="pre">=</span> <span class="pre">&quot;X&quot;</span></code> example above, in the next example the <code class="docutils literal notranslate"><span class="pre">X</span></code> is replaced by the Latin <code class="docutils literal notranslate"><span class="pre">x</span></code> followed or preceded by a right-to-left mark (<code class="docutils literal notranslate"><span class="pre">U+200F</span></code>). This assigns a 200-character string to <code class="docutils literal notranslate"><span class="pre">s</span></code> (100 copies of <code class="docutils literal notranslate"><span class="pre">x</span></code> interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as <code class="docutils literal notranslate"><span class="pre">s</span> <span class="pre">=</span> <span class="pre">&quot;x&quot;</span></code> followed by an ASCII-only comment:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">s</span> <span class="o">=</span> <span class="s2">&quot;x‏&quot;</span> <span class="o">*</span> <span class="mi">100</span> <span class="c1"># &quot;‏x&quot; is assigned</span> </pre></div> </div> <p>The directional <strong>embedding</strong>, <strong>override</strong> and <strong>isolate</strong> characters are also invisible, but affect the ordering of all text after them until either ended by a dedicated character, or until the end of line. (Unicode specifies the effect to last until the end of a “paragraph” (see <a class="reference external" href="http://www.unicode.org/reports/tr9/">Unicode Bidirectional Algorithm</a>), but allows tools to interpret newline characters as paragraph ends (see Unicode <a class="reference external" href="http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213">Newline Guidelines</a>). 
Most code editors and terminals do so.)</p> <p>These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python’s comments always extend to the end of a line), but it doesn’t render them harmless.</p> </section> <section id="normalizing-identifiers"> <h3><a class="toc-backref" href="#normalizing-identifiers" role="doc-backlink">Normalizing identifiers</a></h3> <p>Python strings are collections of <em>Unicode codepoints</em>, not “characters”.</p> <p>For reasons like compatibility with earlier encodings, Unicode often has several ways to encode what is essentially a single “character”. For example, all these are different ways of writing <code class="docutils literal notranslate"><span class="pre">Å</span></code> as a Python string, each of which is unequal to the others:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">&quot;\N{LATIN</span> <span class="pre">CAPITAL</span> <span class="pre">LETTER</span> <span class="pre">A</span> <span class="pre">WITH</span> <span class="pre">RING</span> <span class="pre">ABOVE}&quot;</span></code> (1 codepoint)</li> <li><code class="docutils literal notranslate"><span class="pre">&quot;\N{LATIN</span> <span class="pre">CAPITAL</span> <span class="pre">LETTER</span> <span class="pre">A}\N{COMBINING</span> <span class="pre">RING</span> <span class="pre">ABOVE}&quot;</span></code> (2 codepoints)</li> <li><code class="docutils literal notranslate"><span class="pre">&quot;\N{ANGSTROM</span> <span class="pre">SIGN}&quot;</span></code> (1 codepoint, but different)</li> </ul> <p>For another example, the ligature <code class="docutils literal notranslate"><span class="pre">ﬁ</span></code> has a dedicated Unicode codepoint, even though it has the same meaning as the two letters <code class="docutils literal notranslate"><span 
class="pre">fi</span></code>.</p> <p>Also, common letters frequently have several distinct variations. Unicode provides them for contexts where the difference has some semantic meaning, like mathematics. For example, some variations of <code class="docutils literal notranslate"><span class="pre">n</span></code> are:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">n</span></code> (LATIN SMALL LETTER N)</li> <li><code class="docutils literal notranslate"><span class="pre">𝐧</span></code> (MATHEMATICAL BOLD SMALL N)</li> <li><code class="docutils literal notranslate"><span class="pre">𝘯</span></code> (MATHEMATICAL SANS-SERIF ITALIC SMALL N)</li> <li><code class="docutils literal notranslate"><span class="pre">ｎ</span></code> (FULLWIDTH LATIN SMALL LETTER N)</li> <li><code class="docutils literal notranslate"><span class="pre">ⁿ</span></code> (SUPERSCRIPT LATIN SMALL LETTER N)</li> </ul> <p>Unicode includes algorithms to <em>normalize</em> variants like these to a single form, and Python identifiers are normalized. 
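</p>
<p>The effect of normalization can be reproduced with the standard <code class="docutils literal notranslate"><span class="pre">unicodedata</span></code> module. This is only an illustrative sketch: it applies the NFKC form to plain strings, which approximates what happens to identifiers but is not the parser's own code path:</p>

```python
import unicodedata

variants = [
    "n",
    "\N{MATHEMATICAL BOLD SMALL N}",
    "\N{MATHEMATICAL SANS-SERIF ITALIC SMALL N}",
    "\N{FULLWIDTH LATIN SMALL LETTER N}",
    "\N{SUPERSCRIPT LATIN SMALL LETTER N}",
]

# As strings, the variants are all different from one another ...
print(len(set(variants)))

# ... but NFKC maps every one of them to the plain Latin letter.
normalized = {unicodedata.normalize("NFKC", v) for v in variants}
print(normalized)

# The ligature and the three spellings of A-with-ring collapse the same way.
ligature = unicodedata.normalize("NFKC", "\N{LATIN SMALL LIGATURE FI}")
a_ring = {
    unicodedata.normalize("NFKC", s)
    for s in (
        "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}",
        "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}",
        "\N{ANGSTROM SIGN}",
    )
}
print(ligature, len(a_ring))
```

<p>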
(There are several normal forms; Python uses <code class="docutils literal notranslate"><span class="pre">NFKC</span></code>.)</p> <p>For example, <code class="docutils literal notranslate"><span class="pre">xn</span></code> and <code class="docutils literal notranslate"><span class="pre">xⁿ</span></code> are the same identifier in Python:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">xⁿ</span> <span class="o">=</span> <span class="mi">8</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">xn</span>
<span class="go">8</span>
</pre></div> </div> <p>… as are <code class="docutils literal notranslate"><span class="pre">ﬁ</span></code> and <code class="docutils literal notranslate"><span class="pre">fi</span></code>, and as are the different ways to encode <code class="docutils literal notranslate"><span class="pre">Å</span></code>.</p> <p>This normalization applies <em>only</em> to identifiers, however. Functions that treat strings as identifiers, such as <code class="docutils literal notranslate"><span class="pre">getattr</span></code>, do not perform normalization:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">class</span><span class="w"> </span><span class="nc">Test</span><span class="p">:</span>
<span class="gp">... </span>    <span class="k">def</span><span class="w"> </span><span class="nf">finalize</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="gp">... </span>        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;OK&#39;</span><span class="p">)</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Test</span><span class="p">()</span><span class="o">.</span><span class="n">finalize</span><span class="p">()</span>
<span class="go">OK</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Test</span><span class="p">()</span><span class="o">.</span><span class="n">ﬁnalize</span><span class="p">()</span>
<span class="go">OK</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">getattr</span><span class="p">(</span><span class="n">Test</span><span class="p">(),</span> <span class="s1">&#39;ﬁnalize&#39;</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="w">  </span><span class="c">...</span>
<span class="gr">AttributeError</span>: <span class="n">&#39;Test&#39; object has no attribute &#39;ﬁnalize&#39;</span>
</pre></div> </div> <p>This also applies when importing:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">import</span> <span class="pre">ﬁnalization</span></code> performs normalization, and looks for a file named <code class="docutils literal notranslate"><span class="pre">finalization.py</span></code> (and other <code class="docutils literal notranslate"><span class="pre">finalization.*</span></code> files).</li> <li><code class="docutils literal notranslate"><span class="pre">importlib.import_module(&quot;ﬁnalization&quot;)</span></code> does not normalize, so it looks for a file named <code class="docutils literal notranslate"><span class="pre">ﬁnalization.py</span></code>.</li> </ul> <p>Some filesystems independently apply normalization and/or case folding. 
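</p>
<p>What those two operations do to a name can be sketched with plain strings. This is a hypothetical illustration only: which normalization form or folding a given filesystem applies, if any, varies between systems, and NFKC together with <code class="docutils literal notranslate"><span class="pre">str.casefold</span></code> is used here merely as a stand-in. The helper name <code class="docutils literal notranslate"><span class="pre">fold</span></code> is likewise an assumption:</p>

```python
import unicodedata

names = [
    "\N{LATIN SMALL LIGATURE FI}nalization.py",  # starts with the ligature
    "finalization.py",
    "FINALIZATION.py",
]

# Compared code point by code point, these are three different names.
print(len(set(names)))


def fold(name):
    """Stand-in for a filesystem that normalizes names and ignores case."""
    return unicodedata.normalize("NFKC", name).casefold()


# After normalization plus case folding, all three collide.
folded = {fold(name) for name in names}
print(folded)
```

<p>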
On some systems, <code class="docutils literal notranslate"><span class="pre">finalization.py</span></code>, <code class="docutils literal notranslate"><span class="pre">ﬁnalization.py</span></code> and <code class="docutils literal notranslate"><span class="pre">FINALIZATION.py</span></code> are three distinct filenames; on others, some or all of these name the same file.</p> </section> <section id="source-encoding"> <h3><a class="toc-backref" href="#source-encoding" role="doc-backlink">Source Encoding</a></h3> <p>The encoding of Python source files is given by a specific regex on the first two lines of a file, as per <a class="reference external" href="https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations">Encoding declarations</a>. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.</p> <p>This can be misused in combination with Python-specific special-purpose encodings (see <a class="reference external" href="https://docs.python.org/3/library/codecs.html#text-encodings">Text Encodings</a>). For example, with <code class="docutils literal notranslate"><span class="pre">encoding:</span> <span class="pre">unicode_escape</span></code>, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. 
Consider the following file:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># For writing Japanese, you don&#39;t need an editor that supports</span>
<span class="c1"># UTF-8 source encoding: unicode_escape sequences work just as well.</span>

<span class="kn">import</span><span class="w"> </span><span class="nn">os</span>

<span class="n">message</span> <span class="o">=</span> <span class="s1">&#39;&#39;&#39;</span>
<span class="s1">This is &quot;Hello World&quot; in Japanese:</span>
<span class="se">\u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c</span>

<span class="s1">This runs `echo WHOA` in your shell:</span>
<span class="se">\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e</span>
<span class="se">\u0073\u0079\u0073\u0074\u0065\u006d\u0028</span>
<span class="se">\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027</span>
<span class="se">\u0029\u0029\u002c\u0027\u0027\u0027</span>
<span class="s1">&#39;&#39;&#39;</span>
</pre></div> </div> <p>Here, <code class="docutils literal notranslate"><span class="pre">encoding:</span> <span class="pre">unicode_escape</span></code> in the initial comment is an encoding declaration. 
The <code class="docutils literal notranslate"><span class="pre">unicode_escape</span></code> encoding instructs Python to treat <code class="docutils literal notranslate"><span class="pre">\u0027</span></code> as a single quote (which can start/end a string), <code class="docutils literal notranslate"><span class="pre">\u002c</span></code> as a comma (punctuator), etc. As a result, the decoded source closes the triple-quoted string early and evaluates <code class="docutils literal notranslate"><span class="pre">os.system(&#39;echo</span> <span class="pre">WHOA&#39;)</span></code> as ordinary code, even though most tools display that text as string content.</p> </section> </section> <section id="open-issues"> <h2><a class="toc-backref" href="#open-issues" role="doc-backlink">Open Issues</a></h2> <p>We should probably write and publish:</p> <ul class="simple"> <li>Recommendations for Text Editors and Code Tools</li> <li>Recommendations for Programmers and Teams</li> <li>Possible Improvements in Python</li> </ul> </section> <section id="references"> <h2><a class="toc-backref" href="#references" role="doc-backlink">References</a></h2> <div role="list" class="citation-list"> <div class="citation" id="tr36" role="doc-biblioentry"> <dt class="label" id="tr36">[<a href="#id1">tr36</a>]</dt> <dd>Unicode Technical Report #36: Unicode Security Considerations <a class="reference external" href="http://www.unicode.org/reports/tr36/">http://www.unicode.org/reports/tr36/</a></div> <div class="citation" id="tr39" role="doc-biblioentry"> <dt class="label" id="tr39">[<a href="#id2">tr39</a>]</dt> <dd>Unicode® Technical Standard #39: Unicode Security Mechanisms <a class="reference external" href="http://www.unicode.org/reports/tr39/">http://www.unicode.org/reports/tr39/</a></div> <div class="citation" id="tr55" role="doc-biblioentry"> <dt class="label" id="tr55">[<a href="#id3">tr55</a>]</dt> <dd>Unicode Technical Report #55: Unicode Source Code Handling <a class="reference external" href="http://www.unicode.org/reports/tr55/">http://www.unicode.org/reports/tr55/</a></div> </div> </section> <section id="copyright"> <h2><a class="toc-backref" href="#copyright" role="doc-backlink">Copyright</a></h2> <p>This document is placed in the public domain or under the 
CC0-1.0-Universal license, whichever is more permissive.</p> </section> </section> <hr class="docutils" /> <p>Source: <a class="reference external" href="https://github.com/python/peps/blob/main/peps/pep-0672.rst">https://github.com/python/peps/blob/main/peps/pep-0672.rst</a></p> <p>Last modified: <a class="reference external" href="https://github.com/python/peps/commits/main/peps/pep-0672.rst">2024-11-26 10:14:21 GMT</a></p> </article> <nav id="pep-sidebar"> <h2>Contents</h2> <ul> <li><a class="reference internal" href="#abstract">Abstract</a></li> <li><a class="reference internal" href="#introduction">Introduction</a></li> <li><a class="reference internal" href="#acknowledgement">Acknowledgement</a></li> <li><a class="reference internal" href="#confusing-features">Confusing Features</a><ul> <li><a class="reference internal" href="#ascii-only-considerations">ASCII-only Considerations</a><ul> <li><a class="reference internal" href="#confusables-and-typos">Confusables and Typos</a></li> <li><a class="reference internal" href="#control-characters">Control Characters</a></li> </ul> </li> <li><a class="reference internal" href="#confusable-characters-in-identifiers">Confusable Characters in Identifiers</a></li> <li><a class="reference internal" href="#confusable-digits">Confusable Digits</a></li> <li><a class="reference internal" href="#bidirectional-text">Bidirectional Text</a></li> <li><a class="reference internal" href="#bidirectional-marks-embeddings-overrides-and-isolates">Bidirectional Marks, Embeddings, Overrides and Isolates</a></li> <li><a class="reference internal" href="#normalizing-identifiers">Normalizing identifiers</a></li> <li><a class="reference internal" href="#source-encoding">Source Encoding</a></li> </ul> </li> <li><a class="reference internal" href="#open-issues">Open Issues</a></li> <li><a class="reference internal" href="#references">References</a></li> <li><a class="reference internal" href="#copyright">Copyright</a></li> </ul> <br> <a 
id="source" href="https://github.com/python/peps/blob/main/peps/pep-0672.rst">Page Source (GitHub)</a> </nav> </section> <script src="../_static/colour_scheme.js"></script> <script src="../_static/wrap_tables.js"></script> <script src="../_static/sticky_banner.js"></script> </body> </html>
