<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="color-scheme" content="light dark"> <title>PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions | peps.python.org</title> <link rel="shortcut icon" href="../_static/py.png"> <link rel="canonical" href="https://peps.python.org/pep-0756/"> <link rel="stylesheet" href="../_static/style.css" type="text/css"> <link rel="stylesheet" href="../_static/mq.css" type="text/css"> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" media="(prefers-color-scheme: light)" id="pyg-light"> <link rel="stylesheet" href="../_static/pygments_dark.css" type="text/css" media="(prefers-color-scheme: dark)" id="pyg-dark"> <link rel="alternate" type="application/rss+xml" title="Latest PEPs" href="https://peps.python.org/peps.rss"> <meta property="og:title" content='PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions | peps.python.org'> <meta property="og:description" content="Add functions to the limited C API version 3.14:"> <meta property="og:type" content="website"> <meta property="og:url" content="https://peps.python.org/pep-0756/"> <meta property="og:site_name" content="Python Enhancement Proposals (PEPs)"> <meta property="og:image" content="https://peps.python.org/_static/og-image.png"> <meta property="og:image:alt" content="Python PEPs"> <meta property="og:image:width" content="200"> <meta property="og:image:height" content="200"> <meta name="description" content="Add functions to the limited C API version 3.14:"> <meta name="theme-color" content="#3776ab"> </head> <body> <svg xmlns="http://www.w3.org/2000/svg" style="display: none;"> <symbol id="svg-sun-half" viewBox="0 0 24 24" pointer-events="all"> <title>Following system colour scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" 
stroke-linejoin="round"> <circle cx="12" cy="12" r="9"></circle> <path d="M12 3v18m0-12l4.65-4.65M12 14.3l7.37-7.37M12 19.6l8.85-8.85"></path> </svg> </symbol> <symbol id="svg-moon" viewBox="0 0 24 24" pointer-events="all"> <title>Selected dark colour scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"> <path stroke="none" d="M0 0h24v24H0z" fill="none"></path> <path d="M12 3c.132 0 .263 0 .393 0a7.5 7.5 0 0 0 7.92 12.446a9 9 0 1 1 -8.313 -12.454z"></path> </svg> </symbol> <symbol id="svg-sun" viewBox="0 0 24 24" pointer-events="all"> <title>Selected light colour scheme</title> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"> <circle cx="12" cy="12" r="5"></circle> <line x1="12" y1="1" x2="12" y2="3"></line> <line x1="12" y1="21" x2="12" y2="23"></line> <line x1="4.22" y1="4.22" x2="5.64" y2="5.64"></line> <line x1="18.36" y1="18.36" x2="19.78" y2="19.78"></line> <line x1="1" y1="12" x2="3" y2="12"></line> <line x1="21" y1="12" x2="23" y2="12"></line> <line x1="4.22" y1="19.78" x2="5.64" y2="18.36"></line> <line x1="18.36" y1="5.64" x2="19.78" y2="4.22"></line> </svg> </symbol> </svg> <script> document.documentElement.dataset.colour_scheme = localStorage.getItem("colour_scheme") || "auto" </script> <section id="pep-page-section"> <header> <h1>Python Enhancement Proposals</h1> <ul class="breadcrumbs"> <li><a href="https://www.python.org/" title="The Python Programming Language">Python</a> » </li> <li><a href="../pep-0000/">PEP Index</a> » </li> <li>PEP 756</li> </ul> <button id="colour-scheme-cycler" onClick="setColourScheme(nextColourScheme())"> <svg aria-hidden="true" class="colour-scheme-icon-when-auto"><use href="#svg-sun-half"></use></svg> <svg aria-hidden="true" class="colour-scheme-icon-when-dark"><use href="#svg-moon"></use></svg> 
<svg aria-hidden="true" class="colour-scheme-icon-when-light"><use href="#svg-sun"></use></svg> <span class="visually-hidden">Toggle light / dark / auto colour theme</span> </button> </header> <article> <section id="pep-content"> <h1 class="page-title">PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions</h1> <dl class="rfc2822 field-list simple"> <dt class="field-odd">Author<span class="colon">:</span></dt> <dd class="field-odd">Victor Stinner &lt;vstinner at python.org&gt;</dd> <dt class="field-even">PEP-Delegate<span class="colon">:</span></dt> <dd class="field-even">C API Working Group</dd> <dt class="field-odd">Discussions-To<span class="colon">:</span></dt> <dd class="field-odd"><a class="reference external" href="https://discuss.python.org/t/63891">Discourse thread</a></dd> <dt class="field-even">Status<span class="colon">:</span></dt> <dd class="field-even"><abbr title="Removed from consideration by sponsor or authors">Withdrawn</abbr></dd> <dt class="field-odd">Type<span class="colon">:</span></dt> <dd class="field-odd"><abbr title="Normative PEP with a new feature for Python, implementation change for CPython or interoperability standard for the ecosystem">Standards Track</abbr></dd> <dt class="field-even">Created<span class="colon">:</span></dt> <dd class="field-even">13-Sep-2024</dd> <dt class="field-odd">Python-Version<span class="colon">:</span></dt> <dd class="field-odd">3.14</dd> <dt class="field-even">Post-History<span class="colon">:</span></dt> <dd class="field-even"><a class="reference external" href="https://discuss.python.org/t/63891" title="Discourse thread">14-Sep-2024</a></dd> <dt class="field-odd">Resolution<span class="colon">:</span></dt> <dd class="field-odd"><a class="reference external" href="https://discuss.python.org/t/63891/62">29-Oct-2024</a></dd> </dl> <hr class="docutils" /> <section id="contents"> <details><summary>Table of Contents</summary><ul class="simple"> <li><a class="reference internal" 
href="#abstract">Abstract</a></li> <li><a class="reference internal" href="#rationale">Rationale</a><ul> <li><a class="reference internal" href="#pep-393">PEP 393</a></li> <li><a class="reference internal" href="#limited-c-api">Limited C API</a></li> </ul> </li> <li><a class="reference internal" href="#specification">Specification</a><ul> <li><a class="reference internal" href="#api">API</a></li> <li><a class="reference internal" href="#pyunicode-export">PyUnicode_Export()</a></li> <li><a class="reference internal" href="#export-complexity">Export complexity</a></li> <li><a class="reference internal" href="#py-buffer-format-and-item-size">Py_buffer format and item size</a></li> <li><a class="reference internal" href="#pyunicode-import">PyUnicode_Import()</a></li> <li><a class="reference internal" href="#utf-8-format">UTF-8 format</a></li> <li><a class="reference internal" href="#ascii-format">ASCII format</a></li> <li><a class="reference internal" href="#surrogate-characters-and-embedded-nul-characters">Surrogate characters and embedded NUL characters</a></li> </ul> </li> <li><a class="reference internal" href="#implementation">Implementation</a></li> <li><a class="reference internal" href="#backwards-compatibility">Backwards Compatibility</a></li> <li><a class="reference internal" href="#usage-of-pep-393-c-apis">Usage of PEP 393 C APIs</a><ul> <li><a class="reference internal" href="#pyunicode-fromkindanddata">PyUnicode_FromKindAndData()</a></li> <li><a class="reference internal" href="#pyunicode-4byte-data">PyUnicode_4BYTE_DATA()</a></li> </ul> </li> <li><a class="reference internal" href="#rejected-ideas">Rejected Ideas</a><ul> <li><a class="reference internal" href="#reject-embedded-nul-characters-and-require-trailing-nul-character">Reject embedded NUL characters and require trailing NUL character</a></li> <li><a class="reference internal" href="#reject-surrogate-characters">Reject surrogate characters</a></li> <li><a class="reference internal" 
href="#conversions-on-demand">Conversions on demand</a></li> <li><a class="reference internal" href="#export-to-utf-8">Export to UTF-8</a></li> </ul> </li> <li><a class="reference internal" href="#discussions">Discussions</a></li> <li><a class="reference internal" href="#copyright">Copyright</a></li> </ul> </details></section> <section id="abstract"> <h2><a class="toc-backref" href="#abstract" role="doc-backlink">Abstract</a></h2> <p>Add functions to the limited C API version 3.14:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_Export()</span></code>: export a Python str object as a <code class="docutils literal notranslate"><span class="pre">Py_buffer</span></code> view.</li> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_Import()</span></code>: import a Python str object.</li> </ul> <p>On CPython, <code class="docutils literal notranslate"><span class="pre">PyUnicode_Export()</span></code> has an <em>O</em>(1) complexity: no memory is copied and no conversion is done.</p> </section> <section id="rationale"> <h2><a class="toc-backref" href="#rationale" role="doc-backlink">Rationale</a></h2> <section id="pep-393"> <h3><a class="toc-backref" href="#pep-393" role="doc-backlink">PEP 393</a></h3> <p><a class="pep reference internal" href="../pep-0393/" title="PEP 393 – Flexible String Representation">PEP 393</a> “Flexible String Representation” changed string internals in Python 3.3 to use three formats:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_1BYTE_KIND</span></code>: Unicode range [U+0000; U+00ff], UCS-1, 1 byte/character.</li> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_2BYTE_KIND</span></code>: Unicode range [U+0000; U+ffff], UCS-2, 2 bytes/character.</li> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_4BYTE_KIND</span></code>: Unicode range [U+0000; U+10ffff], UCS-4, 4 
bytes/character.</li> </ul> <p>A Python <code class="docutils literal notranslate"><span class="pre">str</span></code> object must always use the most compact format. For example, a string which only contains ASCII characters must use the UCS-1 format.</p> <p>The <code class="docutils literal notranslate"><span class="pre">PyUnicode_KIND()</span></code> function can be used to know the format used by a string.</p> <p>One of the following functions can be used to access data:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_1BYTE_DATA()</span></code> for <code class="docutils literal notranslate"><span class="pre">PyUnicode_1BYTE_KIND</span></code>.</li> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_2BYTE_DATA()</span></code> for <code class="docutils literal notranslate"><span class="pre">PyUnicode_2BYTE_KIND</span></code>.</li> <li><code class="docutils literal notranslate"><span class="pre">PyUnicode_4BYTE_DATA()</span></code> for <code class="docutils literal notranslate"><span class="pre">PyUnicode_4BYTE_KIND</span></code>.</li> </ul> <p>To get the best performance, a C extension should have 3 code paths for each of these 3 string native formats.</p> </section> <section id="limited-c-api"> <h3><a class="toc-backref" href="#limited-c-api" role="doc-backlink">Limited C API</a></h3> <p><a class="pep reference internal" href="../pep-0393/" title="PEP 393 – Flexible String Representation">PEP 393</a> functions such as <code class="docutils literal notranslate"><span class="pre">PyUnicode_KIND()</span></code> and <code class="docutils literal notranslate"><span class="pre">PyUnicode_1BYTE_DATA()</span></code> are excluded from the limited C API. It’s not possible to write code specialized for UCS formats. 
A C extension using the limited C API can only use less efficient code paths and string formats.</p> <p>For example, the <a class="reference external" href="https://markupsafe.palletsprojects.com/">MarkupSafe project</a> has a C extension specialized for UCS formats for best performance, and so cannot use the limited C API.</p> </section> </section> <section id="specification"> <h2><a class="toc-backref" href="#specification" role="doc-backlink">Specification</a></h2> <section id="api"> <h3><a class="toc-backref" href="#api" role="doc-backlink">API</a></h3> <p>Add the following API to the limited C API version 3.14:</p> <div class="highlight-c notranslate"><div class="highlight"><pre><span></span><span class="kt">int32_t</span><span class="w"> </span><span class="nf">PyUnicode_Export</span><span class="p">(</span> <span class="w"> </span><span class="n">PyObject</span><span class="w"> </span><span class="o">*</span><span class="n">unicode</span><span class="p">,</span> <span class="w"> </span><span class="kt">int32_t</span><span class="w"> </span><span class="n">requested_formats</span><span class="p">,</span> <span class="w"> </span><span class="n">Py_buffer</span><span class="w"> </span><span class="o">*</span><span class="n">view</span><span class="p">);</span> <span class="n">PyObject</span><span class="o">*</span><span class="w"> </span><span class="nf">PyUnicode_Import</span><span class="p">(</span> <span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="w"> </span><span class="n">Py_ssize_t</span><span class="w"> </span><span class="n">nbytes</span><span class="p">,</span> <span class="w"> </span><span class="kt">int32_t</span><span class="w"> </span><span class="n">format</span><span class="p">);</span> <span class="cp">#define PyUnicode_FORMAT_UCS1 0x01 </span><span class="c1">// Py_UCS1*</span> 
<span class="cp">#define PyUnicode_FORMAT_UCS2 0x02 </span><span class="c1">// Py_UCS2*</span> <span class="cp">#define PyUnicode_FORMAT_UCS4 0x04 </span><span class="c1">// Py_UCS4*</span> <span class="cp">#define PyUnicode_FORMAT_UTF8 0x08 </span><span class="c1">// char*</span> <span class="cp">#define PyUnicode_FORMAT_ASCII 0x10 </span><span class="c1">// char* (ASCII string)</span> </pre></div> </div> <p>The <code class="docutils literal notranslate"><span class="pre">int32_t</span></code> type is used instead of <code class="docutils literal notranslate"><span class="pre">int</span></code> to have a well defined type size and not depend on the platform or the compiler. See <a class="reference external" href="https://github.com/capi-workgroup/api-evolution/issues/10">Avoid C-specific Types</a> for the longer rationale.</p> </section> <section id="pyunicode-export"> <h3><a class="toc-backref" href="#pyunicode-export" role="doc-backlink">PyUnicode_Export()</a></h3> <p>API:</p> <div class="highlight-c notranslate"><div class="highlight"><pre><span></span><span class="kt">int32_t</span><span class="w"> </span><span class="n">PyUnicode_Export</span><span class="p">(</span> <span class="w"> </span><span class="n">PyObject</span><span class="w"> </span><span class="o">*</span><span class="n">unicode</span><span class="p">,</span> <span class="w"> </span><span class="kt">int32_t</span><span class="w"> </span><span class="n">requested_formats</span><span class="p">,</span> <span class="w"> </span><span class="n">Py_buffer</span><span class="w"> </span><span class="o">*</span><span class="n">view</span><span class="p">)</span> </pre></div> </div> <p>Export the contents of the <em>unicode</em> string in one of the <em>requested_formats</em>.</p> <ul class="simple"> <li>On success, fill <em>view</em>, and return a format (greater than <code class="docutils literal notranslate"><span class="pre">0</span></code>).</li> <li>On error, set an exception, and return <code 
class="docutils literal notranslate"><span class="pre">-1</span></code>. <em>view</em> is left unchanged.</li> </ul> <p>After a successful call to <code class="docutils literal notranslate"><span class="pre">PyUnicode_Export()</span></code>, the <em>view</em> buffer must be released by <code class="docutils literal notranslate"><span class="pre">PyBuffer_Release()</span></code>. The contents of the buffer are valid until they are released.</p> <p>The buffer is read-only and must not be modified.</p> <p>The <code class="docutils literal notranslate"><span class="pre">view->len</span></code> member must be used to get the string length. The buffer should end with a trailing NUL character, but it’s not recommended to rely on that because of embedded NUL characters.</p> <p><em>unicode</em> and <em>view</em> must not be NULL.</p> <p>Available formats:</p> <table class="docutils align-default"> <thead> <tr class="row-odd"><th class="head">Constant Identifier</th> <th class="head">Value</th> <th class="head">Description</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS1</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">0x01</span></code></td> <td>UCS-1 string (<code class="docutils literal notranslate"><span class="pre">Py_UCS1*</span></code>)</td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS2</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">0x02</span></code></td> <td>UCS-2 string (<code class="docutils literal notranslate"><span class="pre">Py_UCS2*</span></code>)</td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS4</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">0x04</span></code></td> <td>UCS-4 string (<code class="docutils literal notranslate"><span 
class="pre">Py_UCS4*</span></code>)</td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UTF8</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">0x08</span></code></td> <td>UTF-8 string (<code class="docutils literal notranslate"><span class="pre">char*</span></code>)</td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_ASCII</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">0x10</span></code></td> <td>ASCII string (<code class="docutils literal notranslate"><span class="pre">Py_UCS1*</span></code>)</td> </tr> </tbody> </table> <p>UCS-2 and UCS-4 use the native byte order.</p> <p><em>requested_formats</em> can be a single format or a bitwise combination of the formats in the table above. On success, the returned format will be set to a single one of the requested formats.</p> <p>Note that future versions of Python may introduce additional formats.</p> <p>No memory is copied and no conversion is done.</p> </section> <section id="export-complexity"> <span id="id1"></span><h3><a class="toc-backref" href="#export-complexity" role="doc-backlink">Export complexity</a></h3> <p>On CPython, an export has a complexity of <em>O</em>(1): no memory is copied and no conversion is done.</p> <p>To get the best performance on CPython and PyPy, it’s recommended to support these 4 formats:</p> <div class="highlight-c notranslate"><div class="highlight"><pre><span></span><span class="p">(</span><span class="n">PyUnicode_FORMAT_UCS1</span><span class="w"> </span>\ <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">PyUnicode_FORMAT_UCS2</span><span class="w"> </span>\ <span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">PyUnicode_FORMAT_UCS4</span><span class="w"> </span>\ <span class="w"> </span><span class="o">|</span><span 
class="w"> </span><span class="n">PyUnicode_FORMAT_UTF8</span><span class="p">)</span> </pre></div> </div> <p>PyPy uses UTF-8 natively and so the <code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UTF8</span></code> format is recommended. It requires a memory copy, since PyPy <code class="docutils literal notranslate"><span class="pre">str</span></code> objects can be moved in memory (PyPy uses a moving garbage collector).</p> </section> <section id="py-buffer-format-and-item-size"> <h3><a class="toc-backref" href="#py-buffer-format-and-item-size" role="doc-backlink">Py_buffer format and item size</a></h3> <p><code class="docutils literal notranslate"><span class="pre">Py_buffer</span></code> uses the following format and item size depending on the export format:</p> <table class="docutils align-default"> <thead> <tr class="row-odd"><th class="head">Export format</th> <th class="head">Buffer format</th> <th class="head">Item size</th> </tr> </thead> <tbody> <tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS1</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">"B"</span></code></td> <td>1 byte</td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS2</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">"=H"</span></code></td> <td>2 bytes</td> </tr> <tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS4</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">"=I"</span></code></td> <td>4 bytes</td> </tr> <tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UTF8</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">"B"</span></code></td> <td>1 byte</td> </tr> <tr class="row-even"><td><code class="docutils 
literal notranslate"><span class="pre">PyUnicode_FORMAT_ASCII</span></code></td> <td><code class="docutils literal notranslate"><span class="pre">"B"</span></code></td> <td>1 byte</td> </tr> </tbody> </table> </section> <section id="pyunicode-import"> <h3><a class="toc-backref" href="#pyunicode-import" role="doc-backlink">PyUnicode_Import()</a></h3> <p>API:</p> <div class="highlight-c notranslate"><div class="highlight"><pre><span></span><span class="n">PyObject</span><span class="o">*</span><span class="w"> </span><span class="n">PyUnicode_Import</span><span class="p">(</span> <span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="w"> </span><span class="n">Py_ssize_t</span><span class="w"> </span><span class="n">nbytes</span><span class="p">,</span> <span class="w"> </span><span class="kt">int32_t</span><span class="w"> </span><span class="n">format</span><span class="p">)</span> </pre></div> </div> <p>Create a Unicode string object from a buffer in a supported format.</p> <ul class="simple"> <li>Return a reference to a new string object on success.</li> <li>Set an exception and return <code class="docutils literal notranslate"><span class="pre">NULL</span></code> on error.</li> </ul> <p><em>data</em> must not be NULL. <em>nbytes</em> must be positive or zero.</p> <p>See <code class="docutils literal notranslate"><span class="pre">PyUnicode_Export()</span></code> for the available formats.</p> </section> <section id="utf-8-format"> <h3><a class="toc-backref" href="#utf-8-format" role="doc-backlink">UTF-8 format</a></h3> <p>CPython 3.14 doesn’t use the UTF-8 format internally and doesn’t support exporting a string as UTF-8. 
The <code class="docutils literal notranslate"><span class="pre">PyUnicode_AsUTF8AndSize()</span></code> function can be used instead.</p> <p>The <code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UTF8</span></code> format is provided for compatibility with alternate implementations which may use UTF-8 natively for strings.</p> </section> <section id="ascii-format"> <h3><a class="toc-backref" href="#ascii-format" role="doc-backlink">ASCII format</a></h3> <p>When the <code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_ASCII</span></code> format is requested for export, the <code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_UCS1</span></code> export format is used for ASCII strings.</p> <p>The <code class="docutils literal notranslate"><span class="pre">PyUnicode_FORMAT_ASCII</span></code> format is mostly useful for <code class="docutils literal notranslate"><span class="pre">PyUnicode_Import()</span></code> to validate that a string only contains ASCII characters.</p> </section> <section id="surrogate-characters-and-embedded-nul-characters"> <h3><a class="toc-backref" href="#surrogate-characters-and-embedded-nul-characters" role="doc-backlink">Surrogate characters and embedded NUL characters</a></h3> <p>Surrogate characters are allowed: they can be imported and exported.</p> <p>Embedded NUL characters are allowed: they can be imported and exported.</p> </section> </section> <section id="implementation"> <h2><a class="toc-backref" href="#implementation" role="doc-backlink">Implementation</a></h2> <p><a class="reference external" href="https://github.com/python/cpython/pull/123738">https://github.com/python/cpython/pull/123738</a></p> </section> <section id="backwards-compatibility"> <h2><a class="toc-backref" href="#backwards-compatibility" role="doc-backlink">Backwards Compatibility</a></h2> <p>There is no impact on backward compatibility: only new C API functions are added.</p> 
</section> <section id="usage-of-pep-393-c-apis"> <h2><a class="toc-backref" href="#usage-of-pep-393-c-apis" role="doc-backlink">Usage of PEP 393 C APIs</a></h2> <p>A code search on PyPI top 7,500 projects (in March 2024) shows that there are many projects importing and exporting UCS formats with the regular C API.</p> <section id="pyunicode-fromkindanddata"> <h3><a class="toc-backref" href="#pyunicode-fromkindanddata" role="doc-backlink">PyUnicode_FromKindAndData()</a></h3> <p>25 projects call <code class="docutils literal notranslate"><span class="pre">PyUnicode_FromKindAndData()</span></code>:</p> <ul class="simple"> <li><strong>Cython</strong> (3.0.9)</li> <li>Levenshtein (0.25.0)</li> <li>PyICU (2.12)</li> <li>PyICU-binary (2.7.4)</li> <li>PyQt5 (5.15.10)</li> <li>PyQt6 (6.6.1)</li> <li>aiocsv (1.3.1)</li> <li>asyncpg (0.29.0)</li> <li>biopython (1.83)</li> <li>catboost (1.2.3)</li> <li>cffi (1.16.0)</li> <li>mojimoji (0.0.13)</li> <li>mwparserfromhell (0.6.6)</li> <li>numba (0.59.0)</li> <li><strong>numpy</strong> (1.26.4)</li> <li>orjson (3.9.15)</li> <li>pemja (0.4.1)</li> <li>pyahocorasick (2.0.0)</li> <li>pyjson5 (1.6.6)</li> <li>rapidfuzz (3.6.2)</li> <li>regex (2023.12.25)</li> <li>srsly (2.4.8)</li> <li>tokenizers (0.15.2)</li> <li>ujson (5.9.0)</li> <li>unicodedata2 (15.1.0)</li> </ul> </section> <section id="pyunicode-4byte-data"> <h3><a class="toc-backref" href="#pyunicode-4byte-data" role="doc-backlink">PyUnicode_4BYTE_DATA()</a></h3> <p>21 projects call <code class="docutils literal notranslate"><span class="pre">PyUnicode_2BYTE_DATA()</span></code> and/or <code class="docutils literal notranslate"><span class="pre">PyUnicode_4BYTE_DATA()</span></code>:</p> <ul class="simple"> <li><strong>Cython</strong> (3.0.9)</li> <li><strong>MarkupSafe</strong> (2.1.5)</li> <li>Nuitka (2.1.2)</li> <li>PyICU (2.12)</li> <li>PyICU-binary (2.7.4)</li> <li>PyQt5_sip (12.13.0)</li> <li>PyQt6_sip (13.6.0)</li> <li>biopython (1.83)</li> <li>catboost (1.2.3)</li> 
<li>cement (3.0.10)</li> <li>cffi (1.16.0)</li> <li>duckdb (0.10.0)</li> <li><strong>mypy</strong> (1.9.0)</li> <li><strong>numpy</strong> (1.26.4)</li> <li>orjson (3.9.15)</li> <li>pemja (0.4.1)</li> <li>pyahocorasick (2.0.0)</li> <li>pyjson5 (1.6.6)</li> <li>pyobjc-core (10.2)</li> <li>sip (6.8.3)</li> <li>wxPython (4.2.1)</li> </ul> </section> </section> <section id="rejected-ideas"> <h2><a class="toc-backref" href="#rejected-ideas" role="doc-backlink">Rejected Ideas</a></h2> <section id="reject-embedded-nul-characters-and-require-trailing-nul-character"> <h3><a class="toc-backref" href="#reject-embedded-nul-characters-and-require-trailing-nul-character" role="doc-backlink">Reject embedded NUL characters and require trailing NUL character</a></h3> <p>In C, it’s convenient to have a trailing NUL character. For example, the <code class="docutils literal notranslate"><span class="pre">for</span> <span class="pre">(;</span> <span class="pre">*str</span> <span class="pre">!=</span> <span class="pre">0;</span> <span class="pre">str++)</span></code> loop can be used to iterate on characters and <code class="docutils literal notranslate"><span class="pre">strlen()</span></code> can be used to get a string length.</p> <p>The problem is that a Python <code class="docutils literal notranslate"><span class="pre">str</span></code> object can embed NUL characters. Example: <code class="docutils literal notranslate"><span class="pre">"ab\0c"</span></code>. If a string contains an embedded NUL character, code relying on the NUL character to find the string end truncates the string. It can lead to bugs, or even security vulnerabilities. 
See a previous discussion in the issue <a class="reference external" href="https://github.com/python/cpython/issues/111089">Change PyUnicode_AsUTF8() to return NULL on embedded null characters</a>.</p> <p>Rejecting embedded NUL characters requires scanning the string, which has an <em>O</em>(<em>n</em>) complexity.</p> </section> <section id="reject-surrogate-characters"> <h3><a class="toc-backref" href="#reject-surrogate-characters" role="doc-backlink">Reject surrogate characters</a></h3> <p>Surrogate characters are characters in the Unicode range [U+D800; U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python <code class="docutils literal notranslate"><span class="pre">str</span></code> object can contain arbitrary lone surrogate characters. Example: <code class="docutils literal notranslate"><span class="pre">"\uDC80"</span></code>.</p> <p>Rejecting surrogate characters prevents exporting a string which contains such a character. This can be surprising and annoying since the <code class="docutils literal notranslate"><span class="pre">PyUnicode_Export()</span></code> caller doesn’t control the string contents.</p> <p>Allowing surrogate characters makes it possible to export any string and so avoids this issue. For example, the UTF-8 codec can be used with the <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code> error handler to encode and decode surrogate characters.</p> </section> <section id="conversions-on-demand"> <h3><a class="toc-backref" href="#conversions-on-demand" role="doc-backlink">Conversions on demand</a></h3> <p>It would be convenient to convert formats on demand. For example, convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is requested.</p> <p>The problem is that most users expect an export to require no memory copy and no conversion: an <em>O</em>(1) complexity. 
It is better to have an API where all operations have an <em>O</em>(1) complexity.</p> </section> <section id="export-to-utf-8"> <h3><a class="toc-backref" href="#export-to-utf-8" role="doc-backlink">Export to UTF-8</a></h3> <p>CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to allow exporting to UTF-8.</p> <p>The problem is that the UTF-8 cache doesn’t support surrogate characters. An export is expected to provide the whole string content, including embedded NUL characters and surrogate characters. To export surrogate characters, a different code path using the <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code> error handler is needed and each export operation has to allocate a temporary buffer: <em>O</em>(<em>n</em>) complexity.</p> <p>An export is expected to have an <em>O</em>(1) complexity, so the idea of exporting UTF-8 in CPython was abandoned.</p> </section> </section> <section id="discussions"> <h2><a class="toc-backref" href="#discussions" role="doc-backlink">Discussions</a></h2> <ul class="simple"> <li><a class="reference external" href="https://discuss.python.org/t/63891">https://discuss.python.org/t/63891</a></li> <li><a class="reference external" href="https://github.com/capi-workgroup/decisions/issues/33">https://github.com/capi-workgroup/decisions/issues/33</a></li> <li><a class="reference external" href="https://github.com/python/cpython/issues/119609">https://github.com/python/cpython/issues/119609</a></li> </ul> </section> <section id="copyright"> <h2><a class="toc-backref" href="#copyright" role="doc-backlink">Copyright</a></h2> <p>This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.</p> </section> </section> <hr class="docutils" /> <p>Source: <a class="reference external" href="https://github.com/python/peps/blob/main/peps/pep-0756.rst">https://github.com/python/peps/blob/main/peps/pep-0756.rst</a></p> <p>Last modified: <a 
class="reference external" href="https://github.com/python/peps/commits/main/peps/pep-0756.rst">2024-10-29 17:09:35 GMT</a></p> </article> </section> <script src="../_static/colour_scheme.js"></script> <script src="../_static/wrap_tables.js"></script> <script src="../_static/sticky_banner.js"></script> </body> </html>