<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="color-scheme" content="light dark"> <title>PEP 529 – Change Windows filesystem encoding to UTF-8 | peps.python.org</title> <link rel="shortcut icon" href="../_static/py.png"> <link rel="canonical" href="https://peps.python.org/pep-0529/"> <link rel="stylesheet" href="../_static/style.css" type="text/css"> <link rel="stylesheet" href="../_static/mq.css" type="text/css"> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" media="(prefers-color-scheme: light)" id="pyg-light"> <link rel="stylesheet" href="../_static/pygments_dark.css" type="text/css" media="(prefers-color-scheme: dark)" id="pyg-dark"> <link rel="alternate" type="application/rss+xml" title="Latest PEPs" href="https://peps.python.org/peps.rss"> <meta name="description" content="Historically, Python uses the ANSI APIs for interacting with the Windows operating system, often via C Runtime functions. However, these have been long discouraged in favor of the UTF-16 APIs. Within the operating system, all text is represented as UTF-..."> <meta name="theme-color" content="#3776ab"> </head> <body> <section id="pep-page-section"> <header> <h1>Python Enhancement Proposals</h1> </header> <article> <section id="pep-content"> <h1 class="page-title">PEP 529 – Change Windows filesystem encoding to UTF-8</h1> <dl class="rfc2822 field-list simple"> <dt class="field-odd">Author<span class="colon">:</span></dt> <dd class="field-odd">Steve Dower &lt;steve.dower at python.org&gt;</dd> <dt class="field-even">Status<span class="colon">:</span></dt> <dd class="field-even"><abbr title="Accepted and implementation complete, or no longer active">Final</abbr></dd> <dt class="field-odd">Type<span class="colon">:</span></dt> <dd class="field-odd"><abbr title="Normative PEP with a new feature for Python, implementation change for CPython or interoperability standard for the ecosystem">Standards Track</abbr></dd> <dt class="field-even">Created<span class="colon">:</span></dt> <dd class="field-even">27-Aug-2016</dd> <dt class="field-odd">Python-Version<span class="colon">:</span></dt> <dd class="field-odd">3.6</dd> <dt class="field-even">Post-History<span class="colon">:</span></dt> <dd class="field-even">01-Sep-2016, 04-Sep-2016</dd> <dt class="field-odd">Resolution<span class="colon">:</span></dt> <dd class="field-odd"><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-September/146277.html">Python-Dev message</a></dd> </dl> <hr class="docutils" /> <section id="contents"> <details><summary>Table of Contents</summary><ul 
class="simple"> <li><a class="reference internal" href="#abstract">Abstract</a></li> <li><a class="reference internal" href="#background">Background</a></li> <li><a class="reference internal" href="#proposal">Proposal</a></li> <li><a class="reference internal" href="#specific-changes">Specific Changes</a><ul> <li><a class="reference internal" href="#update-sys-getfilesystemencoding">Update sys.getfilesystemencoding</a></li> <li><a class="reference internal" href="#add-sys-getfilesystemencodeerrors">Add sys.getfilesystemencodeerrors</a></li> <li><a class="reference internal" href="#update-path-converter">Update path_converter</a></li> <li><a class="reference internal" href="#remove-unused-ansi-code">Remove unused ANSI code</a></li> <li><a class="reference internal" href="#add-legacy-mode">Add legacy mode</a></li> <li><a class="reference internal" href="#undeprecate-bytes-paths-on-windows">Undeprecate bytes paths on Windows</a></li> <li><a class="reference internal" href="#beta-experiment">Beta experiment</a></li> <li><a class="reference internal" href="#affected-modules">Affected Modules</a></li> </ul> </li> <li><a class="reference internal" href="#rejected-alternatives">Rejected Alternatives</a><ul> <li><a class="reference internal" href="#use-strict-mbcs-decoding">Use strict mbcs decoding</a></li> <li><a class="reference internal" href="#make-bytes-paths-an-error-on-windows">Make bytes paths an error on Windows</a></li> <li><a class="reference internal" href="#make-bytes-paths-an-error-on-all-platforms">Make bytes paths an error on all platforms</a></li> </ul> </li> <li><a class="reference internal" href="#code-that-may-break">Code that may break</a><ul> <li><a class="reference internal" href="#not-managing-encodings-across-boundaries">Not managing encodings across boundaries</a></li> <li><a class="reference internal" href="#explicitly-using-mbcs">Explicitly using ‘mbcs’</a></li> </ul> </li> <li><a class="reference internal" href="#copyright">Copyright</a></li> 
</ul> </details></section> <section id="abstract"> <h2><a class="toc-backref" href="#abstract" role="doc-backlink">Abstract</a></h2> <p>Historically, Python has used the ANSI APIs for interacting with the Windows operating system, often via C Runtime functions. However, these have long been discouraged in favor of the UTF-16 APIs. Within the operating system, all text is represented as UTF-16, and the ANSI APIs perform encoding and decoding using the active code page. See <a class="reference external" href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx">Naming Files, Paths, and Namespaces</a> for more details.</p> <p>This PEP proposes changing the default filesystem encoding on Windows to utf-8, and changing all filesystem functions to use the Unicode APIs for filesystem paths. This will not affect code that uses strings to represent paths; however, code that uses bytes for paths will now be able to correctly round-trip all valid paths in Windows filesystems. Currently, the conversions between Unicode (in the OS) and bytes (in Python) are lossy and fail to round-trip characters outside of the user’s active code page.</p> <p>Notably, this does not impact the encoding of the contents of files. These will continue to default to <code class="docutils literal notranslate"><span class="pre">locale.getpreferredencoding()</span></code> (for text files) or plain bytes (for binary files). This only affects the encoding used when users pass a bytes object to Python where it is then passed to the operating system as a path name.</p> </section> <section id="background"> <h2><a class="toc-backref" href="#background" role="doc-backlink">Background</a></h2> <p>File system paths are almost universally represented as text with an encoding determined by the file system. 
In Python, we expose these paths via a number of interfaces, such as the <code class="docutils literal notranslate"><span class="pre">os</span></code> and <code class="docutils literal notranslate"><span class="pre">io</span></code> modules. Paths may be passed either direction across these interfaces, that is, from the filesystem to the application (for example, <code class="docutils literal notranslate"><span class="pre">os.listdir()</span></code>), or from the application to the filesystem (for example, <code class="docutils literal notranslate"><span class="pre">os.unlink()</span></code>).</p> <p>When paths are passed between the filesystem and the application, they are either passed through as a bytes blob or converted to/from str using <code class="docutils literal notranslate"><span class="pre">os.fsencode()</span></code> and <code class="docutils literal notranslate"><span class="pre">os.fsdecode()</span></code> or explicit encoding using <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>. The result of encoding a string with <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code> is a blob of bytes in the native format for the default file system.</p> <p>On Windows, the native format for the filesystem is utf-16-le. The recommended platform APIs for accessing the filesystem all accept and return text encoded in this format. However, prior to Windows NT (and possibly further back), the native format was a configurable machine option and a separate set of APIs existed to accept this format. 
The option (the “active code page”) and these APIs (the “*A functions”) still exist in recent versions of Windows for backwards compatibility, though new functionality often only has a utf-16-le API (the “*W functions”).</p> <p>In Python, str is recommended because it can correctly round-trip all characters used in paths (on POSIX with surrogateescape handling; on Windows because str maps to the native representation). On Windows bytes cannot round-trip all characters used in paths, as Python internally uses the *A functions and hence the encoding is “whatever the active code page is”. Since the active code page cannot represent all Unicode characters, the conversion of a path into bytes can lose information without warning or any available indication.</p> <p>As a demonstration of this:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="nb">open</span><span class="p">(</span><span class="s1">'test</span><span class="se">\uAB00</span><span class="s1">.txt'</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span><span class="o">.</span><span class="n">close</span><span class="p">()</span> <span class="gp">>>> </span><span class="kn">import</span><span class="w"> </span><span class="nn">glob</span> <span class="gp">>>> </span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'test*'</span><span class="p">)</span> <span class="go">['test\uab00.txt']</span> <span class="gp">>>> </span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="sa">b</span><span class="s1">'test*'</span><span class="p">)</span> <span class="go">[b'test?.txt']</span> </pre></div> </div> <p>The Unicode character in the second call to glob has been replaced by a ‘?’, which means passing the path back into the filesystem will result in a <code class="docutils literal 
notranslate"><span class="pre">FileNotFoundError</span></code>. The same results may be observed with <code class="docutils literal notranslate"><span class="pre">os.listdir()</span></code> or any function that matches the return type to the parameter type.</p> <p>While one user-accessible fix is to use str everywhere, POSIX systems generally do not suffer from data loss when using bytes exclusively as the bytes are the canonical representation. Even if the encoding is “incorrect” by some standard, the file system will still map the bytes back to the file. Making use of this avoids the cost of decoding and reencoding, such that (theoretically, and only on POSIX), code such as this may be faster because of the use of <code class="docutils literal notranslate"><span class="pre">b'.'</span></code> compared to using <code class="docutils literal notranslate"><span class="pre">'.'</span></code>:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="sa">b</span><span class="s1">'.'</span><span class="p">):</span> <span class="gp">... </span> <span class="n">os</span><span class="o">.</span><span class="n">stat</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="gp">...</span> </pre></div> </div> <p>As a result, POSIX-focused library authors prefer to use bytes to represent paths. For some authors it is also a convenience, as their code may receive bytes already known to be encoded correctly, while others are attempting to simplify porting their code from Python 2. However, the correctness assumptions do not carry over to Windows where Unicode is the canonical representation, and errors may result. 
This potential data loss is why the use of bytes paths on Windows was deprecated in Python 3.3 - all of the above code snippets produce deprecation warnings on Windows.</p> </section> <section id="proposal"> <h2><a class="toc-backref" href="#proposal" role="doc-backlink">Proposal</a></h2> <p>Currently the default filesystem encoding is ‘mbcs’, which is a meta-encoder that uses the active code page. However, when bytes are passed to the filesystem they go through the *A APIs and the operating system handles encoding. In this case, paths are always encoded using the equivalent of ‘mbcs:replace’ with no opportunity for Python to override or change this.</p> <p>This proposal would remove all use of the *A APIs and only ever call the *W APIs. When Windows returns paths to Python as <code class="docutils literal notranslate"><span class="pre">str</span></code>, they will be decoded from utf-16-le and returned as text (in whatever the minimal representation is). When Python code requests paths as <code class="docutils literal notranslate"><span class="pre">bytes</span></code>, the paths will be transcoded from utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is possible to have invalid surrogates in filenames). Equally, when paths are provided as <code class="docutils literal notranslate"><span class="pre">bytes</span></code>, they are transcoded from utf-8 into utf-16-le and passed to the *W APIs.</p> <p>The use of utf-8 will not be configurable, except for the provision of a “legacy mode” flag to revert to the previous behaviour.</p> <p>The <code class="docutils literal notranslate"><span class="pre">surrogateescape</span></code> error mode does not apply here, as the concern is not about retaining nonsensical bytes. 
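</p> <p>As a rough illustration of the transcoding described above, the following sketch uses only the standard codecs to show why <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code> is needed when converting between utf-16-le and utf-8. The file name is made up for the example, and the snippet only models the conversions; it does not call any Windows API and is not the CPython implementation.</p>

```python
# Hypothetical file name containing a lone surrogate, which Windows permits
# because it does not validate surrogate pairs.
name = "test\udc80.txt"

# Model of what the OS hands back: the name as utf-16-le code units.
wide = name.encode("utf-16-le", "surrogatepass")

# Strict utf-8 cannot represent a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# utf-16-le -> str -> utf-8 bytes, as when a path is requested as bytes.
as_bytes = wide.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")

# utf-8 bytes -> str -> utf-16-le, as when a bytes path is passed back in.
back = as_bytes.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")

print(strict_ok)     # False
print(back == wide)  # True
```

<p>The round trip succeeds only because both directions use the same error handler; with the default strict handler the first conversion to utf-8 would raise instead.</p> <p>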
Any path returned from the operating system will be valid Unicode, while invalid paths created by the user should raise a decoding error (currently these would raise <code class="docutils literal notranslate"><span class="pre">OSError</span></code> or a subclass).</p> <p>The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the ability to round-trip path names and allow basic manipulation (for example, using the <code class="docutils literal notranslate"><span class="pre">os.path</span></code> module) when assuming an ASCII-compatible encoding. Using utf-16-le as the encoding is more pure, but will cause more issues than are resolved.</p> <p>This change would also undeprecate the use of bytes paths on Windows. No change to the semantics of using bytes as a path is required - as before, they must be encoded with the encoding specified by <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>.</p> </section> <section id="specific-changes"> <h2><a class="toc-backref" href="#specific-changes" role="doc-backlink">Specific Changes</a></h2> <section id="update-sys-getfilesystemencoding"> <h3><a class="toc-backref" href="#update-sys-getfilesystemencoding" role="doc-backlink">Update sys.getfilesystemencoding</a></h3> <p>Remove the default value for <code class="docutils literal notranslate"><span class="pre">Py_FileSystemDefaultEncoding</span></code> and set it in <code class="docutils literal notranslate"><span class="pre">initfsencoding()</span></code> to utf-8, or if the legacy-mode switch is enabled to mbcs.</p> <p>Update the implementations of <code class="docutils literal notranslate"><span class="pre">PyUnicode_DecodeFSDefaultAndSize()</span></code> and <code class="docutils literal notranslate"><span class="pre">PyUnicode_EncodeFSDefault()</span></code> to use the utf-8 codec, or if the legacy-mode switch is enabled the existing mbcs codec.</p> </section> <section 
id="add-sys-getfilesystemencodeerrors"> <h3><a class="toc-backref" href="#add-sys-getfilesystemencodeerrors" role="doc-backlink">Add sys.getfilesystemencodeerrors</a></h3> <p>As the error mode may now change between <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code> and <code class="docutils literal notranslate"><span class="pre">replace</span></code>, Python code that manually performs encoding also needs access to the current error mode. This includes the implementation of <code class="docutils literal notranslate"><span class="pre">os.fsencode()</span></code> and <code class="docutils literal notranslate"><span class="pre">os.fsdecode()</span></code>, which currently assume an error mode based on the codec.</p> <p>Add a public <code class="docutils literal notranslate"><span class="pre">Py_FileSystemDefaultEncodeErrors</span></code>, similar to the existing <code class="docutils literal notranslate"><span class="pre">Py_FileSystemDefaultEncoding</span></code>. The default value on Windows will be <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code> or in legacy mode, <code class="docutils literal notranslate"><span class="pre">replace</span></code>. 
The default value on all other platforms will be <code class="docutils literal notranslate"><span class="pre">surrogateescape</span></code>.</p> <p>Add a public <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencodeerrors()</span></code> function that returns the current error mode.</p> <p>Update the implementations of <code class="docutils literal notranslate"><span class="pre">PyUnicode_DecodeFSDefaultAndSize()</span></code> and <code class="docutils literal notranslate"><span class="pre">PyUnicode_EncodeFSDefault()</span></code> to use the variable for error mode rather than constant strings.</p> <p>Update the implementations of <code class="docutils literal notranslate"><span class="pre">os.fsencode()</span></code> and <code class="docutils literal notranslate"><span class="pre">os.fsdecode()</span></code> to use <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencodeerrors()</span></code> instead of assuming the mode.</p> </section> <section id="update-path-converter"> <h3><a class="toc-backref" href="#update-path-converter" role="doc-backlink">Update path_converter</a></h3> <p>Update the path converter to always decode bytes or buffer objects into text using <code class="docutils literal notranslate"><span class="pre">PyUnicode_DecodeFSDefaultAndSize()</span></code>.</p> <p>Change the <code class="docutils literal notranslate"><span class="pre">narrow</span></code> field from a <code class="docutils literal notranslate"><span class="pre">char*</span></code> string into a flag that indicates whether the original object was bytes. 
This is required for functions that need to return paths using the same type as was originally provided.</p> </section> <section id="remove-unused-ansi-code"> <h3><a class="toc-backref" href="#remove-unused-ansi-code" role="doc-backlink">Remove unused ANSI code</a></h3> <p>Remove all code paths using the <code class="docutils literal notranslate"><span class="pre">narrow</span></code> field, as these will no longer be reachable by any caller. These are only used within <code class="docutils literal notranslate"><span class="pre">posixmodule.c</span></code>. Other uses of paths should have use of bytes paths replaced with decoding and use of the *W APIs.</p> </section> <section id="add-legacy-mode"> <h3><a class="toc-backref" href="#add-legacy-mode" role="doc-backlink">Add legacy mode</a></h3> <p>Add a legacy mode flag, enabled by the environment variable <code class="docutils literal notranslate"><span class="pre">PYTHONLEGACYWINDOWSFSENCODING</span></code> or by a function call to <code class="docutils literal notranslate"><span class="pre">sys._enablelegacywindowsfsencoding()</span></code>. The function call can only be used to enable the flag and should be used by programs as close to initialization as possible. Legacy mode cannot be disabled while Python is running.</p> <p>When this flag is set, the default filesystem encoding is set to mbcs rather than utf-8, and the error mode is set to <code class="docutils literal notranslate"><span class="pre">replace</span></code> rather than <code class="docutils literal notranslate"><span class="pre">surrogatepass</span></code>. 
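</p> <p>A minimal sketch of how a program might check which mode is in effect, assuming Python 3.6 or later where <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencodeerrors()</span></code> is available. The helper names <code class="docutils literal notranslate"><span class="pre">describe_fs_encoding</span></code> and <code class="docutils literal notranslate"><span class="pre">manual_fsencode</span></code> are hypothetical and exist only for this illustration.</p>

```python
import os
import sys


def describe_fs_encoding():
    """Return the (encoding, error mode) pair Python uses for bytes paths."""
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


def manual_fsencode(path):
    """Encode a str path by hand, using the current encoding and error mode."""
    encoding, errors = describe_fs_encoding()
    return path.encode(encoding, errors)


encoding, errors = describe_fs_encoding()
# The pair printed here depends on the platform and on whether legacy mode
# was enabled before this code ran.
print(encoding, errors)

# For an ordinary name, the manual encoding agrees with os.fsencode().
print(manual_fsencode("example.txt") == os.fsencode("example.txt"))
```

<p>Code that encodes paths manually can use the reported pair rather than hard-coding an encoding or error mode, so it behaves consistently whether or not legacy mode is enabled.</p> <p>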
Paths will continue to decode to wide characters and only *W APIs will be called; however, the bytes passed in and received from Python will be encoded the same way as before this change.</p> </section> <section id="undeprecate-bytes-paths-on-windows"> <h3><a class="toc-backref" href="#undeprecate-bytes-paths-on-windows" role="doc-backlink">Undeprecate bytes paths on Windows</a></h3> <p>Using bytes as paths on Windows is currently deprecated. We would announce that this is no longer the case, and that paths encoded as bytes should use the encoding returned by <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code> rather than the user’s active code page.</p> </section> <section id="beta-experiment"> <h3><a class="toc-backref" href="#beta-experiment" role="doc-backlink">Beta experiment</a></h3> <p>To assist with determining the impact of this change, we propose applying it to 3.6.0b1 provisionally, with the intent of making a final decision before 3.6.0b4.</p> <p>During the experiment period, decoding and encoding exception messages will be expanded to include a link to an active online discussion and encourage reporting of problems.</p> <p>If it is decided to revert the functionality for 3.6.0b4, the implementation change would be to permanently enable the legacy mode flag, change the environment variable to <code class="docutils literal notranslate"><span class="pre">PYTHONWINDOWSUTF8FSENCODING</span></code> and the function to <code class="docutils literal notranslate"><span class="pre">sys._enablewindowsutf8fsencoding()</span></code> to allow enabling the functionality on a case-by-case basis, as opposed to disabling it.</p> <p>It is expected that if we cannot feasibly make the change for 3.6 due to compatibility concerns, it will not be possible to make the change at any later time in Python 3.x.</p> </section> <section id="affected-modules"> <h3><a class="toc-backref" href="#affected-modules" 
role="doc-backlink">Affected Modules</a></h3> <p>This PEP implicitly includes all modules within Python that either pass path names to the operating system, or otherwise use <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>.</p> <p>As of 3.6.0a4, the following modules require modification:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">os</span></code></li> <li><code class="docutils literal notranslate"><span class="pre">_overlapped</span></code></li> <li><code class="docutils literal notranslate"><span class="pre">_socket</span></code></li> <li><code class="docutils literal notranslate"><span class="pre">subprocess</span></code></li> <li><code class="docutils literal notranslate"><span class="pre">zipimport</span></code></li> </ul> <p>The following modules use <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code> but do not need modification:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">gc</span></code> (already assumes bytes are utf-8)</li> <li><code class="docutils literal notranslate"><span class="pre">grp</span></code> (not compiled for Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">http.server</span></code> (correctly includes codec name with transmitted data)</li> <li><code class="docutils literal notranslate"><span class="pre">idlelib.editor</span></code> (should not be needed; has fallback handling)</li> <li><code class="docutils literal notranslate"><span class="pre">nis</span></code> (not compiled for Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">pwd</span></code> (not compiled for Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">spwd</span></code> (not compiled for Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">_ssl</span></code> (only used 
for ASCII constants)</li> <li><code class="docutils literal notranslate"><span class="pre">tarfile</span></code> (code unused on Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">_tkinter</span></code> (already assumes bytes are utf-8)</li> <li><code class="docutils literal notranslate"><span class="pre">wsgiref</span></code> (assumed as the default encoding for unknown environments)</li> <li><code class="docutils literal notranslate"><span class="pre">zipapp</span></code> (code unused on Windows)</li> </ul> <p>The following native code uses one of the encoding or decoding functions, but does not require any modification:</p> <ul class="simple"> <li><code class="docutils literal notranslate"><span class="pre">Parser/parsetok.c</span></code> (docs already specify <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/ast.c</span></code> (docs already specify <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/compile.c</span></code> (undocumented, but Python filesystem encoding implied)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/errors.c</span></code> (docs already specify <code class="docutils literal notranslate"><span class="pre">os.fsdecode()</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/fileutils.c</span></code> (code unused on Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/future.c</span></code> (undocumented, but Python filesystem encoding implied)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/import.c</span></code> (docs already specify utf-8)</li> <li><code class="docutils literal notranslate"><span 
class="pre">Python/importdl.c</span></code> (code unused on Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/pythonrun.c</span></code> (docs already specify <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/symtable.c</span></code> (undocumented, but Python filesystem encoding implied)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/thread.c</span></code> (code unused on Windows)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/traceback.c</span></code> (encodes correctly for comparing strings)</li> <li><code class="docutils literal notranslate"><span class="pre">Python/_warnings.c</span></code> (docs already specify <code class="docutils literal notranslate"><span class="pre">os.fsdecode()</span></code>)</li> </ul> </section> </section> <section id="rejected-alternatives"> <h2><a class="toc-backref" href="#rejected-alternatives" role="doc-backlink">Rejected Alternatives</a></h2> <section id="use-strict-mbcs-decoding"> <h3><a class="toc-backref" href="#use-strict-mbcs-decoding" role="doc-backlink">Use strict mbcs decoding</a></h3> <p>This is essentially the same as the proposed change, but instead of changing <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code> to utf-8 it is changed to mbcs (which dynamically maps to the active code page).</p> <p>This approach allows the use of new functionality that is only available as *W APIs and also detection of encoding/decoding errors. For example, rather than silently replacing Unicode characters with ‘?’, it would be possible to warn or fail the operation.</p> <p>Compared to the proposed fix, this could enable some new functionality but does not fix any of the problems described initially. 
New runtime errors may cause some problems to be more obvious and lead to fixes, provided library maintainers are interested in supporting Windows and adding a separate code path to treat filesystem paths as strings.</p> <p>Making the encoding mbcs without strict errors is equivalent to the legacy-mode switch being enabled by default. This is a possible course of action if there is significant breakage of actual code and a need to extend the deprecation period, but still a desire to have the simplifications to the CPython source.</p> </section> <section id="make-bytes-paths-an-error-on-windows"> <h3><a class="toc-backref" href="#make-bytes-paths-an-error-on-windows" role="doc-backlink">Make bytes paths an error on Windows</a></h3> <p>By preventing the use of bytes paths on Windows completely, we prevent users from hitting encoding issues.</p> <p>However, the motivation for this PEP is to increase the likelihood that code written on POSIX will also work correctly on Windows. This alternative would move in the other direction and make such code completely incompatible. As this does not benefit users in any way, we reject it.</p> </section> <section id="make-bytes-paths-an-error-on-all-platforms"> <h3><a class="toc-backref" href="#make-bytes-paths-an-error-on-all-platforms" role="doc-backlink">Make bytes paths an error on all platforms</a></h3> <p>By deprecating and then disabling the use of bytes paths on all platforms, we prevent users from hitting encoding issues regardless of where the code was originally written. 
This would require a full deprecation cycle, as there are currently no warnings on platforms other than Windows.</p> <p>This is likely to be seen as a hostile action against Python developers in general, and as such is rejected at this time.</p> </section> </section> <section id="code-that-may-break"> <h2><a class="toc-backref" href="#code-that-may-break" role="doc-backlink">Code that may break</a></h2> <p>The following code patterns may break or see different behaviour as a result of this change. Each of these examples would have been fragile in code intended for cross-platform use. The suggested fixes demonstrate the most compatible way to handle path encoding issues across all platforms and across multiple Python versions.</p> <p>Note that all of these examples produce deprecation warnings on Python 3.3 and later.</p> <section id="not-managing-encodings-across-boundaries"> <h3><a class="toc-backref" href="#not-managing-encodings-across-boundaries" role="doc-backlink">Not managing encodings across boundaries</a></h3> <p>Code that does not manage encodings when crossing protocol boundaries may currently be working by chance, but could encounter issues when either encoding changes. 
Note that the source of <code class="docutils literal notranslate"><span class="pre">filename</span></code> may be any function that returns a bytes object, as illustrated in a second example below:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'filename_in_mbcs.txt'</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> </pre></div> </div> <p>To correct this code, the encoding of the bytes in <code class="docutils literal notranslate"><span class="pre">filename</span></code> should be specified, either when reading from the file or before using the value:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="c1"># Fix 1: Open file as text (default encoding)</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'filename_in_mbcs.txt'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span 
class="p">()</span> <span class="gp">>>> </span><span class="c1"># Fix 2: Open file as text (explicit encoding)</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'filename_in_mbcs.txt'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'mbcs'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span class="c1"># Fix 3: Explicitly decode the path</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'filename_in_mbcs.txt'</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'mbcs'</span><span class="p">),</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> </pre></div> </div> <p>Where the creator of <code class="docutils literal notranslate"><span class="pre">filename</span></code> is separated from the user of <code class="docutils literal notranslate"><span class="pre">filename</span></code>, the encoding is 
important information to include:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">'C:\Users\Steve\Documents\my_file.txt'</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'mbcs'</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="go"><class 'bytes'></span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> </pre></div> </div> <p>To fix this code for best compatibility across operating systems and Python versions, the filename should be exposed as str:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="c1"># Fix 1: Expose as str</span> <span class="gp">>>> </span><span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">'C:\Users\Steve\Documents\my_file.txt'</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">filename</span><span 
class="p">)</span> <span class="go"><class 'str'></span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> </pre></div> </div> <p>Alternatively, the encoding used for the path needs to be made available to the user. Specifying <code class="docutils literal notranslate"><span class="pre">os.fsencode()</span></code> (or <code class="docutils literal notranslate"><span class="pre">sys.getfilesystemencoding()</span></code>) is an acceptable choice, or a new attribute could be added with the exact encoding:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="c1"># Fix 2: Use fsencode</span> <span class="gp">>>> </span><span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">fsencode</span><span class="p">(</span><span class="sa">r</span><span class="s1">'C:\Users\Steve\Documents\my_file.txt'</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="go"><class 'bytes'></span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="gp">>>> </span><span 
class="c1"># Fix 3: Expose as explicit encoding</span> <span class="gp">>>> </span><span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">'C:\Users\Steve\Documents\my_file.txt'</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'cp437'</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">some_object</span><span class="o">.</span><span class="n">filename_encoding</span> <span class="o">=</span> <span class="s1">'cp437'</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="n">some_object</span><span class="o">.</span><span class="n">filename</span> <span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="go"><class 'bytes'></span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="n">filename</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">some_object</span><span class="o">.</span><span class="n">filename_encoding</span><span class="p">)</span> <span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="go"><class 'str'></span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> </pre></div> </div> </section> <section id="explicitly-using-mbcs"> <h3><a class="toc-backref" href="#explicitly-using-mbcs" role="doc-backlink">Explicitly using ‘mbcs’</a></h3> <p>Code that explicitly encodes text using ‘mbcs’ before 
passing to file system APIs is now passing incorrectly encoded bytes. Note that the source of <code class="docutils literal notranslate"><span class="pre">filename</span></code> in this example is not relevant, provided that it is a str:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'files.txt'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'mbcs'</span><span class="p">),</span> <span class="s1">'r'</span><span class="p">)</span> </pre></div> </div> <p>To correct this code, the string should be passed without explicit encoding, or should use <code class="docutils literal notranslate"><span class="pre">os.fsencode()</span></code>:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="c1"># Fix 1: Do not encode the string</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'files.txt'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span 
class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="gp">>>> </span><span class="c1"># Fix 2: Use correct encoding</span> <span class="gp">>>> </span><span class="n">filename</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'files.txt'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">fsencode</span><span class="p">(</span><span class="n">filename</span><span class="p">),</span> <span class="s1">'r'</span><span class="p">)</span> </pre></div> </div> </section> </section> <section id="copyright"> <h2><a class="toc-backref" href="#copyright" role="doc-backlink">Copyright</a></h2> <p>This document has been placed in the public domain.</p> </section> </section> <hr class="docutils" /> <p>Source: <a class="reference external" href="https://github.com/python/peps/blob/main/peps/pep-0529.rst">https://github.com/python/peps/blob/main/peps/pep-0529.rst</a></p> <p>Last modified: <a class="reference external" href="https://github.com/python/peps/commits/main/peps/pep-0529.rst">2025-02-01 08:59:27 GMT</a></p> </article> <nav id="pep-sidebar"> <h2>Contents</h2> <ul> <li><a class="reference internal" href="#abstract">Abstract</a></li> <li><a class="reference internal" href="#background">Background</a></li> <li><a class="reference internal" href="#proposal">Proposal</a></li> <li><a class="reference internal" href="#specific-changes">Specific Changes</a><ul> <li><a class="reference internal" href="#update-sys-getfilesystemencoding">Update 
sys.getfilesystemencoding</a></li> <li><a class="reference internal" href="#add-sys-getfilesystemencodeerrors">Add sys.getfilesystemencodeerrors</a></li> <li><a class="reference internal" href="#update-path-converter">Update path_converter</a></li> <li><a class="reference internal" href="#remove-unused-ansi-code">Remove unused ANSI code</a></li> <li><a class="reference internal" href="#add-legacy-mode">Add legacy mode</a></li> <li><a class="reference internal" href="#undeprecate-bytes-paths-on-windows">Undeprecate bytes paths on Windows</a></li> <li><a class="reference internal" href="#beta-experiment">Beta experiment</a></li> <li><a class="reference internal" href="#affected-modules">Affected Modules</a></li> </ul> </li> <li><a class="reference internal" href="#rejected-alternatives">Rejected Alternatives</a><ul> <li><a class="reference internal" href="#use-strict-mbcs-decoding">Use strict mbcs decoding</a></li> <li><a class="reference internal" href="#make-bytes-paths-an-error-on-windows">Make bytes paths an error on Windows</a></li> <li><a class="reference internal" href="#make-bytes-paths-an-error-on-all-platforms">Make bytes paths an error on all platforms</a></li> </ul> </li> <li><a class="reference internal" href="#code-that-may-break">Code that may break</a><ul> <li><a class="reference internal" href="#not-managing-encodings-across-boundaries">Not managing encodings across boundaries</a></li> <li><a class="reference internal" href="#explicitly-using-mbcs">Explicitly using ‘mbcs’</a></li> </ul> </li> <li><a class="reference internal" href="#copyright">Copyright</a></li> </ul> <br> <a id="source" href="https://github.com/python/peps/blob/main/peps/pep-0529.rst">Page Source (GitHub)</a> </nav> </section> <script src="../_static/colour_scheme.js"></script> <script src="../_static/wrap_tables.js"></script> <script src="../_static/sticky_banner.js"></script> </body> </html>