This change would make it easier to write code that works with either string type and would also make some existing code handle unicode strings. The C function <code class="docutils literal notranslate"><span class="pre">PyObject_Str()</span></code> would remain unchanged and the function <code class="docutils literal notranslate"><span class="pre">PyString_New()</span></code> would be added instead.</p> </section> <section id="rationale"> <h2><a class="toc-backref" href="#rationale" role="doc-backlink">Rationale</a></h2> <p>Python has had a Unicode string type for some time now but use of it is not yet widespread. There is a large amount of Python code that assumes that string data is represented as str instances. The long-term plan for Python is to phase out the str type and use unicode for all string data. Clearly, a smooth migration path must be provided.</p> <p>We need to upgrade existing libraries, written for str instances, so that they can operate in an all-unicode string world. We can’t change to an all-unicode world until all essential libraries are capable of operating in it. Upgrading the libraries in one shot does not seem feasible. A more realistic strategy is to make the libraries capable of operating on unicode strings one at a time while preserving their current behaviour in an all-str environment.</p> <p>First, we need to be able to write code that can accept unicode instances without attempting to coerce them to str instances. Let us label such code as Unicode-safe. Unicode-safe libraries can be used in an all-unicode world.</p> <p>Second, we need to be able to write code that, when provided only str instances, will not create unicode results. Let us label such code as str-stable. Libraries that are str-stable can be used by libraries and applications that are not yet Unicode-safe.</p> <p>Sometimes it is simple to write code that is both str-stable and Unicode-safe. 
For example, the following function just works:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">appendx</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">+</span> <span class="s1">'x'</span>
</pre></div> </div> <p>That’s not too surprising since the unicode type is designed to make the task easier. The principle is that when str and unicode instances meet, the result is a unicode instance. One notable difficulty arises when code requires a string representation of an object, an operation traditionally accomplished by using the <code class="docutils literal notranslate"><span class="pre">str()</span></code> built-in function.</p> <p>Using the current <code class="docutils literal notranslate"><span class="pre">str()</span></code> function makes the code not Unicode-safe. Replacing a <code class="docutils literal notranslate"><span class="pre">str()</span></code> call with a <code class="docutils literal notranslate"><span class="pre">unicode()</span></code> call makes the code not str-stable. Changing <code class="docutils literal notranslate"><span class="pre">str()</span></code> so that it could return unicode instances would solve this problem. 
As a further benefit, some code that is currently not Unicode-safe because it uses <code class="docutils literal notranslate"><span class="pre">str()</span></code> would become Unicode-safe.</p> </section> <section id="specification"> <h2><a class="toc-backref" href="#specification" role="doc-backlink">Specification</a></h2> <p>A Python implementation of the <code class="docutils literal notranslate"><span class="pre">str()</span></code> built-in follows:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">str</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="w">    </span><span class="sd">"""Return a nice string representation of the object. The</span>
<span class="sd">    return value is a str or unicode instance.</span>
<span class="sd">    """</span>
    <span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="ow">is</span> <span class="nb">str</span> <span class="ow">or</span> <span class="nb">type</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="ow">is</span> <span class="n">unicode</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">s</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="fm">__str__</span><span class="p">()</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">unicode</span><span class="p">)):</span>
        <span class="k">raise</span> <span class="ne">TypeError</span><span class="p">(</span><span class="s1">'__str__ returned non-string'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>
</pre></div> </div> <p>The following function would be added to the C API and would be the equivalent of the <code class="docutils literal notranslate"><span class="pre">str()</span></code> built-in (ideally it would be called <code class="docutils literal notranslate"><span class="pre">PyObject_Str</span></code>, but changing that function could cause a massive number of compatibility problems):</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">PyObject</span> <span class="o">*</span><span class="n">PyString_New</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="p">);</span>
</pre></div> </div> <p>A reference implementation is available on Sourceforge <a class="footnote-reference brackets" href="#id2" id="id1">[1]</a> as a patch.</p> </section> <section id="backwards-compatibility"> <h2><a class="toc-backref" href="#backwards-compatibility" role="doc-backlink">Backwards Compatibility</a></h2> <p>Some code may require that <code class="docutils literal notranslate"><span class="pre">str()</span></code> returns a str instance. In the standard library, only one such case has been found so far. The function <code class="docutils literal notranslate"><span class="pre">email.header_decode()</span></code> requires a str instance and the <code class="docutils literal notranslate"><span class="pre">email.Header.decode_header()</span></code> function tries to ensure this by calling <code class="docutils literal notranslate"><span class="pre">str()</span></code> on its argument. 
The code was fixed by changing the line “header = str(header)” to:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">header</span><span class="p">,</span> <span class="n">unicode</span><span class="p">):</span>
    <span class="n">header</span> <span class="o">=</span> <span class="n">header</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'ascii'</span><span class="p">)</span>
</pre></div> </div> <p>Whether this is truly a bug is questionable since <code class="docutils literal notranslate"><span class="pre">decode_header()</span></code> really operates on byte strings, not character strings. Code that passes it a unicode instance could itself be considered buggy.</p> </section> <section id="alternative-solutions"> <h2><a class="toc-backref" href="#alternative-solutions" role="doc-backlink">Alternative Solutions</a></h2> <p>A new built-in function could be added instead of changing <code class="docutils literal notranslate"><span class="pre">str()</span></code>. Doing so would introduce virtually no backwards compatibility problems. However, since the compatibility problems are expected to be rare, changing <code class="docutils literal notranslate"><span class="pre">str()</span></code> seems preferable to adding a new built-in.</p> <p>The basestring type could be changed to have the proposed behaviour, rather than changing <code class="docutils literal notranslate"><span class="pre">str()</span></code>. 