<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=US-ASCII"> <style type="text/css"> body { color: #000000; background-color: #FFFFFF; } del { text-decoration: line-through; color: #8B0040; } ins { text-decoration: underline; color: #005100; } p.example { margin-left: 2em; } pre.example { margin-left: 2em; } div.example { margin-left: 2em; } code.extract { background-color: #F5F6A2; } pre.extract { margin-left: 2em; background-color: #F5F6A2; border: 1px solid #E1E28E; } p.function { } .attribute { margin-left: 2em; } .attribute dt { float: left; font-style: italic; padding-right: 1ex; } .attribute dd { margin-left: 0em; } blockquote.std { color: #000000; background-color: #F1F1F1; border: 1px solid #D1D1D1; padding-left: 0.5em; padding-right: 0.5em; } blockquote.stddel { text-decoration: line-through; color: #000000; background-color: #FFEBFF; border: 1px solid #ECD7EC; padding-left: 0.5empadding-right: 0.5em; ; } blockquote.stdins { text-decoration: underline; color: #000000; background-color: #C8FFC8; border: 1px solid #B3EBB3; padding: 0.5em; } table { border: 1px solid black; border-spacing: 0px; margin-left: auto; margin-right: auto; } th { text-align: left; vertical-align: top; padding-left: 0.8em; border: none; } td { text-align: left; vertical-align: top; padding-left: 0.8em; border: none; } </style> <title>Digit Separators</title> </head> <body> <h1>Digit Separators</h1> <p> ISO/IEC JTC1 SC22 WG21 N3499 - 2012-12-19 </p> <address> Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org </address> <p> <a href="#Problem">Problem</a><br> <a href="#Solution">Solution</a><br> <a href="#Constraints">Constraints</a><br>     <a href="#Ambiguity">Program Ambiguity</a><br>     <a href="#Lexical">Lexical Language Compatibility</a><br>     <a href="#Extension">Extension Language Compatibility</a><br> <a href="#Existing">Existing Grammar</a><br>     <a 
href="#old.lex.charset">2.3 Character sets [lex.charset]</a><br>     <a href="#old.lex.pptoken">2.5 Preprocessing tokens [lex.pptoken]</a><br>     <a href="#old.lex.ppnumber">2.10 Preprocessing numbers [lex.ppnumber]</a><br>     <a href="#old.lex.name">2.11 Identifiers [lex.name]</a><br>     <a href="#old.lex.icon">2.14.2 Integer literals [lex.icon]</a><br>     <a href="#old.lex.fcon">2.14.3 Floating literals [lex.fcon]</a><br>     <a href="#old.lex.ext">2.14.8 User-defined literals [lex.ext]</a><br>     <a href="#old.cpp">16 Preprocessing directives [cpp]</a><br> <a href="#Approaches">Approaches</a><br>     <a href="#RemoveLit">Remove User-Defined Literals</a><br>     <a href="#Typographic">Typographic</a><br>     <a href="#GraveAccent">Grave Accent</a><br>     <a href="#SingleQuote">Single Quote</a><br>     <a href="#Underscore">Underscore</a><br>         <a href="#DoubleUnderscore">Double Underscore</a><br>         <a href="#ScopeOperator">Scope Operator</a><br>         <a href="#NonDigitSuffix">Non-Digit Literal Suffix</a><br>         <a href="#Spacing">Spacing</a><br>         <a href="#DoubleRadixPoint">Double Radix Point</a><br>         <a href="#Backslash">Backslash</a><br> <a href="#Proposal">Proposal</a><br>     <a href="#new.lex.ppnumber">2.10 Preprocessing numbers [lex.ppnumber]</a><br>     <a href="#new.lex.icon">2.14.2 Integer literals [lex.icon]</a><br>     <a href="#new.lex.fcon">2.14.4 Floating literals [lex.fcon]</a><br>     <a href="#new.lex.ext">2.14.8 User-defined literals [lex.ext]</a><br> <a href="#References">References</a><br> </p> <h2><a name="Problem">Problem</a></h2> <p> Numeric literals of more than a few digits are hard to read. Consider the following tasks. 
</p> <ul> <li>Pronounce <code>7237498123</code>.</li> <li>Compare <code>237498123</code> with <code>237499123</code> for equality.</li> <li>Decide whether <code>237499123</code> or <code>20249472</code> is larger.</li> </ul> <h2><a name="Solution">Solution</a></h2> <p> This problem has a long-standing solution in writing and typography: digit separators. In the English-speaking world, commas are usually used to separate digits. </p> <ul> <li>Pronounce <code>7,237,498,123</code>.</li> <li>Compare <code>237,498,123</code> with <code>237,499,123</code> for equality.</li> <li>Decide whether <code>237,499,123</code> or <code>20,249,472</code> is larger.</li> </ul> <p> We wish to introduce digit separators into C++. The exact syntax is still open. The remainder of this paper discusses various approaches to the solution. </p> <h2><a name="Constraints">Constraints</a></h2> <p> Constraints on digit separators arise from three distinct sources. </p> <h3><a name="Ambiguity">Program Ambiguity</a></h3> <p> Adding digit separators introduces the potential for ambiguous C++ programs. We would prefer to avoid ambiguity, and failing that would prefer to have usable rules for disambiguating the source. In particular, the interaction with user-defined literals <a href="#N2747">[N2747]</a> <a href="#N2765">[N2765]</a> should be carefully considered. </p> <h3><a name="Lexical">Lexical Language Compatibility</a></h3> <p> The lexical structure of C++ is shared with C, Objective C/C++, and other tools through the preprocessor. Any introduction of digit separators should carefully consider compatibility with the existing lexical structure of these languages. </p> <p> Richard Smith questions the value of compatibility here. </p> <blockquote> <p> This problem only arises if: </p> <ol> <li> Someone is attempting to write a file which is to be shared between C++14 and other languages, and </li> <li> They include text in that header which simply does not work in those other languages. 
</li> </ol> <p> I find it hard to believe that this will be a real problem, and it seems like a clear case of user error. (If you're writing a header which works in C and C++, the burden is on you to make sure it works in C). </p> <p> This is not a new issue. The same problem already exists with C++11's raw string literals, and to a lesser extent with user-defined-literals and with C's hex floats (which allow 'p+' within pp-numbers). </p> </blockquote> <h3><a name="Extension">Extension Language Compatibility</a></h3> <p> C++ is often used as the basis for extended languages, notably Objective C/C++, but also many languages that are smaller and less widely used. Invalidating those extension languages has costs that are hard to predict. </p> <h2><a name="Existing">Existing Grammar</a></h2> <p> The existing grammar provides both constraints and opportunities. </p> <h3><a name="old.lex.charset">2.3 Character sets [lex.charset]</a></h3> <p> Paragraph 1 is as follows. </p> <blockquote> <p> The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters: [<i>Footnote:</i> The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files. —<i>end footnote</i>] </p> <blockquote> <pre><code>a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ! 
= , \ " ' </code></pre> </blockquote> </blockquote> <p> Of particular note, the only printable ASCII characters not used in the C++ basic character set are <code>$</code> (dollar), <code>@</code> (commercial at sign), and <code>`</code> (grave accent, back tick). All of these characters have been used for extension characters. Dollar has also been used as an identifier character, e.g. in VAX/VMS system functions names. </p> <h3><a name="old.lex.pptoken">2.5 Preprocessing tokens [lex.pptoken]</a></h3> <p> The grammar is as follows. </p> <blockquote> <dl> <dt><var>preprocessing-token:</var></dt> <dd><var>header-name</var></dd> <dd><var>identifier</var></dd> <dd><var>pp-number</var></dd> <dd><var>character-literal</var></dd> <dd><var>user-defined-character-literal</var></dd> <dd><var>string-literal</var></dd> <dd><var>user-defined-string-literal</var></dd> <dd><var>preprocessing-op-or-punc</var></dd> <dd>each non-white-space character that cannot be one of the above</dd> </dl> </blockquote> <p> Paragraph two is of special note. </p> <blockquote> <p> A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a <code>'</code> or a <code>"</code> character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.8), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. 
White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal. </p> </blockquote> <p> The implication here is that no valid C++ program should have an isolated single or double quote character. Unfortunately, that information is less useful than it might appear because an isolated single quote could be in use to signal an extension language interpretation. </p> <h3><a name="old.lex.ppnumber">2.10 Preprocessing numbers [lex.ppnumber]</a></h3> <p> The grammar is as follows. </p> <blockquote> <dl> <dt><var>pp-number:</var></dt> <dd><var>digit</var></dd> <dd><code>.</code> <var>digit</var></dd> <dd><var>pp-number digit</var></dd> <dd><var>pp-number nondigit</var></dd> <dd><var>pp-number</var> <code>e</code> <var>sign</var></dd> <dd><var>pp-number</var> <code>E</code> <var>sign</var></dd> <dd><var>pp-number</var> <code>.</code></dd> </dl> </blockquote> <p> We would like numeric literals to fit within this syntax, as it would require the least change to existing tools, e.g., editor syntax highlighting and mouse word grabbing. </p> <h3><a name="old.lex.name">2.11 Identifiers [lex.name]</a></h3> <p> The grammar is as follows. </p> <blockquote> <dl> <dt><var>nondigit:</var> one of</dt> <dd><code>a b c d e f g h i j k l m</code></dd> <dd><code>n o p q r s t u v w x y z</code></dd> <dd><code>A B C D E F G H I J K L M</code></dd> <dd><code>N O P Q R S T U V W X Y Z _</code></dd> <dt><var>digit:</var> one of</dt> <dd><code>0 1 2 3 4 5 6 7 8 9</code></dd> </dl> </blockquote> <p> The implication in this grammar is that ignored code must still be made up of valid tokens. </p> <h3><a name="old.lex.icon">2.14.2 Integer literals [lex.icon]</a></h3> <p> The grammar is as follows. 
</p> <blockquote> <dl> <dt><var>integer-literal:</var></dt> <dd><var>decimal-literal integer-suffix<sub>opt</sub></var></dd> <dd><var>octal-literal integer-suffix<sub>opt</sub></var></dd> <dd><var>hexadecimal-literal integer-suffix<sub>opt</sub></var></dd> <dt><var>decimal-literal:</var></dt> <dd><var>nonzero-digit</var></dd> <dd><var>decimal-literal digit</var></dd> <dt><var>octal-literal:</var></dt> <dd><code>0</code></dd> <dd><var>octal-literal octal-digit</var></dd> <dt><var>hexadecimal-literal:</var></dt> <dd><code>0x</code> <var>hexadecimal-digit</var></dd> <dd><code>0X</code> <var>hexadecimal-digit</var></dd> <dd><var>hexadecimal-literal hexadecimal-digit</var></dd> <dt><var>nonzero-digit:</var> one of</dt> <dd><code>1 2 3 4 5 6 7 8 9</code></dd> <dt><var>octal-digit:</var> one of</dt> <dd><code>0 1 2 3 4 5 6 7</code></dd> <dt><var>hexadecimal-digit:</var> one of</dt> <dd><code>0 1 2 3 4 5 6 7 8 9</code></dd> <dd><code>a b c d e f</code></dd> <dd><code>A B C D E F</code></dd> </dl> </blockquote> <p> This syntax is entirely contained within the <code>pp-number</code> syntax. </p> <h3><a name="old.lex.fcon">2.14.3 Floating literals [lex.fcon]</a></h3> <p> The grammar is as follows. 
</p> <blockquote> <dl> <dt><var>floating-literal:</var></dt> <dd><var>fractional-constant exponent-part<sub>opt</sub> floating-suffix<sub>opt</sub></var></dd> <dd><var>digit-sequence exponent-part floating-suffix<sub>opt</sub></var></dd> <dt><var>fractional-constant:</var></dt> <dd><var>digit-sequence<sub>opt</sub></var> <code>.</code> <var>digit-sequence</var></dd> <dd><var>digit-sequence</var> <code>.</code></dd> <dt><var>exponent-part:</var></dt> <dd><code>e</code> <var>sign<sub>opt</sub> digit-sequence</var></dd> <dd><code>E</code> <var>sign<sub>opt</sub> digit-sequence</var></dd> <dt><var>sign:</var> one of</dt> <dd><code>+ -</code></dd> <dt><var>digit-sequence:</var></dt> <dd><var>digit</var></dd> <dd><var>digit-sequence digit</var></dd> </dl> </blockquote> <p> This syntax is entirely contained within the <code>pp-number</code> syntax. </p> <h3><a name="old.lex.ext">2.14.8 User-defined literals [lex.ext]</a></h3> <p> The grammar is as follows. </p> <blockquote> <dl> <dt><var>user-defined-literal:</var></dt> <dd><var>user-defined-integer-literal</var></dd> <dd><var>user-defined-floating-literal</var></dd> <dd><var>user-defined-string-literal</var></dd> <dd><var>user-defined-character-literal</var></dd> <dt><var>user-defined-integer-literal:</var></dt> <dd><var>decimal-literal ud-suffix</var></dd> <dd><var>octal-literal ud-suffix</var></dd> <dd><var>hexadecimal-literal ud-suffix</var></dd> <dt><var>user-defined-floating-literal:</var></dt> <dd><var>fractional-constant exponent-part<sub>opt</sub> ud-suffix</var></dd> <dd><var>digit-sequence exponent-part ud-suffix</var></dd> <dt><var>user-defined-string-literal:</var></dt> <dd><var>string-literal ud-suffix</var></dd> <dt><var>user-defined-character-literal:</var></dt> <dd><var>character-literal ud-suffix</var></dd> <dt><var>ud-suffix:</var></dt> <dd><var>identifier</var></dd> </dl> </blockquote> <h3><a name="old.cpp">16 Preprocessing directives [cpp]</a></h3> <p> The grammar is as follows. 
</p> <blockquote> <dl> <dt><var>text-line:</var></dt> <dd><var>pp-tokens<sub>opt</sub> new-line</var></dd> <dt><var>pp-tokens:</var></dt> <dd><var>preprocessing-token</var></dd> <dd><var>pp-tokens preprocessing-token</var></dd> </dl> </blockquote> <p> The implication here is that <code>#if</code>-ignored program source must still be made up of valid preprocessor tokens, not arbitrary text. Many preprocessors will skip arbitrary text, though. </p> <h2><a name="Approaches">Approaches</a></h2> <p> There are several approaches to the solution. We evaluate them in turn. </p> <h3><a name="RemoveLit">Remove User-Defined Literals</a></h3> <p> At least Daveed Vandevoorde and N.M. Maclaren have suggested removing user-defined literals. However, removing a feature that we just introduced could be difficult. </p> <h3><a name="Typographic">Typographic</a></h3> <p> There are three primary typographic conventions for digit separators: a comma, a base-line dot, and a (thin) space. </p> <p> C++ already uses the comma for an operator, and using it for a digit separator would introduce ambiguities in expressions such as <code>++a-3,4-b++</code>, or even more simply, <code>f(12,345)</code>. </p> <p> C++ already uses the base-line dot as a radix point, and so it is essentially not usable as a digit separator. </p> <p> Bjarne Stroustrup has suggested using a space as a separator. </p> <ul> <li>Pronounce <code>7 237 498 123</code>.</li> <li>Compare <code>237 498 123</code> with <code>237 499 123</code> for equality.</li> <li>Decide whether <code>237 499 123</code> or <code>20 249 472</code> is larger.</li> </ul> <p> While this approach is consistent with one common typographic style, it suffers from some compatibility problems. </p> <ul> <li> It does not match the syntax for a <var>pp-number</var>, and would minimally require extending that syntax. </li> <li> More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. 
The preprocessor would not know whether to perform symbol substitution starting after the space. </li> <li> It would likely make editing tools that grab "words" less reliable. </li> </ul> <h3><a name="GraveAccent">Grave Accent</a></h3> <p> Ville Voutilainen, among others, suggests using a grave accent (`) (back tick) as a digit separator. </p> <ul> <li>Pronounce <code>7`237`498`123</code>.</li> <li>Compare <code>237`498`123</code> with <code>237`499`123</code> for equality.</li> <li>Decide whether <code>237`499`123</code> or <code>20`249`472</code> is larger.</li> </ul> <p> This character is not part of the C++ basic source character set. The proposal has the advantage that introducing it for this purpose cannot yield any ambiguity with existing C++ code. There are two disadvantages. First, using this character in the language invalidates any meta-languages using this character to distinguish between the C++ base layer and any meta information. Second, existing preprocessors would not recognize the grave accent as part of a preprocessor number, and may thus yield incorrect results. </p> <h3><a name="SingleQuote">Single Quote</a></h3> <p> Daveed Vandevoorde suggests using a single quote <a href="#N2747">[N2747]</a>. The single quote can be thought of as an "upper comma". </p> <ul> <li>Pronounce <code>7'237'498'123</code>.</li> <li>Compare <code>237'498'123</code> with <code>237'499'123</code> for equality.</li> <li>Decide whether <code>237'499'123</code> or <code>20'249'472</code> is larger.</li> </ul> <p> There are two problems with this approach. First, an odd number of single quotes would result in a line of text that does not meet the preprocessor syntax for a token. While most preprocessors do not tokenize lines that are ignored in <code>#if</code>/<code>#else</code>, some preprocessors are known to emit errors for such cases. Second, existing preprocessors would not recognize the single quote as part of a preprocessor number, and may thus yield incorrect results. 
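</p> <p> The second problem can be illustrated with a short sketch. The C++ function below is only an illustration and is not part of any proposal or of any real preprocessor: the name <code>pp_number_length</code> is hypothetical, and the code assumes plain ASCII input without universal-character-names. It returns the length of the longest prefix of its argument that matches the existing <var>pp-number</var> grammar of 2.10. </p>
<pre class="example"><code>#include &lt;cctype&gt;
#include &lt;cstddef&gt;
#include &lt;string_view&gt;

// Illustrative sketch only (hypothetical helper, ASCII input assumed).
// Returns the length of the longest prefix of `s` that matches the
// pp-number grammar of 2.10, or 0 if `s` does not start a pp-number.
std::size_t pp_number_length(std::string_view s) {
    auto is_digit = [](char c) {
        return std::isdigit(static_cast&lt;unsigned char&gt;(c)) != 0;
    };
    auto is_nondigit = [](char c) {
        return c == '_' || std::isalpha(static_cast&lt;unsigned char&gt;(c)) != 0;
    };

    std::size_t i = 0;
    // A pp-number starts with a digit, or with '.' followed by a digit.
    if (!s.empty() &amp;&amp; is_digit(s[0])) {
        i = 1;
    } else if (s.size() &gt;= 2 &amp;&amp; s[0] == '.' &amp;&amp; is_digit(s[1])) {
        i = 2;
    } else {
        return 0;
    }

    while (i &lt; s.size()) {
        const char c = s[i];
        // "pp-number e sign" and "pp-number E sign".
        if ((c == 'e' || c == 'E') &amp;&amp; i + 1 &lt; s.size() &amp;&amp;
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            i += 2;
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            i += 1;  // "pp-number digit", "pp-number nondigit", "pp-number ."
        } else {
            break;   // any other character, such as ' or `, ends the token
        }
    }
    return i;
}</code></pre>
<p> Under these assumptions, <code>pp_number_length("3.141'593")</code> returns 5, so a tokenizer following the existing grammar stops at the single quote, and <code>pp_number_length("7`237`498`123")</code> returns 1 for the same reason. By contrast, <code>pp_number_length("7_237_498_123")</code> returns 13, because the underscore is a <var>nondigit</var> and the whole spelling is already a single <var>pp-number</var>.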
</p> <p> Daveed Vandevoorde explains the incompatibility in more detail. </p> <blockquote> <p> For example: </p> <blockquote><pre><code>#if defined(__cplusplus) double pie = 3.141'593; #endif</code></pre></blockquote> <p> In C, the preprocessor-tokens that are <code>#if</code>'ed out are (not including the double quotes) "<code>double</code>", "<code>pie</code>", "<code>=</code>", "<code>3.141</code>", "<code>'</code>", "<code>593</code>", and "<code>;</code>". </p> <p> However, single and double quotes that aren't part of a larger <var>preprocessor-token</var> are deemed undefined behavior (C99, 6.4/3). </p> <p> Typical C compilers (GCC, clang, EDG, and MSVC for example) have no problem with it (presumably they don't try to tokenize #if'ed-out lines), but James Dennett mentioned at least one older C compiler didn't like it. </p> </blockquote> <p> Pete Becker points out that many tools, such as syntax highlighting in editors, rely on quotes being paired. The adaptability of the tools to new expressions is an open issue. </p> <p> N.M. Maclaren suggests that single quote will lead to very bad error messages with some macro-based libraries. </p> <h3><a name="Underscore">Underscore</a></h3> <p> The Ada programming language uses an underscore (technically, a low line) for the digit separator <a href="#AdaLRMnumlit">[AdaLRMnumlit]</a> <a href="#AdaRDnumlit">[AdaRDnumlit]</a>. This approach seems to be used in VHDL and Verilog, also possibly in Algol68. (VHDL also appears to have literal suffixes.) This approach has been proposed more than once for C++, going at least as far back as 1993 <a href="#N0259">[N0259]</a>. 
</p> <ul> <li>Pronounce <code>7_237_498_123</code>.</li> <li>Compare <code>237_498_123</code> with <code>237_499_123</code> for equality.</li> <li>Decide whether <code>237_499_123</code> or <code>20_249_472</code> is larger.</li> </ul> <p> In all known cases, the primary proposal has been to permit only a single underscore between digits <a href="#N0259">[N0259]</a> <a href="#N2281">[N2281]</a> <a href="#N3342">[N3342]</a>. However, <a href="#N0259">[N0259]</a> presents an option to permit underscores between the digit sequence and any prefix or suffix. </p> <p> Underscores work well as a digit separator for C++03 <a href="#N0259">[N0259]</a> <a href="#N2281">[N2281]</a>. But with C++11, there exists a potential ambiguity with user-defined literals <a href="#N2747">[N2747]</a>. While the likely resolution will be some form of "max munch" rule, some mechanism must be present to disambiguate when max munch is too much. We use the term suffix separator to indicate this mechanism. </p> <h4><a name="DoubleUnderscore">Double Underscore</a></h4> <p> <a href="#N2747">[N2747]</a> suggests a double underscore as a suffix separator. </p> <p> Mike Miller provides more detail. </p> <blockquote> <p> ... one possibility that occurs to me would be to allow a trailing underscore in an integer literal. The ambiguity with user-defined literals would be resolved in favor of the plain integer literal; a user could disambiguate a user-defined literal by ending the integer part with a trailing underscore. (Double underscores would not be permitted in an integer literal.) 
Thus: </p> <blockquote><p> <code>1_</code> => <code>1</code><br> <code>1_2</code> => <code>12</code><br> <code>1__2</code> => value <code>1</code> passed to <code>operator "" _2</code><br> <code>0xdead_bee_f</code> => <code>0xdeadbeef</code><br> <code>0xdead_bee__f</code> => value <code>0xdeadbee</code> passed to <code>operator "" _f</code> </p></blockquote> </blockquote> <p> The ambiguity with this approach arises when the suffix begins with one or more underscores. </p> <p> John Spicer suggests something slightly different. </p> <blockquote> <p> At some point I had suggested using underscore and having a special lookup rule so that something like <code>0xabc_de</code> would look for the "<code>de</code>" user-defined literal operator, and if not found, would treat the "<code>de</code>" as part of the hex literal. If you wanted to force the use of the operator, you could write <code>0xabc__de</code>. If you wanted to force the use of a <code>_de</code> operator, you would have to write <code>0xabc___de</code>. </p> <p> Another alternative would be to look for the "<code>de</code>" form and then the "<code>_de</code>" form if the first was not found. That way would only require the use of three underscores in cases where you had both a "<code>de</code>" and "<code>_de</code>" operator and wanted to force use of the second. </p> </blockquote> <h4><a name="ScopeOperator">Scope Operator</a></h4> <p> <a href="#N2747">[N2747]</a> suggests the scope operator (<code>::</code>) as a potential suffix separator. The scope operator would be a pure syntactic extension, as it could not otherwise follow a literal. However, it would make substrings of a literal separately subject to preprocessor symbol substitution. </p> <h4><a name="NonDigitSuffix">Non-Digit Literal Suffix</a></h4> <p> <a href="#N3342">[N3342]</a> suggests disallowing a leading underscore followed by a digit as a user-defined literal suffix. The intent was to make a suffix separator unnecessary. 
However, <a href="#N3448">[N3448]</a> points out that <a href="#N3342">[N3342]</a> fails to disambiguate hexadecimal digits, particularly in the example <code>0xdead_beef_db</code>, where <code>db</code> could be either decibel or the hexadecimal digits <code>d</code> and <code>b</code>. </p> <p> One could simply not allow user-defined literals with hexadecimal literals. However, this restriction is not desirable. </p> <h4><a name="Spacing">Spacing</a></h4> <p> Discussions in the October 2012 standards meeting settled on using whitespace as the suffix separator. Unfortunately, that approach causes parsing problems for Objective C/C++. </p> <p> Richard Smith explains. </p> <blockquote> <p> An Objective-C message send works like this: </p> <dl> <dt><var>message-expression:</var></dt> <dd><code>[</code> <var>expression message-selector</var> <code>]</code></dd> <dt><var>message-selector:</var></dt> <dd><var>identifier</var></dd> <dd><var>keyword-arguments</var></dd> <dt><var>keyword-arguments:</var></dt> <dd><var>identifier<sub>opt</sub> : expression keyword-arguments<sub>opt</sub></var></dd> </dl> <p> In particular, this is a valid Objective-C message send: </p> <pre><code>[self setValue: 0xff units: "cm"]</code></pre> <p> Hence any proposal which folds a <var>pp-number</var> followed by an identifier into a single literal will break a significant quantity of Objective-C code. </p> </blockquote> <p> Doug Gregor elaborates. </p> <blockquote> <p> There are two issues with allowing spaces between a literal and its suffix for Objective-C. One is a true ambiguity and one is a problem for error recovery. </p> <p> The true ambiguity occurs because one can omit a parameter name from the method declaration, in which case there is no identifier before the ':' in the call. For example, one could have a message send that looks like this: </p> <pre><code>[a method:10 :11]</code></pre> <p> which calls the method "<code>method::</code>". 
Now, consider </p> <pre><code>[a method:10 _suffix:11]</code></pre> <p> Currently, this parses (unambiguously) as a message send to "<code>method:_suffix:</code>", i.e., it's parsed as </p> <pre><code>[a method:(10) _suffix:11] // _suffix is the name of the second argument; calls method:_suffix:</code></pre> <p> However, if we allow a space between a literal and its suffix, there is a second potential parse: </p> <pre><code>[a method:(10_suffix) :11] // _suffix is a suffix to the literal 10; calls method::</code></pre> <p> which is completely ambiguous. </p> <p> The error-recovery issue is that Objective-C(++) parsers tend to rely heavily on the fact that an expression in C/C++ cannot be immediately followed by an identifier. If we see an expression followed by an identifier in an expression context, it's fairly likely that this is a message send for which the '[' has been dropped. For example, Clang detects these cases and automatically inserts the '[' for the user; this was one of the top error-recovery requests, and a regression here would be considered a major problem for our users. </p> </blockquote> <h4><a name="DoubleRadixPoint">Double Radix Point</a></h4> <p> Jeremiah Willcock suggests using "<code>..</code>" as the suffix separator. This notation is already permitted by the <var>pp-number</var> syntax. It is also not presently permitted by any numeric literal. Its primary disadvantage seems to be that it is unfamiliar. </p> <h4><a name="Backslash">Backslash</a></h4> <p> Clark Nelson suggests using "<code>\</code>" as the suffix separator. This notation is not permitted by the <var>pp-number</var> syntax. It is also not presently permitted by any numeric literal. </p> <h2><a name="Proposal">Proposal</a></h2> <p> In this section we present likely wording edits, parameterized by the possible choices. </p> <h3><a name="new.lex.ppnumber">2.10 Preprocessing numbers [lex.ppnumber]</a></h3> <p> Edit the grammar as follows. 
Note that the additional rule for <var>pp-number</var> may not be necessary, depending on the specific chosen format. </p> <blockquote> <dl> <dt><ins><var>digit-separator:</var></ins></dt> <dd><ins><var><strong>to be determined</strong></var></ins></dd> <dt><var>pp-number:</var></dt> <dd><var>digit</var></dd> <dd><code>.</code> <var>digit</var></dd> <dd><var>pp-number digit</var></dd> <dd><var>pp-number nondigit</var></dd> <dd><var>pp-number</var> <code>e</code> <var>sign</var></dd> <dd><var>pp-number</var> <code>E</code> <var>sign</var></dd> <dd><var>pp-number</var> <code>.</code></dd> <dd><ins><var>pp-number digit-separator</var></ins></dd> </dl> </blockquote> <h3><a name="new.lex.icon">2.14.2 Integer literals [lex.icon]</a></h3> <p> Edit the grammar as follows. </p> <blockquote> <dl> <dt><var>integer-literal:</var></dt> <dd><var>decimal-literal integer-suffix<sub>opt</sub></var></dd> <dd><var>octal-literal integer-suffix<sub>opt</sub></var></dd> <dd><var>hexadecimal-literal integer-suffix<sub>opt</sub></var></dd> <dt><var>decimal-literal:</var></dt> <dd><var>nonzero-digit</var></dd> <dd><var>decimal-literal <ins>digit-separator<sub>opt</sub></ins> digit</var></dd> <dt><var>octal-literal:</var></dt> <dd><code>0</code></dd> <dd><var>octal-literal <ins>digit-separator<sub>opt</sub></ins> octal-digit</var></dd> <dt><var>hexadecimal-literal:</var></dt> <dd><code>0x</code> <var>hexadecimal-digit</var></dd> <dd><code>0X</code> <var>hexadecimal-digit</var></dd> <dd><var>hexadecimal-literal <ins>digit-separator<sub>opt</sub></ins> hexadecimal-digit</var></dd> <dt><var>nonzero-digit:</var> one of</dt> <dd><code>1 2 3 4 5 6 7 8 9</code></dd> <dt><var>octal-digit:</var> one of</dt> <dd><code>0 1 2 3 4 5 6 7</code></dd> <dt><var>hexadecimal-digit:</var> one of</dt> <dd><code>0 1 2 3 4 5 6 7 8 9</code></dd> <dd><code>a b c d e f</code></dd> <dd><code>A B C D E F</code></dd> </dl> </blockquote> <p> Edit paragraph 1 as follows. 
Note that each <code><b>?</b></code> will be replaced by the actual chosen digit separator character(s). </p> <blockquote> <p> An <dfn>integer literal</dfn> is a sequence of digits that has no period or exponent part<ins>, with optional digit separators. These separators are ignored when determining its value</ins>. .... [<i>Example:</i> <del>the</del> <ins>The</ins> number twelve can be written <code>12</code>, <code>014</code>, or <code>0XC</code>. <ins>The literals <code>1048576</code>, <code>1<b>?</b>048<b>?</b>576</code>, <code>0X100000</code>, <code>0x10<b>?</b>0000</code>, and <code>0<b>?</b>004<b>?</b>000<b>?</b>000</code> all have the same value.</ins> —<i>end example</i>] </p> </blockquote> <h3><a name="new.lex.fcon">2.14.4 Floating literals [lex.fcon]</a></h3> <p> Edit the grammar as follows. </p> <blockquote> <dl> <dt><var>floating-literal:</var></dt> <dd><var>fractional-constant exponent-part<sub>opt</sub> floating-suffix<sub>opt</sub></var></dd> <dd><var>digit-sequence exponent-part floating-suffix<sub>opt</sub></var></dd> <dt><var>fractional-constant:</var></dt> <dd><var>digit-sequence<sub>opt</sub></var> <code>.</code> <var>digit-sequence</var></dd> <dd><var>digit-sequence</var> <code>.</code></dd> <dt><var>exponent-part:</var></dt> <dd><code>e</code> <var>sign<sub>opt</sub> digit-sequence</var></dd> <dd><code>E</code> <var>sign<sub>opt</sub> digit-sequence</var></dd> <dt><var>sign:</var> one of</dt> <dd><code>+ -</code></dd> <dt><var>digit-sequence:</var></dt> <dd><var>digit</var></dd> <dd><var>digit-sequence <ins>digit-separator<sub>opt</sub></ins> digit</var></dd> </dl> </blockquote> <p> Edit within paragraph 1 as follows. Note that each <code><b>?</b></code> will be replaced by the actual chosen digit separator character(s). </p> <blockquote> <p> .... The integer and fraction parts both consist of a sequence of decimal (base ten) digits<ins>, with optional digit separators</ins>. <ins>These separators are ignored when determining its value. 
[<i>Example:</i> The literals <code>1.602<b>?</b>176<b>?</b>565e-19</code> and <code>1.602176565e-19</code> have the same value. —<i>end example</i>]</ins> .... </p> </blockquote> <h3><a name="new.lex.ext">2.14.8 User-defined literals [lex.ext]</a></h3> <p> Edit the grammar as follows. </p> <blockquote> <dl> <dt><var>user-defined-literal:</var></dt> <dd><var>user-defined-integer-literal</var></dd> <dd><var>user-defined-floating-literal</var></dd> <dd><var>user-defined-string-literal</var></dd> <dd><var>user-defined-character-literal</var></dd> <dt><var>user-defined-integer-literal:</var></dt> <dd><var>decimal-literal <del>ud-suffix</del> <ins>separated-suffix</ins></var></dd> <dd><var>octal-literal <del>ud-suffix</del> <ins>separated-suffix</ins></var></dd> <dd><var>hexadecimal-literal <del>ud-suffix</del> <ins>separated-suffix</ins></var></dd> <dt><var>user-defined-floating-literal:</var></dt> <dd><var>fractional-constant exponent-part<sub>opt</sub> <del>ud-suffix</del> <ins>separated-suffix</ins></var></dd> <dd><var>digit-sequence exponent-part <del>ud-suffix</del> <ins>separated-suffix</ins></var></dd> <dt><var>user-defined-string-literal:</var></dt> <dd><var>string-literal ud-suffix</var></dd> <dt><var>user-defined-character-literal:</var></dt> <dd><var>character-literal ud-suffix</var></dd> <dt><ins><var>separated-suffix:</var></ins></dt> <dd><ins><var>literal-separator<sub>opt</sub> ud-suffix</var></ins></dd> <dt><ins><var>literal-separator:</var></ins></dt> <dd><ins><var><strong>to be determined</strong></var></ins></dd> <dt><var>ud-suffix:</var></dt> <dd><var>identifier</var></dd> </dl> </blockquote> <p> Edit paragraph 1 as follows. Note that each <code><b>?</b></code> will be replaced by the actual chosen digit separator character(s) and each <code><b>??</b></code> will be replaced by the actual chosen literal separator character(s). 
</p> <blockquote> <p> If a token matches both <var>user-defined-literal</var> and another literal kind, it is treated as the latter. [<i>Example:</i> <code>123_km</code> <ins>and <code>123<b>??</b>km</code></ins> <del>is a <var>user-defined-literal</var></del> <ins>are <var>user-defined-literal</var>s</ins>, but <ins><code>123<b>?</b>456</code> and</ins> <code>12LL</code> <del>is an <var>integer-literal</var></del> <ins>are <var>integer-literal</var>s</ins>. —<i>end example</i>] .... </p> </blockquote> <h2><a name="References">References</a></h2> <dl> <dt><a name="N0259">[N0259]</a></dt> <dd> <cite>A proposal to allow Binary Literals, and some other small changes to Chapter 2: Lexical Conventions</cite>, John Max Skaller, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/1993/N0259.pdf"> N0259</a>, 1993-03-26 </dd> <dt><a name="N2281">[N2281]</a></dt> <dd> <cite>Digit Separators</cite>, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2281.html"> N2281</a>, 2007-05-02 </dd> <dt><a name="N2747">[N2747]</a></dt> <dd> <cite>Ambiguity and Insecurity with User-Defined Literals</cite>, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2747.html"> N2747</a>, 2008-08-24 </dd> <dt><a name="N2765">[N2765]</a></dt> <dd> <cite>User-defined Literals (aka. 
Extensible Literals (revision 5))</cite>, Ian McIntosh, Michael Wong, Raymond Mak, Robert Klarer, Jens Maurer, Alisdair Meredith, Bjarne Stroustrup, David Vandevoorde, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2765.pdf"> N2765</a>, 2008-09-18 </dd> <dt><a name="N3250">[N3250]</a></dt> <dd> <cite>US-18: Removing User-Defined Literals</cite>, Douglas Gregor, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3250.html"> N3250</a>, 2011-02-28 </dd> <dt><a name="N3342">[N3342]</a></dt> <dd> <cite>Digit Separators coming back</cite>, Jens Maurer, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3342.html"> N3342</a>, 2012-01-09 </dd> <dt><a name="N3402">[N3402]</a></dt> <dd> <cite>User-defined Literals for Standard Library Types</cite>, Peter Sommerlad, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3402.html"> N3402</a>, 2012-09-07 </dd> <dt><a name="N3448">[N3448]</a></dt> <dd> <cite>Painless Digit Separation</cite>, Daveed Vandevoorde, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3448.pdf"> N3448</a>, 2012-09-21 </dd> <dt><a name="N3472">[N3472]</a></dt> <dd> <cite>Binary Literals in the C++ Core Language</cite>, James Dennett, ISO/IEC JTC1 SC22 WG21 <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3472.pdf"> N3472</a>, 2012-10-19 </dd> <dt><a name="AdaLRMnumlit">[AdaLRMnumlit]</a></dt> <dd> <cite>Ada '83 Language Reference Manual</cite>, Section 2.4 Numeric Literals, <a href="http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4"> http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4</a> </dd> <dt><a name="AdaRDnumlit">[AdaRDnumlit]</a></dt> <dd> <cite>Rationale for the Design of the Ada Programming Language</cite>, Section 2.1 Lexical Structure, <a href="http://archive.adaic.com/standards/83rat/html/ratl-02-01.html#2.1"> 
http://archive.adaic.com/standards/83rat/html/ratl-02-01.html#2.1</a> </dd> </dl> </body> </html>