CINXE.COM
Regular Expression
<!DOCTYPE html> <html lang='en'> <head> <meta charset='utf-8'> <meta name='viewport' content='width=device-width, initial-scale=1'> <meta name='description' content='Tclers wiki'> <meta name='author' content=''> <link rel='icon' href='/img/favicon.ico'> <title>Regular Expression</title> <!-- Latest compiled and minified CSS --> <link rel='stylesheet' href='https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css'> <link rel='stylesheet' href='/css/nikit.css' type='text/css'> <link rel='stylesheet' href='/css/sh_style.css' type='text/css'> <link rel='stylesheet' href='https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.0/css/theme.bootstrap_3.min.css' type='text/css'> <script src='//cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.0/clipboard.min.js'></script> </head> <body onload='sh_highlightDocument(); nikitUser();'> <nav class='navbar navbar-fixed-top navbar-inverse'> <div class='container'> <div class='navbar-header'> <button type='button' class='navbar-toggle' data-toggle='collapse' data-target='#myNavbar'> <span class='icon-bar'></span> <span class='icon-bar'></span> <span class='icon-bar'></span> </button> <ul class='nav navbar-nav'> <li class='dropdown'> <a class='dropdown-toggle' data-toggle='dropdown' href='#'> Tcler's Wiki<span class='caret'></span></a> <ul class='dropdown-menu scrollable-menu' role='menu'><li><a href='/welcome'>Home</a></li> <li><a rel='nofollow' href='/recent'>Changes</a></li> <li><a rel='nofollow' href='/_random'>Random page</a></li> <li><a rel='nofollow' href='/_new'>New page</a></li> </ul> </li> </ul> </div> <div class='collapse navbar-collapse' id='myNavbar'> <ul class='nav navbar-nav'> <li class='dropdown' id='li_idPageEdit' style='display:none'> <a class='dropdown-toggle' data-toggle='dropdown' href='#'><span id=name_idPageEdit>Page</span><span class='caret'></span></a> <ul class='dropdown-menu scrollable-menu' role='menu'><li><a rel='nofollow' href='/_edit/Regular+Expression?A=1'>Comment</a></li> <li><a rel='nofollow' hidden='true' href='/_edit/Regular+Expression'>Edit</a></li> <li><a rel='nofollow' href='/_upload/Regular+Expression'>Upload</a></li> <li><a rel='nofollow' href='/ref/Regular+Expression'>References</a></li> <li><a rel='nofollow' href='/history/Regular+Expression'>History</a></li> <li><hr></li> <li><a href='#8a7583448005a20dbc9ffa2cf18c9f1625193a3d96398eafd18610156d98f8ba'> See Also </a> </li> <li><a href='#924aa262a03933db391d2ae9c41fd89e48b590d7b2ccc91da65a3a04644f3ebd'> Resources </a> </li> <li><a href='#47742718299e82699bc374b81e6ab6b22c2b19c674703379e03143e714e9e3c1'> Description </a> </li> <li><a href='#6fe7befa7cf9f455e06bd2a496637dae411603e4ee38602ea3fd40ac0464c08e'> How Regular Expression Features are Implemented using Finite Automata </a> </li> <li><a href='#eb07bde6001cedda8d6ad5e154981eb9a7eeb8108be12da82987a17bcc670698'> Search mode regexps</a> </li> <li><a href='#dce6c589227eda7f3ef1322740725ee4c3fb1f2bc4bc2bfc34c3b42b4b9ac4ff'> Submatch capturing </a> </li> <li><a href='#d078d94f6528c50e7e1a5420a56fe2ec27569cf1d2447f9d111945f7581f24c9'> Boolean AND / language intersection </a> </li> <li><a href='#9950ab117254aabebc06e4e404603e881ef9ce76057eac975853154eca4b9a2c'> Boolean NOT / language complementation </a> </li> <li><a href='#b5bbfa2212f166d216ea7c26f532b375043354d622e42111134ebdd94bd388c0'> Lookahead/-behind constraints </a> </li> <li><a href='#0fc92bade57dd836ac53aa9e06f71a817a234af8fa3d3f00a9a1270dcc65e669'> Regular expression puzzles </a> </li> <li><a href='#d2867aa1281341e2e3c60281019f5ab8cb5ae297f2cd3f1fe487586bbd979a2b'> Regular expression flavours </a> </li> </ul> </li> <li class='dropdown' id='li_idPageNoEdit' style='display:none'> <a class='dropdown-toggle' data-toggle='dropdown' href='#'><span id=name_idPageNoEdit>Page</span><span class='caret'></span></a> <ul class='dropdown-menu scrollable-menu' role='menu'><li><a rel='nofollow' href='/ref/Regular+Expression'>References</a></li> <li><a rel='nofollow' href='/history/Regular+Expression'>History</a></li> <li><hr></li> <li><a href='#8a7583448005a20dbc9ffa2cf18c9f1625193a3d96398eafd18610156d98f8ba'> See Also </a> </li> <li><a href='#924aa262a03933db391d2ae9c41fd89e48b590d7b2ccc91da65a3a04644f3ebd'> Resources </a> </li> <li><a href='#47742718299e82699bc374b81e6ab6b22c2b19c674703379e03143e714e9e3c1'> Description </a> </li> <li><a href='#6fe7befa7cf9f455e06bd2a496637dae411603e4ee38602ea3fd40ac0464c08e'> How Regular Expression Features are Implemented using Finite Automata </a> </li> <li><a href='#eb07bde6001cedda8d6ad5e154981eb9a7eeb8108be12da82987a17bcc670698'> Search mode regexps</a> </li> <li><a href='#dce6c589227eda7f3ef1322740725ee4c3fb1f2bc4bc2bfc34c3b42b4b9ac4ff'> Submatch capturing </a> </li> <li><a href='#d078d94f6528c50e7e1a5420a56fe2ec27569cf1d2447f9d111945f7581f24c9'> Boolean AND / language intersection </a> </li> <li><a href='#9950ab117254aabebc06e4e404603e881ef9ce76057eac975853154eca4b9a2c'> Boolean NOT / language complementation </a> </li> <li><a href='#b5bbfa2212f166d216ea7c26f532b375043354d622e42111134ebdd94bd388c0'> Lookahead/-behind constraints </a> </li> <li><a href='#0fc92bade57dd836ac53aa9e06f71a817a234af8fa3d3f00a9a1270dcc65e669'> Regular expression puzzles </a> </li> <li><a href='#d2867aa1281341e2e3c60281019f5ab8cb5ae297f2cd3f1fe487586bbd979a2b'> Regular expression flavours </a> </li> </ul> </li> <li><a href="/page/Showcase">Showcase</a></li> <li><a href="/page/Tcl+Tutorial+Lesson+0">Tutorial</a></li> <li><a href="/page/Articles">Articles</a></li> <li><a href="/page/Tcl+Playground">Playground</a></li> <li class='dropdown'> <a class='dropdown-toggle' data-toggle='dropdown' href='#'> Help<span class='caret'></span></a> <ul class='dropdown-menu scrollable-menu' role='menu'><li><a rel='nofollow' href='/page/Help'>Page Markup</a></li> <li><a rel='nofollow' href='/page/How+do+Wiki+Categories+work'>Wiki Categories</a></li> <li><a rel='nofollow' href='/page/Contents'>Topics</a></li> <li><a rel='nofollow' target='_blank' href='https://chiselapp.com/user/stevel/repository/nikit/ticket'>Report Problems</a></li> <li><a rel='nofollow' href='/privacy'>Privacy</a></li> <li><a rel='nofollow' href='/license'>License</a></li> </ul> </li> </ul> <ul class='nav navbar-nav navbar-right'> <li class='dropdown'> <a class='dropdown-toggle' data-toggle='dropdown' href='#'><span id=name_SMenu>User</span><span class='caret'></span></a> <ul class='dropdown-menu' id='ul_SMenu'> </ul> </li> </ul> <form class='navbar-form navbar-right' method='post' action='/search' id='searchform'> <input name='Q' type='text' class='form-control' placeholder='Search...'/> <input type="hidden" name="sites" value="wiki.tcl-lang.org"/> </form> </div> </div> </nav> <div class='container'> <div class='row'> <div class='col-xs-12'> <h2>Regular Expression</h2> </div> </div> <div class='row'> <div class='col-xs-12'> <p class='mkup_p'>A <b class='mkup_b'>regular expression</b> defines a pattern in text. This page is about regular expressions in general, not necessarily Tcl <a class='mkup_a mkup_known' href='/page/Regular+Expressions'>regular Expressions</a> in Tcl.</p> <h2 id='8a7583448005a20dbc9ffa2cf18c9f1625193a3d96398eafd18610156d98f8ba' class='mkup_h1'> See Also </h2><dl class='mkup_dl'><dt class='mkup_dt'><a class='mkup_a mkup_known' href='/page/Regular+Expressions'>Regular Expressions</a></dt><dd class='mkup_dd'>describes Tcl's regular expressions (<a class='mkup_a mkup_known' href='/page/ARE'>advanced regular expressions</a>)</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html'>The Open Group Base Specifications Issue 7, Regular Expressions <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, AKA <a class='mkup_a mkup_known' href='/page/posix'>POSIX</a></dt><dd class='mkup_dd'></dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://www.regular-expressions.info/'>regularexpressions.info <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></dt><dd class='mkup_dd'>lots of general information, and a thorough tutorial with examples (<a class='mkup_a mkup_known' href='/page/CL'>CL</a>: although its imprecisions exasperate). See the <a rel='nofollow' class='mkup_a' href='http://regular-expressions.info/examples.html'>examples <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a> for an extensive listing of regulare expressions for various tasks.</dd></dl> <dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href="https://en.wikipedia.org/wiki/Regular_language">Regular Language, Wikipedia <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></dt><dd class='mkup_dd'>a formal definitions of "regular language".</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a class='mkup_a mkup_known' href='/page/BOOK+Mastering+Regular+Expressions'>BOOK Mastering Regular Expressions</a>, by <a class='mkup_a mkup_known' href='/page/Jeffrey+E%2E+F%2E+Friedl'>Jeffrey E. F. Friedl</a></dt><dd class='mkup_dd'>considered almost the definitive tome on regular expressions, from the Unix <b class='mkup_b'>grep</b>(1) command to Tcl and beyond.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://blog.staffannoteberg.com/2013/01/30/regular-expressions-a-brief-history/'>Regular Expressions - a brief history <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, by Staffan Nöteberg, 2013-01-30</dt><dd class='mkup_dd'></dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://www2.imm.dtu.dk/~phbi/files/talks/2012remhsacS.pdf'>Regular Expression Matching: History, Status, and Challenges <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, Philip Bille</dt><dd class='mkup_dd'></dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://web.archive.org/web/20120704232743/http://zez.org/article/articleprint/11'>Regular Expressions explained <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, Jan Borsodi, 2000-10-30</dt><dd class='mkup_dd'></dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://www.onlamp.com/pub/a/onlamp/2003/08/21/regexp.html'>Five Habits for Successful Regular Expressions <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, Tony Stubblebine, 2003-08-21</dt><dd class='mkup_dd'></dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://www.ibm.com/developerworks/aix/library/au-regexp/'>Know your regular expressions <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>, Michael Stutz, 2007-06-14</dt><dd class='mkup_dd'>features a flexible RE generator</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://www.regexlib.com/'>regexlib.com <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></dt><dd class='mkup_dd'>a community maintained library of regular expressions</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://unicode.org/reports/tr18/'>Unicode Regular Expressions <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></dt><dd class='mkup_dd'>Trippin' balls over at the Unicode Consortium.</dd></dl> <h2 id='924aa262a03933db391d2ae9c41fd89e48b590d7b2ccc91da65a3a04644f3ebd' class='mkup_h1'> Resources </h2><dl class='mkup_dl'><dt class='mkup_dt'><a rel='nofollow' class='mkup_a' href='http://regexplorer.sourceforge.net/'>RegExplorer <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></dt><dd class='mkup_dd'></dd></dl> <h2 id='47742718299e82699bc374b81e6ab6b22c2b19c674703379e03143e714e9e3c1' class='mkup_h1'> Description </h2><p class='mkup_p'>Regular expressions were developed by Stephen Kleene in the 1950's for the purpose of specifying a regular language. Many tasks can be performed more simply using <span class='mkup_tt'><a class='mkup_a mkup_known' href='/page/string'>string</a></span> and <span class='mkup_tt'><a class='mkup_a mkup_known' href='/page/scan'>scan</a></span> instead of regular expressions. A regular expression often looks like a summary of the various forms that the set of strings it describes may take:</p><div class='sh_sourceCode'><button class='copybtn btn pull-right' data-clipboard-target='#mkup_code_0' title='Click to copy code snippet to clipboard'><span class='glyphicon glyphicon-copy' aria-hidden='true'></span></button><pre id='mkup_code_0' class='mkup_pre sh_sourceCode'>A((b|cc)a)*</pre></div><p class='mkup_p'>The following strings are from the infinite set of matches for this expression:</p><div class='sh_sourceCode'><button class='copybtn btn pull-right' data-clipboard-target='#mkup_code_1' title='Click to copy code snippet to clipboard'><span class='glyphicon glyphicon-copy' aria-hidden='true'></span></button><pre id='mkup_code_1' class='mkup_pre sh_sourceCode'>A Aba Acca Ababa Abacca Accaba Accacca Abababa</pre></div><p class='mkup_p'>Regular expressions are popular for their linear-time linear-time complexity: when a given string is to be matched against a regular expression, it is possible to do it so that every character in the string is only looked at once. This is attained by compiling the regular expression into a <a class='mkup_a mkup_known' href='/page/finite+automaton'>finite automaton</a> — potentially a big chunk of work, but one that only needs to be done once for each regular expression — and then running the automaton with the string as input.</p><p class='mkup_p'>Another important factor is that regular expressions can be used for efficiently <i class='mkup_i'>searching</i> through a large body of text. A direct implementation of the above would produce an algorithm for <i class='mkup_i'>matching</i> a string against a regular expression, but most RE implementations, including <span class='mkup_tt'><a class='mkup_a mkup_known' href='/page/regexp'>regexp</a></span>, play a few tricks internally that make them operate in search mode instead. In order to get matching behaviour (often useful with <a class='mkup_a mkup_known' href='/page/switch'>switch</a> -regexp), one uses the <i class='mkup_i'>anchors</i> <span class='mkup_tt'>^</span> and <span class='mkup_tt'>$</span> to require that the particular position in the regular expression must correspond to the beginning and end of the string respectively (caveat: sometimes it is beginning and end of line instead; <a class='mkup_a mkup_known' href='/page/ARE'>ARE</a>s have <span class='mkup_tt'>\A</span> and <span class='mkup_tt'>\Z</span> as alternatives).</p><p class='mkup_p'>Other common extensions to the regular expression syntax, which however doesn't make them any more powerful than the basic set described above, are:</p><dl class='mkup_dl'><dt class='mkup_dt'>One-or-more quantifier (<span class='mkup_tt'>+</span>)</dt><dd class='mkup_dd'>Similar to <span class='mkup_tt'>*</span>, but excluding the case of no repetition. The RE “<span class='mkup_tt'>(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)+</span>” is equivalent to “<span class='mkup_tt'>(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)*</span>” and “<span class='mkup_tt'>(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)*(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>”.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Optional quantifier (<span class='mkup_tt'>?</span>)</dt><dd class='mkup_dd'>Also called the zero-or-one quantifier. The RE “<span class='mkup_tt'>(</span><i class='mkup_i'>re</i><span class='mkup_tt'>)?</span>” is equivalent to “<span class='mkup_tt'>(</span><i class='mkup_i'>re</i><span class='mkup_tt'>|)</span>”.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Bracket expression (<span class='mkup_tt'>[</span><i class='mkup_i'>chars</i><span class='mkup_tt'>]</span>)</dt><dd class='mkup_dd'>Effectively a shorthand — e.g. <span class='mkup_tt'>[abcd]</span> is equivalent to <span class='mkup_tt'>(a|b|c|d)</span> — but often far more compact. Unicode character classes often include thousands of character, so enumerating all of them would be unfeasible.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Counted quantifiers (<span class='mkup_tt'>{</span><i class='mkup_i'>n</i><span class='mkup_tt'>}</span> or <span class='mkup_tt'>{</span><i class='mkup_i'>m</i><span class='mkup_tt'>,</span><i class='mkup_i'>n</i><span class='mkup_tt'>}</span>)</dt><dd class='mkup_dd'>Exactly <i class='mkup_i'>n</i> occurrencies, or at least <i class='mkup_i'>m</i> and at most <i class='mkup_i'>n</i> occurrencies, respectively. Can as <span class='mkup_tt'>?</span> be reduced to combinations of parentheses and <span class='mkup_tt'>|</span>, but requires repeating the core RE the quantifier is applied to at least <i class='mkup_i'>n</i> times.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>AND</dt><dd class='mkup_dd'>Boolean "and", like <span class='mkup_tt'>|</span> is boolean "or". Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the intersection of two regular languages is again a regular language. The <a class='mkup_a mkup_known' href='/page/grammar%5Ffa'>grammar_fa</a> package uses <span class='mkup_tt'>&</span> to denote this.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>NOT</dt><dd class='mkup_dd'>Boolean negation. Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the complement of a regular language is also a regular language. The <a class='mkup_a mkup_known' href='/page/grammar%5Ffa'>grammar_fa</a> package uses <span class='mkup_tt'>!</span> to denote this.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Reversal</dt><dd class='mkup_dd'>Change the direction in which the string is being matched; this may be useful to implement searching backwards in a text editor. There is no common syntax for this, but by hand transforming a regular expression accordingly is typically straightforward.</dd></dl><p class='mkup_p'>A somewhat intriguing class of such extensions are the <i class='mkup_i'>constraints</i>, which only match the empty string but refuse to do so unless the material surrounding the position of this match satisfies some condition. Here we find:</p><dl class='mkup_dl'><dt class='mkup_dt'>Positive lookahead (<span class='mkup_tt'>(?=</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>)</dt><dd class='mkup_dd'>The regular expression <i class='mkup_i'>re</i> must match the text that follows after this position. This is similar to AND, but different in that <i class='mkup_i'>re1</i> and <i class='mkup_i'>re2</i> in <span class='mkup_tt'>((?=</span><i class='mkup_i'>re1</i><span class='mkup_tt'>)</span><i class='mkup_i'>re2</i><span class='mkup_tt'>)</span> are not required to match the same characters; <i class='mkup_i'>re1</i> can match a prefix of what <i class='mkup_i'>re2</i> matches, or vice versa.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Negative lookhead (<span class='mkup_tt'>(?!</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>)</dt><dd class='mkup_dd'>The regular expression <i class='mkup_i'>re</i> must <i class='mkup_i'>not</i> match the text that follows after this position.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Positive lookbehind (<span class='mkup_tt'>(?<=</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>)</dt><dd class='mkup_dd'>The regular expression <i class='mkup_i'>re</i> must match the text that comes before this position. This is not available in <a class='mkup_a mkup_known' href='/page/ARE'>ARE</a>s.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Negative lookbehind (<span class='mkup_tt'>(?<!</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>)</dt><dd class='mkup_dd'>The regular expression <i class='mkup_i'>re</i> must not match the text that comes before this position. This is not available in <a class='mkup_a mkup_known' href='/page/ARE'>ARE</a>s.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Beginning of word (<span class='mkup_tt'>\m</span>), end of word (<span class='mkup_tt'>\M</span>)</dt><dd class='mkup_dd'>These are obvious combinations of lookhead and lookbehind constraints, where one checks e.g. that the next character is a word character and the previous character was not.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Beginning or end of word (<span class='mkup_tt'>\y</span>), not beginning or end of word (<span class='mkup_tt'>\Y</span>)</dt><dd class='mkup_dd'>Slightly more (but only marginally so) complicated combinations of lookahead and lookbehind.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Beginning of line (<span class='mkup_tt'>^</span>), end of line (<span class='mkup_tt'>$</span>)</dt><dd class='mkup_dd'>Similar to beginning and end of word.</dd></dl><p class='mkup_p'>The Tcl regexp engine handles lookaheads by compiling and running the constraint RE separately; in this way it is a "hybrid" RE engine.</p><p class='mkup_p'>Another set of usual extensions to the syntax concern submatch extraction and greediness. These mainly become meaningful when regular expressions are used for <i class='mkup_i'>searching</i>, as there in that case often are several substrings that match a particular RE, and it matters what match is reported.</p><dl class='mkup_dl'><dt class='mkup_dt'>Non-greedy quantifiers</dt><dd class='mkup_dd'>Conventionally one lets quantifiers default to being greedy (match as much as possible), and introduce non-greedy quantifiers (match as little as possible) as variants; usual syntax is that a greedy quantifier followed by a <span class='mkup_tt'>?</span> becomes the corresponding non-greedy quantifier. Implementation-wise, the algorithm for non-greedy searching is easier than that for greedy searching.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Noncapturing parentheses</dt><dd class='mkup_dd'>Usual notation is <span class='mkup_tt'>(?:</span><i class='mkup_i'>re</i><span class='mkup_tt'>)</span>. The same as ordinary parentheses for searching and matching purposes, but tells the RE engine that it doesn't have to keep track of what range of a match corresponds to this parenthesis. Again, it is more work to keep track of this than to ignore it, but tradition dictates that parentheses should default to being capturing.</dd></dl><p class='mkup_p'>Many "regular expression" engines also support extensions to the syntax which allow them to go <i class='mkup_i'>beyond</i> the realm of regular languages. This is most common in backtracking RE engines (such as <a class='mkup_a mkup_known' href='/page/PCRE'>PCRE</a> and the <a class='mkup_a mkup_known' href='/page/Perl'>Perl</a> engine) which ignores the finite automaton theory and instead uses trial and error to find a match, but some have escaped into more general standards.</p><dl class='mkup_dl'><dt class='mkup_dt'>Back references (<span class='mkup_tt'>\</span><i class='mkup_i'>n</i>)</dt><dd class='mkup_dd'>Matches the exact substring matched by the <i class='mkup_i'>n</i>th capturing parenthesis in the RE. This is labeled <b class='mkup_b'>the Feature from the Black Lagoon</b> in the sources for the Tcl regexp engine. The (only?) non-regular feature found in the regular expressions of the <a class='mkup_a mkup_known' href='/page/posix'>POSIX</a> standard. Practical applications include parsing data where one can introduce custom separator strings.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'></dt><dd class='mkup_dd'>To see the power of back references, one may observe that they can be used to recognise <a class='mkup_a mkup_known' href='/page/Primes'>primes</a>. <span class='mkup_tt'>^(oo+?)\1+$</span> will match a string of <i class='mkup_i'>n</i> <span class='mkup_tt'>o</span>'s if and only if <i class='mkup_i'>n</i> is a product <i class='mkup_i'>ab</i> for integers <i class='mkup_i'>a,b</i> ≥ 2; <i class='mkup_i'>a</i> is the length of submatch 1 and <i class='mkup_i'>b</i> is the number of times it is repeated.</dd></dl><dl class='mkup_dl'><dt class='mkup_dt'>Recursive match (<span class='mkup_tt'>\R</span>) and subroutines</dt><dd class='mkup_dd'>Behave as if a copy of the whole regular expression, or of some "named subexpression", occurred at this point when trying to match. Together with <span class='mkup_tt'>|</span>, this is all one needs for a general context-free grammar [<a rel='nofollow' class='mkup_a' href="https://en.wikipedia.org/wiki/Context-free_grammar">L1 <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>] (although the syntax is typically hideous compared to the standard <a class='mkup_a mkup_known' href='/page/BNF'>BNF</a> notation for these), which conversely means any engine capable of these features have to be <i class='mkup_i'>at least as slow</i> (roughly time cubic in input size) as a context-free language parser when handling these, and quite likely far slower for the worst cases. The Tcl regexp engine does not have these features.</dd></dl> <h2 id='6fe7befa7cf9f455e06bd2a496637dae411603e4ee38602ea3fd40ac0464c08e' class='mkup_h1'> How Regular Expression Features are Implemented using Finite Automata </h2><p class='mkup_p'>This is a partial list of tricks. It also assumes some familiarity with finite automata theory, such as knowing what distinguishes an NFA from a DFA, how one runs them, and how one can construct one from the other (all of which is standard material in relevant computer science courses).</p> <h3 id='eb07bde6001cedda8d6ad5e154981eb9a7eeb8108be12da82987a17bcc670698' class='mkup_h2'>Search mode regexps</h3><p class='mkup_p'>Given a match-mode regexp engine, as one would get from running a finite automaton over a string and inspect whether the end state is final, one can run a regular expression <i class='mkup_i'>re</i> in search mode on it by running <span class='mkup_tt'>.*(</span><i class='mkup_i'>re</i><span class='mkup_tt'>).*</span> in match mode.</p><p class='mkup_p'>[Also explain what one can do to <i class='mkup_i'>find</i> the match searched for, without going for a full-blown submatch-capable engine.]</p> <h3 id='dce6c589227eda7f3ef1322740725ee4c3fb1f2bc4bc2bfc34c3b42b4b9ac4ff' class='mkup_h2'> Submatch capturing </h3><p class='mkup_p'>As usually defined, finite automata can only answer "yes" or "no", so there's no way to get submatch information out of them. </p><p class='mkup_p'>An extension of the formalism (keeping track of positions within the string corresponding to positions in the regexp, as well as the basic automaton state) can be found in <a rel='nofollow' class='mkup_a' href='http://laurikari.net/ville/spire2000-tnfa.ps'>NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a></p> <h3 id='d078d94f6528c50e7e1a5420a56fe2ec27569cf1d2447f9d111945f7581f24c9' class='mkup_h2'> Boolean AND / language intersection </h3><p class='mkup_p'>This is a classic trick.</p><p class='mkup_p'>Given one (ε-free) automaton <i class='mkup_i'>A1</i> for matching the regular expression <i class='mkup_i'>re1</i> and another (ε-free) automaton <i class='mkup_i'>A2</i> for matching the regular expression <i class='mkup_i'>re2</i>, it is straightforward to construct an automaton <i class='mkup_i'>A</i> for <i class='mkup_i'>re1</i> AND <i class='mkup_i'>re2</i> as follows:</p><OL class='mkup_OL'><li class='mkup_li'>If the state set of <i class='mkup_i'>A1</i> is <i class='mkup_i'>S1</i> and the state set of <i class='mkup_i'>A2</i> is <i class='mkup_i'>S2</i>, then the state set of <i class='mkup_i'>A</i> will be <i class='mkup_i'>S1</i> × <i class='mkup_i'>S2</i> (<a class='mkup_a mkup_known' href='/page/Cartesian+product'>cartesian product</a> of the two sets).</li><li class='mkup_li'>There is a transition from (<i class='mkup_i'>u1</i>,<i class='mkup_i'>u2</i>) to (<i class='mkup_i'>v1</i>,<i class='mkup_i'>v2</i>) labelled <i class='mkup_i'>x</i> in <i class='mkup_i'>A</i> if and only if there is a transition from <i class='mkup_i'>u1</i> to <i class='mkup_i'>v1</i> labelled <i class='mkup_i'>x</i> in <i class='mkup_i'>A1</i> and a transition from <i class='mkup_i'>u2</i> to <i class='mkup_i'>v2</i> labelled <i class='mkup_i'>x</i> in <i class='mkup_i'>A2</i>.</li><li class='mkup_li'>A state (<i class='mkup_i'>u</i>,<i class='mkup_i'>v</i>) is final in <i class='mkup_i'>A</i> if and only if <i class='mkup_i'>u</i> is final in <i class='mkup_i'>A1</i> and <i class='mkup_i'>v</i> is final in <i class='mkup_i'>A2</i>.</li><li class='mkup_li'>Similarly, a state (<i class='mkup_i'>u</i>,<i class='mkup_i'>v</i>) is initial in <i class='mkup_i'>A</i> if and only if <i class='mkup_i'>u</i> is initial in <i class='mkup_i'>A1</i> and <i class='mkup_i'>v</i> is initial in <i class='mkup_i'>A2</i>.</li></OL><p class='mkup_p'>What this means in practice is that running <i class='mkup_i'>A</i> is equivalent to running <i class='mkup_i'>A1</i> and <i class='mkup_i'>A2</i> simultaneously; each <i class='mkup_i'>A</i> state is a pair of an <i class='mkup_i'>A1</i> state and an <i class='mkup_i'>A2</i> state. <i class='mkup_i'>A</i> accepts a string only if both <i class='mkup_i'>A1</i> and <i class='mkup_i'>A2</i> would do so.</p> <h3 id='9950ab117254aabebc06e4e404603e881ef9ce76057eac975853154eca4b9a2c' class='mkup_h2'> Boolean NOT / language complementation </h3><p class='mkup_p'>Assuming you've got a DFA for matching a regular language, this is very straightforward: negate the "final" status for every state. States that were previously accepting will then be non-accepting, and vice versa, so strings that were previously accepted will now be rejected (and vice versa).</p> <h3 id='b5bbfa2212f166d216ea7c26f532b375043354d622e42111134ebdd94bd388c0' class='mkup_h2'> Lookahead/-behind constraints </h3><p class='mkup_p'>The basic idea is the same as for boolean AND. First make an automaton for the main RE, then make so many copies of this that you can simultaneously keep track of the state visavi the man RE and the constraint RE, while adjusting the transitions suitably so that you run both in parallel; the projection onto either axis will however still be automata for the main and constraint respectively RE. Finally modify the sets of initial and final states appropriately for the wanted result.</p><p class='mkup_p'>Basically, the projection of the constraint onto the main RE "axis" would be an ε-transition, but it is not parallel to that axis. Rather, it will take you from being debt-free (having all instances of the constraint satisfied, and the string as a whole thus eligible for acceptance) to a debt of 1 match (preventing the string from being accepted), and in order to become free of this debt you will have to work the automaton until reaching a constraint-accepting state again (kind of like an old platform game, where one might fall down a chute into the underground and then painfully having to work one's way up to the surface again before being allowed to finish the game).</p><p class='mkup_p'>What makes this more complicated than the boolean AND is that several instances of the constraint may be relevant at the same time. Consider for example</p><div class='sh_sourceCode'><button class='copybtn btn pull-right' data-clipboard-target='#mkup_code_2' title='Click to copy code snippet to clipboard'><span class='glyphicon glyphicon-copy' aria-hidden='true'></span></button><pre id='mkup_code_2' class='mkup_pre sh_sourceCode'>((?=(...)?a)[abc])*</pre></div><p class='mkup_p'>where each iteration of the outer parenthesis eats one letter, but the constraint is looking for an <span class='mkup_tt'>a</span> up to four characters ahead. It seems the only way to manage that is to run all possibilities in parallel, which means the "constraint axis" of the composite automaton is labelled not by states of an automaton for matching the constraint RE, but by <i class='mkup_i'>sets of such states</i> (like in the subset construction — no wonder the Tcl engine prefers to go hybrid here)!</p><p class='mkup_p'>Also, that construction only takes care of one constraint at a time. In order to remove all constraints, it would be necessary to repeat it as many times as there are constraints in the expression! Luckily, constraint REs encountered in practice (such as those for beginning-of-word, look one character back and forth) tend to require only a very small number of states, so it is in fact feasible to use them, even with a pure automaton engine.</p><h2 id='0fc92bade57dd836ac53aa9e06f71a817a234af8fa3d3f00a9a1270dcc65e669' class='mkup_h1'> Regular expression puzzles </h2><p class='mkup_p'><a class='mkup_a mkup_known' href='/page/AMG'>AMG</a>: I found a couple regular expression puzzles online that are very good for practicing and improving your regular expression skills. Highly recommended. They're not written in Tcl, but I think creating Tcl versions would be a wonderful project.</p><dl class='mkup_dl'><dt class='mkup_dt'>Regex Crossword [<a rel='nofollow' class='mkup_a' href='https://regexcrossword.com/'>L2 <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>]</dt><dd class='mkup_dd'>Fill in a rectangular or hexagonal grid to simultaneously satisfy all the regular expressions</dd><dt class='mkup_dt'>Regex Golf [<a rel='nofollow' class='mkup_a' href='http://regex.alf.nu/'>L3 <span class='glyphicon glyphicon-globe' aria-hidden='true'></span></a>]</dt><dd class='mkup_dd'>Write a regular expression that matches everything in one set and nothing in another set</dd></dl><h2 id='d2867aa1281341e2e3c60281019f5ab8cb5ae297f2cd3f1fe487586bbd979a2b' class='mkup_h1'> Regular expression flavours </h2><p class='mkup_p'>[Old discussion, could do with refactoring.]</p><p class='mkup_p'>Okay then - feel free to add information here on the other RE flavors available in Tcl...</p><p class='mkup_p'><a class='mkup_a mkup_known' href='/page/LES'>LES</a> says that there no other RE flavors available in Tcl. Tcl only uses <a class='mkup_a mkup_known' href='/page/ARE'>ARE</a>. What I meant is that regular expressions may be construed as any one (or all) of its several variations, but <a class='mkup_a mkup_known' href='/page/Regular+Expressions'>Regular Expressions</a> only discusses Tcl's <a class='mkup_a mkup_known' href='/page/ARE'>ARE</a>. I said that because this wiki discusses many things under several contexts, not necessarily that of Tcl, and I thought it would be good to note that, at least in this case, <b class='mkup_b'>it is</b> restricted to the context of Tcl. Anyone interested in a different or more ample discussion of Regular Expressions will have to look elsewhere. E.g. on <a class='mkup_a mkup_known' href='/page/PCRE'>PCRE</a>.</p><hr><div class='mkup_centered'><table class='mkup_categories'><tr><td class='mkup_td'><a class='mkup_a' href='/page/Category+Glossary'>Category Glossary</a></td><td class='mkup_td'><a class='mkup_a' href='/page/Category+Concept'>Category Concept</a></td></tr></table></div> </div> </div> <div class='row'> <div class='col-xs-12'> <div class='Footer'>Updated 2019-11-23 15:44:54</div> </div> </div> </div> <!-- jQuery library --> <script src='https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js'></script> <script src='https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.0/js/jquery.tablesorter.combined.js'></script> <!-- Latest compiled JavaScript --> <script src='https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js'></script> <script type='text/javascript' src='/scripts/nikit.js'></script> <script type='text/javascript' src='/scripts/sh_main.js'></script> <script type='text/javascript' src='/scripts/sh_tcl.js'></script> <script type='text/javascript' src='/scripts/sh_c.js'></script> <script type='text/javascript' src='/scripts/sh_cpp.js'></script> <!-- <script src='https://www.google.com/recaptcha/api.js'></script> --> <script src='https://hcaptcha.com/1/api.js'></script> <script>var clipboard = new ClipboardJS('.copybtn', { text: function(trigger) { return document.querySelector(trigger.getAttribute('data-clipboard-target')).textContent + '\n'; } }); sort_tables(); </script> </body> </html>