<!DOCTYPE html> <html lang="en"><head><meta charset="utf-8"><link rel=canonical href='https://www.regular-expressions.info/lookaround.html'><title>Regex Tutorial - Lookahead and Lookbehind Zero-Length Assertions</title> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="author" content="Jan Goyvaerts"> <meta name="description" content="Test for a match, or test for failure, without actually consuming any characters."> <meta name="keywords" content=""> <link rel=stylesheet href="regex.css" type="text/css"><script src="theme.js" type="text/javascript"></script><link rel="alternate" type="application/rss+xml" title="New at Regular-Expressions.info" href="updates.xml"> </head> <body bgcolor=white text=black> <div id=top></div> <div id=btntop><div id=btngrid><a href="quickstart.html" target="_top"><div>Quick&nbsp;Start</div></a><a href="tutorial.html" target="_top"><div>Tutorial</div></a><a href="tools.html" target="_top"><div>Tools&nbsp;&amp;&nbsp;Languages</div></a><a href="examples.html" target="_top"><div>Examples</div></a><a href="refflavors.html" target="_top"><div>Reference</div></a><a href="books.html" target="_top"><div>Book&nbsp;Reviews</div></a></div></div> <div id=contents><div id=side> <TABLE CLASS=side CELLSPACING=0 CELLPADDING=4><TR><TD CLASS=sideheader>Regex Tutorial</TD></TR><TR><TD><A HREF="tutorial.html" TARGET=_top>Introduction</A></TD></TR><TR><TD><A HREF="tutorialcnt.html" TARGET=_top>Table of Contents</A></TD></TR><TR><TD><A HREF="characters.html" TARGET=_top>Special Characters</A></TD></TR><TR><TD><A HREF="nonprint.html" TARGET=_top>Non-Printable Characters</A></TD></TR><TR><TD><A HREF="engine.html" TARGET=_top>Regex Engine Internals</A></TD></TR><TR><TD><A HREF="charclass.html" TARGET=_top>Character Classes</A></TD></TR><TR><TD><A HREF="charclasssubtract.html" TARGET=_top>Character Class Subtraction</A></TD></TR><TR><TD><A HREF="charclassintersect.html" TARGET=_top>Character Class
Intersection</A></TD></TR><TR><TD><A HREF="shorthand.html" TARGET=_top>Shorthand Character Classes</A></TD></TR><TR><TD><A HREF="dot.html" TARGET=_top>Dot</A></TD></TR><TR><TD><A HREF="anchors.html" TARGET=_top>Anchors</A></TD></TR><TR><TD><A HREF="wordboundaries.html" TARGET=_top>Word Boundaries</A></TD></TR><TR><TD><A HREF="alternation.html" TARGET=_top>Alternation</A></TD></TR><TR><TD><A HREF="optional.html" TARGET=_top>Optional Items</A></TD></TR><TR><TD><A HREF="repeat.html" TARGET=_top>Repetition</A></TD></TR><TR><TD><A HREF="brackets.html" TARGET=_top>Grouping &amp; Capturing</A></TD></TR><TR><TD><A HREF="backref.html" TARGET=_top>Backreferences</A></TD></TR><TR><TD><A HREF="backref2.html" TARGET=_top>Backreferences, part 2</A></TD></TR><TR><TD><A HREF="named.html" TARGET=_top>Named Groups</A></TD></TR><TR><TD><A HREF="backrefrel.html" TARGET=_top>Relative Backreferences</A></TD></TR><TR><TD><A HREF="branchreset.html" TARGET=_top>Branch Reset Groups</A></TD></TR><TR><TD><A HREF="freespacing.html" TARGET=_top>Free-Spacing &amp; Comments</A></TD></TR><TR><TD><A HREF="unicode.html" TARGET=_top>Unicode</A></TD></TR><TR><TD><A HREF="modifiers.html" TARGET=_top>Mode Modifiers</A></TD></TR><TR><TD><A HREF="atomic.html" TARGET=_top>Atomic Grouping</A></TD></TR><TR><TD><A HREF="possessive.html" TARGET=_top>Possessive Quantifiers</A></TD></TR><TR><TD><A HREF="lookaround.html" TARGET=_top>Lookahead &amp; Lookbehind</A></TD></TR><TR><TD><A HREF="lookaround2.html" TARGET=_top>Lookaround, part 2</A></TD></TR><TR><TD><A HREF="keep.html" TARGET=_top>Keep Text out of The Match</A></TD></TR><TR><TD><A HREF="conditional.html" TARGET=_top>Conditionals</A></TD></TR><TR><TD><A HREF="balancing.html" TARGET=_top>Balancing Groups</A></TD></TR><TR><TD><A HREF="recurse.html" TARGET=_top>Recursion</A></TD></TR><TR><TD><A HREF="subroutine.html" TARGET=_top>Subroutines</A></TD></TR><TR><TD><A HREF="recurseinfinite.html" TARGET=_top>Infinite Recursion</A></TD></TR><TR><TD><A 
HREF="recurserepeat.html" TARGET=_top>Recursion &amp; Quantifiers</A></TD></TR><TR><TD><A HREF="recursecapture.html" TARGET=_top>Recursion &amp; Capturing</A></TD></TR><TR><TD><A HREF="recursebackref.html" TARGET=_top>Recursion &amp; Backreferences</A></TD></TR><TR><TD><A HREF="recursebacktrack.html" TARGET=_top>Recursion &amp; Backtracking</A></TD></TR><TR><TD><A HREF="posixbrackets.html" TARGET=_top>POSIX Bracket Expressions</A></TD></TR><TR><TD><A HREF="zerolength.html" TARGET=_top>Zero-Length Matches</A></TD></TR><TR><TD><A HREF="continue.html" TARGET=_top>Continuing Matches</A></TD></TR> </TABLE><TABLE CLASS=side CELLSPACING=0 CELLPADDING=4><TR><TD CLASS=sideheader>More on This Site</TD></TR><TR><TD><A HREF="index.html" TARGET=_top>Introduction</A></TD></TR><TR><TD><A HREF="quickstart.html" TARGET=_top>Regular Expressions Quick Start</A></TD></TR><TR><TD><A HREF="tutorial.html" TARGET=_top>Regular Expressions Tutorial</A></TD></TR><TR><TD><A HREF="replacetutorial.html" TARGET=_top>Replacement Strings Tutorial</A></TD></TR><TR><TD><A HREF="tools.html" TARGET=_top>Applications and Languages</A></TD></TR><TR><TD><A HREF="examples.html" TARGET=_top>Regular Expressions Examples</A></TD></TR><TR><TD><A HREF="refflavors.html" TARGET=_top>Regular Expressions Reference</A></TD></TR><TR><TD><A HREF="refreplace.html" TARGET=_top>Replacement Strings Reference</A></TD></TR><TR><TD><A HREF="books.html" TARGET=_top>Book Reviews</A></TD></TR><TR><TD><A HREF="print.html" TARGET=_top>Printable PDF</A></TD></TR><TR><TD><A HREF="about.html" TARGET=_top>About This Site</A></TD></TR><TR><TD><A HREF="updates.html" TARGET=_top>RSS Feed &amp; Blog</A></TD></TR></TABLE></DIV><div class=bodytext> <div class=bulb><h1>Lookahead and Lookbehind Zero-Length Assertions</h1><script type="text/javascript">showbulb();</script></div> <p>Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions just like the <A HREF="anchors.html" TARGET="_top">start and end of line</A>, and <A HREF="wordboundaries.html" TARGET="_top">start and end of word</A> anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”.
They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.</p> <a name="lookahead"></a><h2>Positive and Negative Lookahead</h2> <p>Negative lookahead is indispensable if you want to match something not followed by something else. When explaining <A HREF="charclass.html" TARGET="_top">character classes</A>, this tutorial explained why you cannot use a negated character class to match a <tt class=match>q</tt> not followed by a <tt class=string>u</tt>. Negative lookahead provides the solution: <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?!</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT>. The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex <TT CLASS=syntax><SPAN CLASS="regexplain">u</SPAN></TT>.</p> <p>Positive lookahead works just the same. <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT> matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.</p> <p>You can use any regular expression inside the lookahead (but not lookbehind, as explained below). If it contains <A HREF="brackets.html" TARGET="_top">capturing groups</A> then those groups will capture as normal and backreferences to them will work normally, even outside the lookahead. (The only exception is <A HREF="tcl.html" TARGET="_top">Tcl</A>, which treats all groups inside lookahead as non-capturing.)
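</p> <p>As a rough illustration of the lookahead constructs described so far, the sketch below uses Python’s re module, chosen here only as one convenient engine; the tutorial itself is not tied to it, and the variable names are made up for this example. It tries <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?!</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT> and <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT> on <tt class=string>Iraq</tt> and <tt class=string>quit</tt>, and places a capturing group inside a lookahead.</p>
<pre>
```python
import re

# q(?!u): a "q" that is NOT followed by "u" (negative lookahead).
not_followed = re.compile(r"q(?!u)")
# q(?=u): a "q" that IS followed by "u" (positive lookahead).
followed = re.compile(r"q(?=u)")

iraq = not_followed.search("Iraq")      # the final "q" has nothing after it
quit_neg = not_followed.search("quit")  # None: this "q" is followed by "u"
quit_pos = followed.search("quit")      # matches "q" only; "u" is not consumed

# A capturing group inside the lookahead still captures as normal.
captured = re.search(r"q(?=(u))", "quit")

print(iraq.group(0), iraq.span())          # q (3, 4)
print(quit_neg)                            # None
print(quit_pos.group(0), quit_pos.span())  # q (0, 1)
print(captured.group(0), captured.group(1))  # q u
```
</pre>
<p>In each successful case the overall match is only the <tt class=match>q</tt>. The <tt class=string>u</tt> is tested but never becomes part of the match, although the group inside the lookahead still stores it.</p> <p>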
The lookahead itself is not a capturing group. It is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: <TT CLASS=syntax><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexnest2">(</SPAN><SPAN CLASS="regexplain">regex</SPAN><SPAN CLASS="regexnest2">)</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT>. The other way around will not work, because the lookahead will already have discarded the regex match by the time the capturing group is to store its match.</p> <h2>Regex Engine Internals</h2> <p>First, let’s see how the engine applies <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?!</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT> to the string <tt class=string>Iraq</tt>. The first token in the regex is the <A HREF="characters.html" TARGET="_top">literal</A> <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN></TT>. As we already know, this causes the engine to traverse the string until the <tt class=match>q</tt> in the string is matched. The position in the string is now the void after the string. The next token is the lookahead. The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead. So the next token is <TT CLASS=syntax><SPAN CLASS="regexplain">u</SPAN></TT>. This does not match the void after the string. The engine notes that the regex inside the lookahead failed. Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and <tt class=match>q</tt> is returned as the match.</p> <p>Let’s try applying the same regex to <tt class=string>quit</tt>. <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN></TT> matches <tt class=match>q</tt>. 
The next token is the <TT CLASS=syntax><SPAN CLASS="regexplain">u</SPAN></TT> inside the lookahead. The next character is the <tt class=string>u</tt>. These match. The engine advances to the next character: <tt class=string>i</tt>. However, it is done with the regex inside the lookahead. The engine notes success, and discards the regex match. This causes the engine to step back in the string to <tt class=string>u</tt>.</p> <p>Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other permutations of this regex, the engine has to start again at the beginning. Since <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN></TT> cannot match anywhere else, the engine reports failure.</p> <p>Let’s take one more look inside, to make sure you understand the implications of the lookahead. Let’s apply <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">i</SPAN></TT> to <tt class=string>quit</tt>. The lookahead is now positive and is followed by another token. Again, <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN></TT> matches <tt class=match>q</tt> and <TT CLASS=syntax><SPAN CLASS="regexplain">u</SPAN></TT> matches <tt class=match>u</tt>. Again, the match from the lookahead must be discarded, so the engine steps back from <tt class=string>i</tt> in the string to <tt class=string>u</tt>. The lookahead was successful, so the engine continues with <TT CLASS=syntax><SPAN CLASS="regexplain">i</SPAN></TT>. But <TT CLASS=syntax><SPAN CLASS="regexplain">i</SPAN></TT> cannot match <tt class=string>u</tt>. So this match attempt fails. 
All remaining attempts fail as well, because there are no more q’s in the string.</p> <p>The regex <TT CLASS=syntax><SPAN CLASS="regexplain">q</SPAN><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexplain">u</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">i</SPAN></TT> can never match anything. It tries to match <TT CLASS=syntax><SPAN CLASS="regexplain">u</SPAN></TT> and <TT CLASS=syntax><SPAN CLASS="regexplain">i</SPAN></TT> at the same position. If there is a <tt class=string>u</tt> immediately after the <tt class=string>q</tt> then the lookahead succeeds but then <TT CLASS=syntax><SPAN CLASS="regexplain">i</SPAN></TT> fails to match <tt class=string>u</tt>. If there is anything other than a <tt class=string>u</tt> immediately after the <tt class=string>q</tt> then the lookahead fails.</p> <a name="lookbehind"></a><h2>Positive and Negative Lookbehind</h2> <p>Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. <TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;!</SPAN><SPAN CLASS="regexplain">a</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">b</SPAN></TT> matches a “b” that is not preceded by an “a”, using negative lookbehind. It doesn’t match <tt class=string>cab</tt>, but matches the <tt class=match>b</tt> (and only the <tt class=match>b</tt>) in <tt class=string>bed</tt> or <tt class=string>debt</tt>. 
<TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;=</SPAN><SPAN CLASS="regexplain">a</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">b</SPAN></TT> (positive lookbehind) matches the <tt class=match>b</tt> (and only the <tt class=match>b</tt>) in <tt class=match>cab</tt>, but does not match <tt class=string>bed</tt> or <tt class=string>debt</tt>.</p> <p>The construct for positive lookbehind is <TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;=</SPAN><SPAN CLASS="regexplain">text</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT>: a pair of parentheses, with the opening parenthesis followed by a question mark, “less than” symbol, and an equals sign. Negative lookbehind is written as <TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;!</SPAN><SPAN CLASS="regexplain">text</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT>, using an exclamation point instead of an equals sign.</p> <h2>More Regex Engine Internals</h2> <p>Let’s apply <TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;=</SPAN><SPAN CLASS="regexplain">a</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">b</SPAN></TT> to <tt class=string>thingamabob</tt>. The engine starts with the lookbehind and the first character in the string. In this case, the lookbehind tells the engine to step back one character, and see if <TT CLASS=syntax><SPAN CLASS="regexplain">a</SPAN></TT> can be matched there. The engine cannot step back one character because there are no characters before the <tt class=string>t</tt>. So the lookbehind fails, and the engine starts again at the next character, the <tt class=string>h</tt>. (Note that a negative lookbehind would have succeeded here.) Again, the engine temporarily steps back one character to check if an “a” can be found there. It finds a <tt class=string>t</tt>, so the positive lookbehind fails again.</p> <p>The lookbehind continues to fail until the regex reaches the <tt class=string>m</tt> in the string. 
The engine again steps back one character, and notices that the <tt class=match>a</tt> can be matched there. The positive lookbehind matches. Because it is zero-length, the current position in the string remains at the <tt class=string>m</tt>. The next token is <TT CLASS=syntax><SPAN CLASS="regexplain">b</SPAN></TT>, which cannot match here. The next character is the second <tt class=string>a</tt> in the string. The engine steps back, and finds out that the <tt class=string>m</tt> does not match <TT CLASS=syntax><SPAN CLASS="regexplain">a</SPAN></TT>.</p> <p>The next character is the first <tt class=string>b</tt> in the string. The engine steps back and finds out that <tt class=match>a</tt> satisfies the lookbehind. <TT CLASS=syntax><SPAN CLASS="regexplain">b</SPAN></TT> matches <tt class=match>b</tt>, and the entire regex has been matched successfully. It matches one character: the first <tt class=match>b</tt> in the string.</p> <a name="limitbehind"></a><h2>Important Notes About Lookbehind</h2> <p>The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest1">(?&lt;!</SPAN><SPAN CLASS="regexplain">s</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\b</SPAN></TT>. This is definitely not the same as <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">s</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">\b</SPAN></TT>. When applied to <tt class=string>John&apos;s</tt>, the former matches <tt class=match>John</tt> and the latter matches <tt class=match>John&apos;</tt> (including the apostrophe). 
I will leave it up to you to figure out why. (Hint: <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN></TT> matches between the apostrophe and the <tt class=string>s</tt>.) The latter also doesn’t match single-letter words like “a” or “I”. The correct regex without using lookbehind is <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">s</SPAN><SPAN CLASS="regexccspecial">\W</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">\b</SPAN></TT> (star instead of plus, and \W in the character class). Personally, I find the lookbehind easier to understand. The last regex, which works correctly, has a double negation (the \W in the negated character class). Double negations tend to be confusing to humans. Not to regex engines, though. (Except perhaps for Tcl, which treats negated shorthands in negated character classes as an error.)</p> <p>The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.</p> <p>Many regex flavors, including those used by <A HREF="perl.html" TARGET="_top">Perl</A>, <A HREF="python.html" TARGET="_top">Python</A>, and <A HREF="boost.html" TARGET="_top">Boost</A>, only allow fixed-length strings.
You can use <A HREF="characters.html" TARGET="_top">literal text</A>, <a href="nonprint.html#hex">character escapes</a>, <a href="nonprint.html#hex">Unicode escapes</a> other than <TT CLASS=syntax><SPAN CLASS="regexspecial">\X</SPAN></TT>, and <A HREF="charclass.html" TARGET="_top">character classes</A>. You cannot use <A HREF="repeat.html" TARGET="_top">quantifiers</A> or <A HREF="backref.html" TARGET="_top">backreferences</A>. You can use <A HREF="alternation.html" TARGET="_top">alternation</A>, but only if all alternatives have the same length. These flavors evaluate lookbehind by first stepping back through the subject string for as many characters as the lookbehind needs, and then attempting the regex inside the lookbehind from left to right.</p> <p>Perl 5.30 supports variable-length lookbehind as an experimental feature. But there are many cases in which it does not work correctly. So in practice, the above is still true for Perl 5.30.</p> <p><A HREF="pcre.html" TARGET="_top">PCRE</A> is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length. <A HREF="php.html" TARGET="_top">PHP</A>, <A HREF="delphi.html" TARGET="_top">Delphi</A>, <A HREF="rlanguage.html" TARGET="_top">R</A>, and <A HREF="ruby.html" TARGET="_top">Ruby</A> also allow this. Each alternative still has to be fixed-length. Each alternative is treated as a separate fixed-length lookbehind.</p> <p><A HREF="java.html" TARGET="_top">Java</A> takes things a step further by allowing finite repetition. You can use the <A HREF="optional.html" TARGET="_top">question mark</A> and the <A HREF="repeat.html" TARGET="_top">curly braces</A> with the <i>max</i> parameter specified. Java determines the minimum and maximum possible lengths of the lookbehind. 
The lookbehind in the regex <TT CLASS=syntax><SPAN CLASS="regexnest1">(?&lt;!</SPAN><SPAN CLASS="regexplain">a</SPAN><SPAN CLASS="regexplain">b</SPAN><SPAN CLASS="regexspecial">{2,4}</SPAN><SPAN CLASS="regexplain">c</SPAN><SPAN CLASS="regexspecial">{3,5}</SPAN><SPAN CLASS="regexplain">d</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">test</SPAN></TT> has 5 possible lengths. It can be from 7 through 11 characters long. When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of characters (7 in this example) in the string and then evaluates the regex inside the lookbehind as usual, from left to right. If it fails, Java steps back one more character and tries again. If the lookbehind continues to fail, Java continues to step back until the lookbehind either matches or it has stepped back the maximum number of characters (11 in this example). This repeated stepping back through the subject string kills performance when the number of possible lengths of the lookbehind grows. Keep this in mind. Don’t choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers inside lookbehind. Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail when it should succeed in some situations. These bugs were fixed in Java 6.</p> <p>Java 13 allows you to use the <A HREF="repeat.html" TARGET="_top">star</A> and <A HREF="repeat.html" TARGET="_top">plus</A> inside lookbehind, as well as <A HREF="repeat.html" TARGET="_top">curly braces</A> without an upper limit. But Java 13 still uses the laborious method of matching lookbehind introduced with Java 6. Java 13 also does not correctly handle lookbehind with multiple quantifiers if one of them is unbounded. In some situations you may get an error. In other situations you may get incorrect matches. 
So for both correctness and performance, we recommend you only use quantifiers with a low upper bound in lookbehind with Java 6 through 13.</p> <p>The only regex engines that allow you to use a full regular expression inside lookbehind, including infinite repetition and backreferences, are the <A HREF="jgsoft.html" TARGET="_top">JGsoft engine</A> and the <A HREF="dotnet.html" TARGET="_top">.NET RegEx classes</A>. These regex engines really apply the regex inside the lookbehind backwards, going through the regex inside the lookbehind and through the subject string from right to left. They only need to evaluate the lookbehind once, regardless of how many different possible lengths it has.</p> <p>Finally, flavors like <A HREF="stdregex.html" TARGET="_top">std::regex</A> and <A HREF="tcl.html" TARGET="_top">Tcl</A> do not support lookbehind at all, even though they do support lookahead. <A HREF="javascript.html" TARGET="_top">JavaScript</A> was like that for the longest time since its inception. But now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2019), Google’s Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can’t use lookbehind in JavaScript.</p> <h2>Lookaround Is Atomic</h2> <p>The fact that lookaround is zero-length automatically makes it <a href="atomic.html#use">atomic</a>. As soon as the lookaround condition is satisfied, the regex engine forgets about everything inside the lookaround. It will not backtrack inside the lookaround to try different permutations.</p> <p>The only situation in which this makes any difference is when you use <A HREF="brackets.html" TARGET="_top">capturing groups</A> inside the lookaround. 
Since the regex engine does not backtrack into the lookaround, it will not try different permutations of the capturing groups.</p> <p>For this reason, the regex <TT CLASS=syntax><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexnest2">(</SPAN><SPAN CLASS="regexspecial">\d</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest2">)</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexspecial">\1</SPAN></TT> never matches <tt class=string>123x12</tt>. First the lookaround captures <tt class=match>123</tt> into <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT>. <TT CLASS=syntax><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN></TT> then matches the whole string and backtracks until it matches only <tt class=match>1</tt>. Finally, <TT CLASS=syntax><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN></TT> fails since <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> cannot be matched at any position. Now, the regex engine has nothing to backtrack to, and the overall regex fails. The backtracking steps created by <TT CLASS=syntax><SPAN CLASS="regexspecial">\d</SPAN><SPAN CLASS="regexspecial">+</SPAN></TT> have been discarded. It never gets to the point where the lookahead captures only <tt class=string>12</tt>.</p> <p>Obviously, the regex engine does try further positions in the string. If we change the subject string, the regex <TT CLASS=syntax><SPAN CLASS="regexnest1">(?=</SPAN><SPAN CLASS="regexnest2">(</SPAN><SPAN CLASS="regexspecial">\d</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest2">)</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexspecial">\1</SPAN></TT> does match <tt class=match>56x56</tt> in <tt class=string>456x56</tt>.</p> <p>If you don’t use capturing groups inside lookaround, then all this doesn’t matter. 
Either the lookaround condition can be satisfied or it cannot be. In how many ways it can be satisfied is irrelevant.</p> <div id=copyright> <P CLASS=copyright>Page URL: <A HREF="https://www.regular-expressions.info/lookaround.html" TARGET="_top">https://www.regular-expressions.info/lookaround.html</A><BR> Page last updated: 12 August 2021<BR> Site last updated: 06 November 2024<BR> Copyright &copy; 2003-2024 Jan Goyvaerts. All rights reserved.</P> </div> </div> </div> </body></html>
