<!DOCTYPE html> <html lang="en"><head><meta charset="utf-8"><link rel=canonical href='https://www.regular-expressions.info/backref.html'><title>Regex Tutorial - Backreferences To Match The Same Text Again</title> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="author" content="Jan Goyvaerts"> <meta name="description" content="In a regular expression, parentheses can be used to group regex tokens together and for creating backreferences. Backreferences allow you to reuse part of the regex match in the regex, or in the replacement text."> <meta name="keywords" content=""> <link rel=stylesheet href="regex.css" type="text/css"><script src="theme.js" type="text/javascript"></script><link rel="alternate" type="application/rss+xml" title="New at Regular-Expressions.info" href="updates.xml"> </head> <body bgcolor=white text=black> <div id=top></div> <div id=btntop><div id=btngrid><a href="quickstart.html" target="_top"><div>Quick Start</div></a><a href="tutorial.html" target="_top"><div>Tutorial</div></a><a href="tools.html" target="_top"><div>Tools & Languages</div></a><a href="examples.html" target="_top"><div>Examples</div></a><a href="refflavors.html" target="_top"><div>Reference</div></a><a href="books.html" target="_top"><div>Book Reviews</div></a></div></div> <div id=contents><div id=side> <TABLE CLASS=side CELLSPACING=0 CELLPADDING=4><TR><TD CLASS=sideheader>Regex Tutorial</TD></TR><TR><TD><A HREF="tutorial.html" TARGET=_top>Introduction</A></TD></TR><TR><TD><A HREF="tutorialcnt.html" TARGET=_top>Table of Contents</A></TD></TR><TR><TD><A HREF="characters.html" TARGET=_top>Special Characters</A></TD></TR><TR><TD><A HREF="nonprint.html" TARGET=_top>Non-Printable Characters</A></TD></TR><TR><TD><A HREF="engine.html" TARGET=_top>Regex Engine Internals</A></TD></TR><TR><TD><A HREF="charclass.html" TARGET=_top>Character Classes</A></TD></TR><TR><TD><A HREF="charclasssubtract.html" TARGET=_top>Character Class 
Subtraction</A></TD></TR><TR><TD><A HREF="charclassintersect.html" TARGET=_top>Character Class Intersection</A></TD></TR><TR><TD><A HREF="shorthand.html" TARGET=_top>Shorthand Character Classes</A></TD></TR><TR><TD><A HREF="dot.html" TARGET=_top>Dot</A></TD></TR><TR><TD><A HREF="anchors.html" TARGET=_top>Anchors</A></TD></TR><TR><TD><A HREF="wordboundaries.html" TARGET=_top>Word Boundaries</A></TD></TR><TR><TD><A HREF="alternation.html" TARGET=_top>Alternation</A></TD></TR><TR><TD><A HREF="optional.html" TARGET=_top>Optional Items</A></TD></TR><TR><TD><A HREF="repeat.html" TARGET=_top>Repetition</A></TD></TR><TR><TD><A HREF="brackets.html" TARGET=_top>Grouping & Capturing</A></TD></TR><TR><TD><A HREF="backref.html" TARGET=_top>Backreferences</A></TD></TR><TR><TD><A HREF="backref2.html" TARGET=_top>Backreferences, part 2</A></TD></TR><TR><TD><A HREF="named.html" TARGET=_top>Named Groups</A></TD></TR><TR><TD><A HREF="backrefrel.html" TARGET=_top>Relative Backreferences</A></TD></TR><TR><TD><A HREF="branchreset.html" TARGET=_top>Branch Reset Groups</A></TD></TR><TR><TD><A HREF="freespacing.html" TARGET=_top>Free-Spacing & Comments</A></TD></TR><TR><TD><A HREF="unicode.html" TARGET=_top>Unicode</A></TD></TR><TR><TD><A HREF="modifiers.html" TARGET=_top>Mode Modifiers</A></TD></TR><TR><TD><A HREF="atomic.html" TARGET=_top>Atomic Grouping</A></TD></TR><TR><TD><A HREF="possessive.html" TARGET=_top>Possessive Quantifiers</A></TD></TR><TR><TD><A HREF="lookaround.html" TARGET=_top>Lookahead & Lookbehind</A></TD></TR><TR><TD><A HREF="lookaround2.html" TARGET=_top>Lookaround, part 2</A></TD></TR><TR><TD><A HREF="keep.html" TARGET=_top>Keep Text out of The Match</A></TD></TR><TR><TD><A HREF="conditional.html" TARGET=_top>Conditionals</A></TD></TR><TR><TD><A HREF="balancing.html" TARGET=_top>Balancing Groups</A></TD></TR><TR><TD><A HREF="recurse.html" TARGET=_top>Recursion</A></TD></TR><TR><TD><A HREF="subroutine.html" TARGET=_top>Subroutines</A></TD></TR><TR><TD><A 
HREF="recurseinfinite.html" TARGET=_top>Infinite Recursion</A></TD></TR><TR><TD><A HREF="recurserepeat.html" TARGET=_top>Recursion & Quantifiers</A></TD></TR><TR><TD><A HREF="recursecapture.html" TARGET=_top>Recursion & Capturing</A></TD></TR><TR><TD><A HREF="recursebackref.html" TARGET=_top>Recursion & Backreferences</A></TD></TR><TR><TD><A HREF="recursebacktrack.html" TARGET=_top>Recursion & Backtracking</A></TD></TR><TR><TD><A HREF="posixbrackets.html" TARGET=_top>POSIX Bracket Expressions</A></TD></TR><TR><TD><A HREF="zerolength.html" TARGET=_top>Zero-Length Matches</A></TD></TR><TR><TD><A HREF="continue.html" TARGET=_top>Continuing Matches</A></TD></TR> </TABLE><TABLE CLASS=side CELLSPACING=0 CELLPADDING=4><TR><TD CLASS=sideheader>More on This Site</TD></TR><TR><TD><A HREF="index.html" TARGET=_top>Introduction</A></TD></TR><TR><TD><A HREF="quickstart.html" TARGET=_top>Regular Expressions Quick Start</A></TD></TR><TR><TD><A HREF="tutorial.html" TARGET=_top>Regular Expressions Tutorial</A></TD></TR><TR><TD><A HREF="replacetutorial.html" TARGET=_top>Replacement Strings Tutorial</A></TD></TR><TR><TD><A HREF="tools.html" TARGET=_top>Applications and Languages</A></TD></TR><TR><TD><A HREF="examples.html" TARGET=_top>Regular Expressions Examples</A></TD></TR><TR><TD><A HREF="refflavors.html" TARGET=_top>Regular Expressions Reference</A></TD></TR><TR><TD><A HREF="refreplace.html" TARGET=_top>Replacement Strings Reference</A></TD></TR><TR><TD><A HREF="books.html" TARGET=_top>Book Reviews</A></TD></TR><TR><TD><A HREF="print.html" TARGET=_top>Printable PDF</A></TD></TR><TR><TD><A HREF="about.html" TARGET=_top>About This Site</A></TD></TR><TR><TD><A HREF="updates.html" TARGET=_top>RSS Feed & Blog</A></TD></TR></TABLE></DIV><div class=bodytext>
<div class=bulb><h1>Using Backreferences To Match The Same Text Again</h1><script type="text/javascript">showbulb();</script></div> <p>Backreferences match the same text as previously matched by a capturing group. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. 
Here’s how: <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccrange">0-9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT>. This regex contains only one pair of parentheses, which capture the string matched by <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT>. This is the opening HTML tag. (Since HTML tags are case insensitive, this regex requires case insensitive matching.) The backreference <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> (backslash one) references the first capturing group. <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> matches the exact same text that was matched by the first capturing group. The <TT CLASS=syntax><SPAN CLASS="regexplain">/</SPAN></TT> before it is a literal character. 
It is simply the forward slash in the closing HTML tag that we are trying to match.</p> <p>To figure out the number of a particular backreference, scan the regular expression from left to right. Count the opening parentheses of all the numbered capturing groups. The first parenthesis starts backreference number one, the second number two, etc. Skip parentheses that are part of other syntax such as non-capturing groups. This means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.</p> <p>You can reuse the same backreference more than once. <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">a-c</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">x</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">x</SPAN><SPAN CLASS="regexspecial">\1</SPAN></TT> matches <tt class=match>axaxa</tt>, <tt class=match>bxbxb</tt> and <tt class=match>cxcxc</tt>.</p> <p>Most regex flavors support up to 99 capturing groups and double-digit backreferences. 
So <TT CLASS=syntax><SPAN CLASS="regexspecial">\99</SPAN></TT> is a valid backreference if your regex has 99 capturing groups.</p> <h2>Looking Inside The Regex Engine</h2> <p>Let’s see how the regex engine applies the regex <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccrange">0-9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT> to the string <tt class=string>Testing <B><I>bold italic</I></B> text</tt>. The first token in the regex is the literal <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN></TT>. The regex engine traverses the string until it can match at the first <tt class=match><</tt> in the string. The next token is <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccopen">]</SPAN></TT>. The regex engine also takes note that it is now inside the first pair of capturing parentheses. <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccopen">]</SPAN></TT> matches <tt class=match>B</tt>. 
The engine advances to <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN></TT> and <tt class=string>></tt>. This match fails. However, because of the <A HREF="repeat.html" TARGET="_top">star</A>, that’s perfectly fine. The position in the string remains at <tt class=string>></tt>. The <A HREF="wordboundaries.html" TARGET="_top">word boundary</A> <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN></TT> matches at the <tt class=string>></tt> because it is preceded by <tt class=string>B</tt>. The word boundary does not make the engine advance through the string. The position in the regex is advanced to <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN></TT>.</p> <p>This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, <tt class=match>B</tt> is stored.</p> <p>After storing the backreference, the engine proceeds with the match attempt. <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN></TT> does not match <tt class=match>></tt>. Again, because of another star, this is not a problem. The position in the string remains at <tt class=string>></tt>, and position in the regex is advanced to <TT CLASS=syntax><SPAN CLASS="regexccliteral">></SPAN></TT>. These obviously match. The next token is a dot, repeated by a lazy star. 
Because of the laziness, the regex engine initially skips this token, taking note that it should backtrack in case the remainder of the regex fails.</p> <p>The engine has now arrived at the second <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN></TT> in the regex, and the second <tt class=string><</tt> in the string. These match. The next token is <TT CLASS=syntax><SPAN CLASS="regexplain">/</SPAN></TT>. This does not match <tt class=string>I</tt>, and the engine is forced to backtrack to the dot. The dot matches the second <tt class=match><</tt> in the string. The star is still lazy, so the engine again takes note of the available backtracking position and advances to <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN></TT> and <tt class=string>I</tt>. These do not match, so the engine again backtracks.</p> <p>The backtracking continues until the dot has consumed <tt class=match><I>bold italic</tt>. At this point, <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN></TT> matches the third <tt class=match><</tt> in the string, and the next token is <TT CLASS=syntax><SPAN CLASS="regexplain">/</SPAN></TT> which matches <tt class=match>/</tt>. The next token is <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT>. Note that the token is the backreference, and not <TT CLASS=syntax><SPAN CLASS="regexplain">B</SPAN></TT>. The engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it reads the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the second time at <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT>, the new value stored in the first backreference would be used. But this did not happen here, so <tt class=match>B</tt> it is. 
This fails to match at <tt class=string>I</tt>, so the engine backtracks again, and the dot consumes the third <tt class=string><</tt> in the string.</p> <p>Backtracking continues again until the dot has consumed <tt class=match><I>bold italic</I></tt>. At this point, <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN></TT> matches <tt class=match><</tt> and <TT CLASS=syntax><SPAN CLASS="regexplain">/</SPAN></TT> matches <tt class=match>/</tt>. The engine arrives again at <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT>. The backreference still holds <tt class=match>B</tt>. <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> matches <tt class=match>B</tt>. The last token in the regex, <TT CLASS=syntax><SPAN CLASS="regexccliteral">></SPAN></TT> matches <tt class=match>></tt>. A complete match has been found: <tt class=match><B><I>bold italic</I></B></tt>.</p> <h2>Backtracking Into Capturing Groups</h2> <p>You may have wondered about the word boundary <TT CLASS=code><SPAN CLASS="regexspecial">\b</SPAN></TT> in the <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccrange">0-9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT> mentioned above. 
This is to make sure the regex won’t match incorrectly paired tags such as <tt class=string><boo>bold</b></tt>. You may think that cannot happen because the capturing group matches <tt class=match>boo</tt> which causes <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> to try to match the same, and fail. That is indeed what happens. But then the regex engine backtracks.</p> <p>Let’s take the regex <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccrange">0-9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT> without the word boundary and look inside the regex engine at the point where <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> fails the first time. First, <TT CLASS=syntax><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN></TT> continues to expand until it has reached the end of the string, and <TT CLASS=syntax><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT> has failed to match each time <TT CLASS=syntax><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN></TT> matched one more character.</p> <p>Then the regex engine backtracks into the capturing group. 
<TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> has matched <tt class=match>oo</tt>, but would just as happily match <tt class=match>o</tt> or nothing at all. When backtracking, <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> has now matched <tt class=match>bo</tt>, that is what is stored into the capturing group, overwriting <tt class=match>boo</tt> that was stored before. <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> matches the second <tt class=match>o</tt> in the opening tag. 
<TT CLASS=syntax><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN></TT> matches <tt class=match>>bold</</tt>. <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> fails again.</p> <p>The regex engine repeats the same backtracking once more, until <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> is forced to give up another character, causing it to match nothing, which the <A HREF="repeat.html" TARGET="_top">star</A> allows. The capturing group now stores just <tt class=match>b</tt>. <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> now matches <tt class=match>oo</tt>. <TT CLASS=syntax><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN></TT> once again matches <tt class=match>>bold</</tt>. <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT> now succeeds, as does <TT CLASS=syntax><SPAN CLASS="regexccliteral">></SPAN></TT>, and an overall match is found. But it is not the one we wanted.</p> <p>There are several solutions to this. One is to use the word boundary. 
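</p> <p>The effect of the word boundary can be checked with a short script. What follows is only a sketch, assuming Python's <TT CLASS=code>re</TT> module (this page is not specific to Python) and the case insensitive matching mentioned earlier. The names <TT CLASS=code>WITHOUT_BOUNDARY</TT>, <TT CLASS=code>WITH_BOUNDARY</TT> and <TT CLASS=code>subject</TT> are illustrative only.</p>

```python
import re

# Assumption: Python's re module; re.IGNORECASE because HTML tag names
# are case insensitive and the character classes only list A-Z.
WITHOUT_BOUNDARY = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)
WITH_BOUNDARY = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)

# Incorrectly paired tags from the discussion above.
subject = "<boo>bold</b>"

loose = WITHOUT_BOUNDARY.search(subject)
strict = WITH_BOUNDARY.search(subject)

# Without \b the engine backtracks into the group until it holds only "b".
print(loose.group(0))  # <boo>bold</b>
print(loose.group(1))  # b

# With \b the group cannot end in the middle of "boo", so nothing matches.
print(strict)          # None
```

<p>The unwanted match reports <tt class=match>b</tt> as the captured tag name even though the opening tag is <tt class=string>boo</tt>, which is the symptom to look for when a backreference seems to match too much. 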
When <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> backtracks the first time, reducing the capturing group to <tt class=match>bo</tt>, <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN></TT> fails to match between <tt class=string>o</tt> and <tt class=string>o</tt>. This forces <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> to backtrack again immediately. The capturing group is reduced to <tt class=match>b</tt> and the word boundary fails between <tt class=string>b</tt> and <tt class=string>o</tt>. There are no further backtracking positions, so the whole match attempt fails.</p> <p>The reason we need the word boundary is that we’re using <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">^</SPAN><SPAN CLASS="regexccliteral">></SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> to skip over any attributes in the tag. 
If your paired tags never have any attributes, you can leave that out, and use <TT CLASS=syntax><SPAN CLASS="regexplain"><</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A-Z</SPAN><SPAN CLASS="regexccrange">0-9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">></SPAN><SPAN CLASS="regexspecial">.</SPAN><SPAN CLASS="regexspecial">*</SPAN><SPAN CLASS="regexspecial">?</SPAN><SPAN CLASS="regexplain"></</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexplain">></SPAN></TT>. Each time <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccrange">A</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">Z</SPAN><SPAN CLASS="regexccrange">0</SPAN><SPAN CLASS="regexccrange">-</SPAN><SPAN CLASS="regexccrange">9</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">*</SPAN></TT> backtracks, the <TT CLASS=syntax><SPAN CLASS="regexplain">></SPAN></TT> that follows it fails to match, quickly ending the match attempt.</p> <p>If you don’t want the regex engine to backtrack into capturing groups, you can use an atomic group. The tutorial section on <A HREF="atomic.html" TARGET="_top">atomic grouping</A> has all the details.</p> <a name="repeat"></a><h2>Repetition and Backreferences</h2> <p>As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. 
There is a <A HREF="captureall.html" TARGET="_top">clear difference</A> between <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccliteral">abc</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest1">)</SPAN></TT> and <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccliteral">abc</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">+</SPAN></TT>. Though both successfully match <tt class=match>cab</tt>, the first regex will put <tt class=match>cab</tt> into the first backreference, while the second regex will only store <tt class=match>b</tt>. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, <tt class=match>c</tt> was stored. The second time, <tt class=match>a</tt>, and the third time <tt class=match>b</tt>. Each time, the previous value was overwritten, so <tt class=match>b</tt> remains.</p> <p>This also means that <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccliteral">abc</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexplain">=</SPAN><SPAN CLASS="regexspecial">\1</SPAN></TT> will match <tt class=match>cab=cab</tt>, and that <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccliteral">abc</SPAN><SPAN CLASS="regexccopen">]</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexplain">=</SPAN><SPAN CLASS="regexspecial">\1</SPAN></TT> will not. The reason is that when the engine arrives at <TT CLASS=syntax><SPAN CLASS="regexspecial">\1</SPAN></TT>, it holds <TT CLASS=syntax><SPAN CLASS="regexplain">b</SPAN></TT> which fails to match <tt class=string>c</tt>. 
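</p> <p>Both claims are easy to verify. The following is a minimal sketch, assuming Python's <TT CLASS=code>re</TT> module, which overwrites a repeated group's capture in the way described here. The variable names are illustrative, and <TT CLASS=code>re.fullmatch</TT> is used only so that each regex has to account for the whole test string.</p>

```python
import re

# ([abc]+) repeats the character class inside the group,
# so the group captures the whole run of characters.
whole_run = re.fullmatch(r"([abc]+)", "cab")

# ([abc])+ repeats the group itself; every iteration overwrites
# the previous capture, so only the last character remains.
last_only = re.fullmatch(r"([abc])+", "cab")

print(whole_run.group(1))  # cab
print(last_only.group(1))  # b

# The backreference sees whatever the group holds when \1 is reached.
print(re.fullmatch(r"([abc]+)=\1", "cab=cab") is not None)  # True
print(re.fullmatch(r"([abc])+=\1", "cab=cab") is not None)  # False
print(re.fullmatch(r"([abc])+=\1", "cab=b") is not None)    # True
```

<p>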
Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.</p> <h2>Useful Example: Checking for Doubled Words</h2> <p>When editing text, doubled words such as “the the” easily creep in. Using the regex <TT CLASS=syntax><SPAN CLASS="regexspecial">\b</SPAN><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexspecial">\w</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexspecial">\s</SPAN><SPAN CLASS="regexspecial">+</SPAN><SPAN CLASS="regexspecial">\1</SPAN><SPAN CLASS="regexspecial">\b</SPAN></TT> in your <A HREF="editpadpro.html" TARGET="_top">text editor</A>, you can easily find them. To delete the second word, simply type in <tt class=string>\1</tt> as the replacement text and click the Replace button.</p> <h3>Parentheses and Backreferences Cannot Be Used Inside Character Classes</h3> <p>Parentheses cannot be used inside <A HREF="charclass.html" TARGET="_top">character classes</A>, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex <TT CLASS=syntax><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccliteral">(a)b</SPAN><SPAN CLASS="regexccopen">]</SPAN></TT> matches <tt class=match>a</tt>, <tt class=match>b</tt>, <tt class=match>(</tt>, and <tt class=match>)</tt>.</p> <p>Backreferences, too, cannot be used inside a character class. The \1 in a regex like <TT CLASS=syntax><SPAN CLASS="regexnest1">(</SPAN><SPAN CLASS="regexplain">a</SPAN><SPAN CLASS="regexnest1">)</SPAN><SPAN CLASS="regexccopen">[</SPAN><SPAN CLASS="regexccspecial">\1</SPAN><SPAN CLASS="regexccliteral">b</SPAN><SPAN CLASS="regexccopen">]</SPAN></TT> is either an error or a needlessly escaped literal 1. 
In <A HREF="javascript.html" TARGET="_top">JavaScript</A> it’s an <a href="nonprint.html#octal">octal escape</a>.</p> <div id=copyright> <P CLASS=copyright>Page URL: <A HREF="https://www.regular-expressions.info/backref.html" TARGET="_top">https://www.regular-expressions.info/backref.html</A><BR> Page last updated: 12 August 2021<BR> Site last updated: 06 November 2024<BR> Copyright © 2003-2024 Jan Goyvaerts. All rights reserved.</P> </div> </div> </div> </body></html>