Regexp with brace matching

Neamar

This page is outdated. See Recursive patterns for a native implementation.

Sometimes you need your regexp to do more than PCRE can provide. For example, if you try to do brace matching with a plain regexp, you'll soon find it's simply impossible (unless you're using Perl 6, in which case you should see this page).

What's this?

First, what's brace matching?
Let's say you want to find all the text between {}, for instance for LaTeX-like support.
So you'll write something like #{(.+)}#U.
If your input is Some {input provided} by the user, everything will be fine and you'll get the expected "input provided".
But let's say your user writes this: Some {input {provided} and encapsulated} by the user. This time, all you'll get is "input {provided", because the ungreedy pattern stops at the first closing brace it meets. That is obviously wrong.
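
To see the failure concretely, here is a minimal sketch that runs the ungreedy pattern from above on both inputs with preg_match (the variable names are only illustrative):

```php
<?php
$pattern = '#{(.+)}#U'; // Ungreedy: (.+) stops at the first "}" it can find.

preg_match($pattern, 'Some {input provided} by the user', $flat);
preg_match($pattern, 'Some {input {provided} and encapsulated} by the user', $nested);

echo $flat[1], "\n";   // input provided
echo $nested[1], "\n"; // input {provided
```

The second capture is cut at the inner closing brace, so the outer group is never seen as a whole.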

Example

This page covers a solution to this problem. With the code below, heavily braced input like this one (which even crashes GeSHi!):

Source code: input.tex
\i{italics and \b{bold \l[URL]{a link \textms{monospace content \big{big content}monospace again}}:)}}

... will be translated to the HTML result you are looking for:

Source code: resultat.html
<em>italics and <strong>bold <a href="#URL">a link <span class="ms">monospace content <big>big content</big>monospace again</span></a>:)</strong></em>

Solution

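The trick is to wrap the default function (PHP's preg_replace in this case, although the solution can be adapted to any language) and to track the braces yourself: whenever a captured group contains an opening brace, the wrapper walks through the subject, counting nesting levels, until it reaches the brace that really closes the group. This solution is deliberately simple and incomplete, but you can build more complex layers of code on top of it once you understand the principle behind it.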

Source code: preg_replace_wb
<?php
function preg_replace_wb($pattern, $replacement, $subject)
{
    // Accept arrays of patterns and replacements, like preg_replace() does: apply each pair in turn.
    if (is_array($pattern)) {
        foreach (array_values($pattern) as $i => $p) {
            $r = is_array($replacement) ? (array_values($replacement)[$i] ?? '') : $replacement;
            $subject = preg_replace_wb($p, $r, $subject);
        }
        return $subject;
    }

    preg_match_all($pattern, $subject, $MatchingString, PREG_SET_ORDER); // $MatchingString now holds all strings matching $pattern.

    $SVGReplacement = $replacement; // Keep the untouched replacement template.
    foreach ($MatchingString as $Result) { // For each result:
        $replacement = $SVGReplacement;
        foreach ($Result as $n => $Match) { // And for each capturing group:
            if ($n > 0 && strpos($Match, '{') !== false) {
                // There is at least one brace in the capture: the regexp stopped too early, so extend the match.
                $InitialMatch = $Match;
                $Start = strpos($subject, $Match, strpos($subject, $Result[0])); // First character of the capture in $subject: for \i{something}, $Start is on the "s".
                $Offset = $Start + 1; // Cursor used to scan $subject for braces.
                $NestingLevel = 0; // How deep are we?

                while ($NestingLevel >= 0) { // Browse the string, looking for braces...
                    $Open = strpos($subject, '{', $Offset); // Find the next opening brace.
                    $Close = strpos($subject, '}', $Offset); // Find the next closing brace.
                    if ($Close !== false && ($Open === false || $Close < $Open)) { // Closest brace is a closing one.
                        $NestingLevel--;
                        $Offset = $Close + 1; // Move the cursor to its new position.
                    } elseif ($Open !== false && ($Close === false || $Open < $Close)) { // Closest brace is an opening one.
                        $NestingLevel++;
                        $Offset = $Open + 1; // Move the cursor to its new position.
                    } else {
                        break; // No brace left... something is wrong!
                    }
                }
                $Offset--; // Step back onto the closing brace.

                if ($NestingLevel >= 0) {
                    exit("Not enough braces, at least one closing brace seems to be missing");
                }

                $Match = substr($subject, $Start, $Offset - $Start); // Replace the regexp match with the real match we just computed.
                $Result[0] = str_replace($InitialMatch, $Match, $Result[0]); // Update the whole match to reflect the extended capture.
            }
            $replacement = str_replace('$' . $n, $Match, $replacement); // Fill in the replacement string.
        }
        $subject = str_replace($Result[0], $replacement, $subject); // Replace in the subject with the computed replacement.
    }
    return $subject;
}
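
The heart of the function above is the nesting counter. As a rough sketch of that idea in isolation (find_closing_brace is a hypothetical helper written for this illustration, not part of the original snippet), one could scan character by character and stop when the depth returns to zero:

```php
<?php
/**
 * Hypothetical helper: return the index of the "}" that closes the "{"
 * located at $openPos, or false if the braces never balance.
 */
function find_closing_brace(string $text, int $openPos)
{
    $depth = 0;
    $length = strlen($text);
    for ($i = $openPos; $i < $length; $i++) {
        if ($text[$i] === '{') {
            $depth++;          // One level deeper.
        } elseif ($text[$i] === '}') {
            $depth--;          // One level shallower.
            if ($depth === 0) {
                return $i;     // Back to the starting level: this is the matching brace.
            }
        }
    }
    return false;              // Ran out of text before the braces balanced.
}

$text = 'Some {input {provided} and encapsulated} by the user';
$open = strpos($text, '{');
$close = find_closing_brace($text, $open);
echo substr($text, $open + 1, $close - $open - 1), "\n"; // input {provided} and encapsulated
```

The function on this page does the same bookkeeping with strpos jumps instead of a per-character loop, which avoids inspecting characters that are not braces.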

Limitation

This piece of code only works if the pattern contains a single pair of braces: a pattern such as #{(.+)}{(.+)}#U will result in unpredictable behavior. If you need to match such cases, you'll have to write your own solution... or improve this one.
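
One possible route for the multi-brace case, assuming a PCRE build with recursion support (the "Recursive patterns" approach mentioned at the top of this page), is to let the pattern itself balance the braces. The sketch below is only an illustration: the \pair macro and the group name "inner" are made up for this example.

```php
<?php
// Hypothetical two-argument macro \pair{...}{...}, used only for illustration.
// (?(DEFINE)...) declares the named group "inner" without matching anything:
// "inner" is any run of non-brace characters and balanced {...} blocks.
$pattern = '#(?(DEFINE)(?<inner>(?:[^{}]++|\{(?&inner)\})*))'
         . '\\\\pair\{((?&inner))\}\{((?&inner))\}#';

$subject = 'See \\pair{a {b}}{c {d} e}.';

// "inner" is group 1, so the two arguments are groups 2 and 3.
$result = preg_replace($pattern, '[$2|$3]', $subject);
echo $result, "\n"; // See [a {b}|c {d} e].
```

Here each argument is captured whole, nested braces included, without any manual scanning of the subject.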

Examples

Source code: example
<?php
echo preg_replace_wb('#\\\\emph{(.+)}#isU','<em>$1</em>','\\emph{Hello {there} !}'); //<em>Hello {there} !</em>

$Balise=array
(
'#\\\\emph{(.+)}#isU'=>'<em>$1</em>',
'#\\\\big{(.+)}#isU'=>'<big>$1</big>',
'#\\\\acronym\[(.+)\]{(.+)}#isU'=>'<acronym title="$1">$2</acronym>',//Acronym. Please note that nested [] are not handled in this article
);
echo preg_replace_wb(array_keys($Balise),array_values($Balise),'The \\emph{\\acronym[Federal Bureau of Investigation]{F.\\big{B}.I.}} killed me.');// Output: The <em><acronym title="Federal Bureau of Investigation">F.<big>B</big>.I.</acronym></em> killed me.

This snippet is used on Le Typographe, a French application for text formatting.

Author
Neamar
Date
2009