Unclosed character class near index nnn
I'm borrowing a rather complex regular expression from a PHP Textile implementation (open source, properly attributed) for textile4j, a simple, not yet feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom's dependencies available in Maven Central.)
Unfortunately, the Textile regular expressions, although they work in PHP in the context of preg_replace_callback, fail in Java with the following exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 217
The statement is obvious, the solution is elusive.
This is the original multiline regular expression from the PHP implementation:
return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/; ]*?)          # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);
To see the full expression, I had the Textile class “show me the code” by echoing the assembled pattern. The result is the following rather long regular expression:
(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/; ]*?)([\]}]|(?=\s|$|\)))
Using online tools such as RegExr (by gskinner) and RegexPlanet, I found a couple of areas that could be causing the parsing error.
I suspect a range issue is buried in one of the character classes, or a Unicode ordering issue somewhere, but I can’t find it.
Any ideas?
I’m also curious why PHP doesn’t throw a similar error. For example, RegExr flagged one “passive subexpression” (a non-capturing group) as handled incorrectly, but changing it neither fixed the Java exception nor altered the behavior in PHP. The change is shown below.
In the $title line, swap the escaped parenthesis and the opening parenthesis of the group:
(?:\(([^)]+?)\)(?="))? # $title
...^
(?:(\([^)]+?)\)(?="))? # $title
....^
Thank you
Tim
EDIT: Adding the Java string form (with escapes) of the Textile regular expression, as determined by RegexPlanet:
"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/; ]*?)([\\]}]|(?=\\s|$|\\)))"
Solution
@CodeJockey is correct: you have a square bracket in one of your character classes that needs to be escaped. []] or [^]] is fine, because the ] is the first character in the class (not counting the negating ^), but in Java an unescaped [ inside a character class opens a nested class (Java's union syntax), so a [ that is meant literally leaves the class unclosed and is a syntax error.
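The bracket rules described above can be checked directly. The following is a minimal sketch, not part of the original code; the class name BracketDemo and the helper compiles are made up for illustration, and the four patterns are reduced test cases rather than the full Textile regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only to illustrate the bracket rules.
class BracketDemo {
    /** Returns true if java.util.regex accepts the pattern, false on a syntax error. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() report what went wrong and where
            System.out.println(regex + " -> " + e.getDescription()
                    + " near index " + e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[]]"));    // ']' first in the class is a literal
        System.out.println(compiles("[^]]"));   // ']' right after '^' is a literal
        System.out.println(compiles("[{[]"));   // unescaped '[' opens a nested class that never closes
        System.out.println(compiles("[{\\[]")); // escaped '[' is a literal
    }
}
```

Only the third pattern, which is the $pre character class as it appears in the PHP original, should be rejected.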
Ironically, the original regular expression contains many backslashes that aren’t needed even in PHP. It also escapes / because that is used as the regex delimiter. After weeding all of those out, I came up with this Java regular expression:
"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/; ]*?)([]}]|(?=\\s|$|\\)))"
Whether it’s the best regular expression, I can’t say, since I don’t know how it’s being used.
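As a quick sanity check, here is a minimal sketch that compiles the cleaned-up regex and runs it against one Textile-style link. The class name TextileLinkDemo, the method parseLink, and the sample input are assumptions made for illustration; how textile4j actually applies the pattern is not known here. The string is split across lines only for readability, with the group names taken from the comments in the PHP source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of textile4j.
class TextileLinkDemo {
    static final Pattern LINK = Pattern.compile(
          "(^|(?<=[\\s>.(])|[{\\[])"                        // group 1: $pre
        + "\""                                              // start
        + "((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])"
        +   "|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)"        // group 2: $atts
        + "([^\"]+?)"                                       // group 3: $text
        + "(?:\\(([^)]+?)\\)(?=\"))?"                       // group 4: $title
        + "\":"
        + "([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)"   // group 5: $url
        + "(/)?"                                            // group 6: $slash
        + "([^\\w/; ]*?)"                                   // group 7: $post
        + "([]}]|(?=\\s|$|\\)))");                          // group 8: closing bracket or lookahead

    /** Returns {text, url} for the first Textile link found, or null if there is none. */
    static String[] parseLink(String input) {
        Matcher m = LINK.matcher(input);
        return m.find() ? new String[] { m.group(3), m.group(5) } : null;
    }

    public static void main(String[] args) {
        // Made-up sample input: "link text":http://example.com
        String[] link = parseLink("\"link text\":http://example.com");
        System.out.println(link[0]); // link text
        System.out.println(link[1]); // http://example.com
    }
}
```

This only shows that the pattern compiles and captures the text and URL of a simple link; it says nothing about attributes, titles, or trailing punctuation.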