Java – Unclosed character class near index nnn

I borrowed a rather complex regular expression from a PHP Textile implementation (open source, properly attributed) for a simple, incomplete Java implementation, textile4j, which I am porting to GitHub and syncing to Maven Central. (I originally wrote the code to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom's dependencies available in Maven Central.)

Unfortunately, the Textile regular expression, although it works in the preg_replace_callback context in PHP, fails in Java with the following exception:

java.util.regex.PatternSyntaxException: Unclosed character class near index 217

The message is clear enough; the solution is elusive.

This is the original multiline regular expression from the PHP implementation:

return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/;]*?)           # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);

With a simple echo, I got the Textile class to show me the fully expanded regular expression it actually uses. The result is the following fairly long expression:

(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))

Using online tools such as RegExr (by gskinner) and RegexPlanet, I found several areas that could lead to parsing errors.

I suspect a range issue hidden in one of the character classes, or a Unicode ordering issue hiding somewhere, but I can’t find it.

Any ideas?

I’m also curious why PHP doesn’t throw a similar error. For example, RegExr flagged a “passive subexpression” (a non-capturing group) as mishandled, but changing it, as shown below, neither fixes the Java exception nor changes the behavior in PHP.

Swapping the escaped parenthesis in $title:

        (?:\(([^)]+?)\)(?="))? # $title
        ...^
        (?:(\([^)]+?)\)(?="))? # $title
        ....^

Thank you
Tim

EDIT: Added the Java string form (with escape characters) of the Textile regular expression, as determined by RegexPlanet:

"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"

Solution

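@CodeJockey is correct: you have a square bracket in one of your character classes that needs to be escaped. []] or [^]] is fine, because the ] is the first character in the class (other than the negating ^), but in Java an unescaped [ anywhere in a character class is a syntax error unless it starts a properly closed nested class. Here the offending class is [{[] near the start of the expression: Java reads the inner [ as the start of a nested class, so the outer class is never closed.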
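To see the problem in isolation, here is a minimal sketch that compiles the offending character class with and without the escape. The class name CharClassCheck and the helper compiles are made up for this illustration; only Pattern and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, not part of textile4j.
class CharClassCheck {
    /** Returns true if java.util.regex accepts the pattern, false otherwise. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() say what went wrong and where.
            System.out.println(e.getDescription() + " near index "
                    + e.getIndex() + " in " + regex);
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped '[' inside the class opens a nested class that never closes.
        System.out.println(compiles("[{[]"));              // false
        // Escaping the inner bracket turns it into a literal '['.
        System.out.println(compiles("[{\\[]"));            // true
        System.out.println(Pattern.matches("[{\\[]", "[")); // true
    }
}
```

Running the same check on sub-expressions of a long pattern is one way to narrow down which character class triggers the exception.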

Ironically, the original regular expression contains many backslashes that are not even needed in PHP. It also escapes / because that character is used as the regex delimiter. After cleaning all of that up, I came up with this Java regular expression:

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

I can’t say whether it’s the best regular expression for the job, since I don’t know how it’s being used.
