Python regular expressions match 2 different delimiters
I’m trying to make a regular expression that matches something like:
[[uid::Page name|page alias]].
For example:
[[nw::Home|Home]].
Both the uid and the page alias are optional.
I want the delimiters :: and | to appear only once each, and only in the order shown. However, a single : character should be allowed anywhere after the uid. That’s the problem.
The following regular expression works well enough, except that it also matches strings where :: appears twice or appears in the wrong place:
import re
regex = r'\[\[([\w]+::)?([^|\t\n\r\f\v]+)(\|[^|\t\n\r\f\v]+)?\]\]'
re.match(regex, '[[Home]]') # matches, good
re.match(regex, '[[Home| Home page]]') # matches, good
re.match(regex, '[[nw::Home]]') # matches, good
re.match(regex, '[[nw::Home| Home page]]') # matches, good
re.match(regex, '[[nw| Home| Home page]]') # doesn't match, good
re.match(regex, '[[nw| Home::Home page]]') # matches, bad
re.match(regex, '[[nw::Home::Home page]]') # matches, bad
I’ve read all about negative lookahead and lookbehind expressions, but I can’t figure out how to apply them in this case. Any suggestions would be appreciated.
EDIT: I would also like to know how to keep the delimiters out of the match results, which currently look like this:
('nw::', 'Home', '| Home page').
Solution
If I understand your needs correctly, you can use this (named groups are written with Python’s (?P<name>...) syntax):
\[\[(?:(?P<uid>\w+)::)?(?!.*::)(?P<page>[^|\t\n\r\f\v]+)(?:\|(?P<alias>[^|\t\n\r\f\v]+))?\]\]
                       ^^^^^^^^
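I added a negative lookahead, (?!.*::), after the uid capture. It makes the match fail if :: appears anywhere later in the string, while a single : is still allowed. The delimiters sit inside non-capturing groups, (?:...), so they are not included in the captured results.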
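As a rough sketch of how the named-group pattern could be used from Python: the name LINK_RE and the sample inputs below are only illustrative, and the pattern is split across several string literals purely so each part can carry a comment.

```python
import re

# Named-group version of the pattern. Python's re module spells named
# groups (?P<name>...). The (?:...) wrappers are non-capturing, so the
# "::" and "|" delimiters never show up in the results.
# LINK_RE is a hypothetical name chosen for this sketch.
LINK_RE = re.compile(
    r'\[\[(?:(?P<uid>\w+)::)?'             # optional uid, "::" not captured
    r'(?!.*::)'                            # no further "::" may follow
    r'(?P<page>[^|\t\n\r\f\v]+)'           # page name
    r'(?:\|(?P<alias>[^|\t\n\r\f\v]+))?'   # optional alias, "|" not captured
    r'\]\]'
)

m = LINK_RE.match('[[nw::Home|Home page]]')
print(m.groupdict())
# {'uid': 'nw', 'page': 'Home', 'alias': 'Home page'}

# A single ":" after the uid is still allowed.
print(LINK_RE.match('[[nw::Home: intro]]').groupdict())
# {'uid': 'nw', 'page': 'Home: intro', 'alias': None}

# A second "::" makes the lookahead fail, so there is no match.
print(LINK_RE.match('[[nw::Home::Home page]]'))
# None
```

Groups that did not take part in the match (a missing uid or alias) come back as None.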
I’ve named the capture groups, but if you don’t want that, here is the same pattern with unnamed groups:
\[\[(?:(\w+)::)?(?!.*::)([^|\t\n\r\f\v]+)(?:\|([^|\t\n\r\f\v]+))?\]\]
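To check the unnamed version against the seven inputs from the question, something like the following could be used. The names pattern, cases and results are made up for this sketch, and the "wanted" flags simply restate the good/bad comments from the question.

```python
import re

# Unnamed-group version of the pattern from the answer.
pattern = re.compile(
    r'\[\[(?:(\w+)::)?(?!.*::)([^|\t\n\r\f\v]+)(?:\|([^|\t\n\r\f\v]+))?\]\]'
)

# The question's inputs, paired with whether a match is wanted.
cases = [
    ('[[Home]]', True),
    ('[[Home| Home page]]', True),
    ('[[nw::Home]]', True),
    ('[[nw::Home| Home page]]', True),
    ('[[nw| Home| Home page]]', False),
    ('[[nw| Home::Home page]]', False),
    ('[[nw::Home::Home page]]', False),
]

results = {}
for text, wanted in cases:
    m = pattern.match(text)
    # groups() holds only uid, page and alias; delimiters are not captured.
    results[text] = m.groups() if m else None
    status = 'ok' if (m is not None) == wanted else 'UNEXPECTED'
    print(f'{status:10} {text!r:32} -> {results[text]}')
```

Because re.match only anchors at the start of the string, text after the closing ]] is ignored; re.fullmatch is an option if the whole string must be a single link.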