Java – Regular expressions take 25 seconds to find a URL in HTML in Java/Android

Regular expressions take 25 seconds to find a URL in HTML in Java/Android… here is a solution to the problem.

Regular expressions take 25 seconds to find a URL in HTML in Java/Android

In Android/Java, given the HTML source code for a website, I want to extract all the XML and CSV file paths.

What I’m doing (using RegEx) is this:

final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
        "[-a-zA-Z0-9+&@#/%?=~_|!:,.; ]*[-a-zA-Z0-9+&@#/%=~_|]. (xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
    urls.add(makeAbsoluteURL(url.group(0)));
}

public String makeAbsoluteURL(String url) {
    if (url.startsWith("http://") || url.startsWith("http://")) {
        return url;
    }
    else if (url.startsWith("/")) {
        return mRootURL+url.substring(1);
    }
    else {
        return mBaseURL+url;
    }
}

Unfortunately, for

a normal-length normal website, this runs for about 25 seconds. What went wrong? Is my RegEx bad? Or is RegEx too slow?

Can I find URLs faster without RegEx?

Edit:

The source of valid characters is (roughly) this answer. Provides a broader character set for all remaining characters. This is the intent.

Solution

Your regular expression is written in such a way that it is slower for long input.
The * operator is greedy.

For example, enter:
http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android.xml

[-

a-zA-Z0-9+&@#/%?=~_|!:,.; The ]* part consumes the entire string. It will then try to match the next character set, which will fail (because the entire string is consumed). It will then backtrace one character in the match of the first part of the regular expression and try to match the second group of characters again. It will match. Then it will try to match the dots and fail because the entire string is consumed. Another backtracking etc….

Essentially, your regular expression forces a lot of backtracking to match anything. It also wastes a lot of time on races where there is no way to succeed.

For the word

forest, it consumes the entire word first in the first part of the expression, and then backtracks it repeatedly after it fails to match the rest of the expression. Waste a lot of time.

Also:

  • The . in the regular expression is unescaped and will match any character.
  • url.group(0) is redundant. url.group() is synonymous

To speed up regular expressions, you’ll need to come up with a way to reduce the amount of backtracking, which can also help if your race starts less generally. Now every word causes the match to start and generally fails. For example, usually in HTML, all links are within 2“. If this is the case, you can start matching with ", which will greatly speed up matching. Try to find a better expression to start with.

Related Problems and Solutions