Python problem with dashes in web scraping

I have a simple script that grabs a link from Google and then crawls that link. However, some links contain dashes, and for some reason the dash shows up in my script (in the URL) as %25E2%2580%2593. So it now looks like this: http://myaddress.com/search?q=The_%25E2%2580%2593_World, when I want it to look like this: http://myaddress.com/search?q=The_–_World. What should I do about it? Should I encode/decode with UTF-8?

Edit:
I tried unquoting twice (as suggested in this link), but to no avail. Instead, I get the following result: http://myaddress.com/search?q=The_–_World.

Solution

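The URL appears to be double URL-encoded: the en dash was first percent-encoded as its UTF-8 bytes (%E2%80%93), and then each % was encoded again as %25, giving %25E2%2580%2593.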
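As a minimal sketch of how such a string can arise (assuming Python 3 and the standard urllib.parse.quote function; the variable names are only illustrative), quoting an en dash twice reproduces the sequence from the question:

```python
from urllib.parse import quote

dash = "\u2013"  # the en dash from the desired URL

# First pass: the dash becomes its percent-encoded UTF-8 bytes.
once = quote(dash)

# Second pass: every '%' from the first pass is itself encoded as '%25'.
twice = quote(once)

print(once)
print(twice)
```

The second value matches the %25E2%2580%2593 seen in the scraped URL, which is why a single decoding pass is not enough.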

To recover the original form, URL-decode it twice with the unquote function from urllib.parse:

import urllib.parse

url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
# Unquote twice: the first pass turns %25 into %, the second decodes %E2%80%93.
print(urllib.parse.unquote(urllib.parse.unquote(url)))

This decodes to the desired URL: http://myaddress.com/search?q=The_–_World
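If the number of encoding layers is not known in advance, one option is to keep unquoting until the string stops changing. The following is only a sketch under that assumption: fully_unquote and max_rounds are hypothetical names, not part of urllib, and the approach would over-decode a URL that is meant to contain a literal %25.

```python
from urllib.parse import unquote


def fully_unquote(url, max_rounds=5):
    """Hypothetical helper: unquote repeatedly until the string is stable."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            # Nothing left to decode.
            break
        url = decoded
    return url


result = fully_unquote("http://myaddress.com/search?q=The_%25E2%2580%2593_World")
print(result)
```

For the URL in the question this stops after the second pass changes nothing further, so it behaves like the explicit double unquote above.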

Edit:

Since you mentioned that you are using Python 2.7, the equivalent decoding function there is urllib.unquote(url) (see the documentation).

import urllib

url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
# In Python 2, unquote returns a byte string, so decode it as UTF-8 before printing.
print(urllib.unquote(urllib.unquote(url)).decode('utf-8'))

Output:

http://myaddress.com/search?q=The_–_World
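The explicit UTF-8 decode matters because the percent-escapes stand for raw bytes. As an illustration in Python 3 (a sketch using urllib.parse.unquote_to_bytes; the variable names are made up for the example), decoding those bytes with the wrong codec gives garbled characters instead of the dash:

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields the raw UTF-8 bytes of the en dash: E2 80 93.
raw = unquote_to_bytes("The_%E2%80%93_World")

good = raw.decode("utf-8")    # the three bytes become one en dash
bad = raw.decode("latin-1")   # each byte becomes its own character

print(repr(good))
print(repr(bad))
```

If the output shows stray characters where the dash should be, the bytes were most likely decoded with a codec other than UTF-8.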
