Python problem with dashes in web scraping
I have a simple script that grabs a link from Google and then crawls that link. However, some links contain dashes, and for some reason the dash shows up in my script (in the URL) as %25E2%2580%2593. So it looks like this now: http://myaddress.com/search?q=The_%25E2%2580%2593_World when I want it to look like this: http://myaddress.com/search?q=The_–_World. What should I do about it? Should I encode/decode with UTF-8?
Edit:
I tried double unquoting (as this link suggests), but to no avail. Instead, I get the following result: http://myaddress.com/search?q=The_–_World.
Solution
The URL appears to be double URL-encoded: the dash was percent-encoded once, and then each % sign was itself encoded as %25.
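To see how a string like %25E2%2580%2593 can arise, here is a minimal sketch in Python 3 that quotes an en dash twice. It assumes the character in the link is U+2013 (which is what the bytes E2 80 93 decode to in UTF-8); the variable names are only for illustration.

```python
from urllib.parse import quote

dash = "\u2013"      # EN DASH, assumed to be the character in the link
once = quote(dash)   # UTF-8 bytes E2 80 93, percent-encoded
twice = quote(once)  # every '%' from the first pass becomes '%25'

print(once)
print(twice)
```

The second value matches the sequence seen in the scraped URL, which is why a single unquote call is not enough.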
To restore the original form, decode the URL twice using the unquote function from urllib.parse:
import urllib.parse
url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
print(urllib.parse.unquote(urllib.parse.unquote(url)))
This decodes to the desired URL, “http://myaddress.com/search?q=The_–_World”.
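If you do not know in advance how many times a link was encoded, one possible approach is to keep unquoting until the string stops changing. The sketch below assumes Python 3; fully_unquote and max_rounds are hypothetical names, not part of urllib.

```python
from urllib.parse import unquote

def fully_unquote(url, max_rounds=5):
    """Hypothetical helper: unquote until the string no longer changes."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            # Nothing left to decode, so stop early.
            break
        url = decoded
    return url

url = "http://myaddress.com/search?q=The_%25E2%2580%2593_World"
result = fully_unquote(url)
print(result)
```

Be careful with this on URLs that legitimately contain an encoded percent sign (%25): repeated decoding would alter them, so a fixed number of unquote calls is safer when the encoding depth is known.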
Edit:
Since you explained that you are using Python 2.7, the equivalent decoding function there is urllib.unquote(url) (see the documentation).
import urllib
url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
# unquote returns a byte string in Python 2, so decode it as UTF-8 before printing
print(urllib.unquote(urllib.unquote(url)).decode('utf-8'))
Output:
http://myaddress.com/search?q=The_–_World