Python, BeautifulSoup: finding a link in an HTML fragment
I'm a newbie working through the web scraping examples in Automate the Boring Stuff. I'm trying to download images from phdcomics automatically with Python code that does the following:
1. Find the link to the image in the HTML and download it.
2. Find the link to the previous page in the HTML, go there, and repeat step 1 until the first page is reached.
To download the image on the current page, I printed the output of soup.prettify(); the relevant snippet is as follows:
<meta content="Link to Piled Higher and Deeper" name="description">
<meta content="PHD Comic: Remind me" name="title">
<link
href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
<div class="jumbotron" style="background-color:#52697d; padding: 0em 0em 0em; margin-top:0px; margin-bottom: 0px; background-image: url('http://phdcomics.com/images/bkg_bottom_stuff3.png'); background-repeat: repeat-x; ">
<div align="center" class="container-fluid" style="max-width: 1800px; padding-left: 0px; padding-right:0px; ">
And then when I wrote
newurl=soup.find('link', {'rel': "image_src"}).get('href')
it gave me what I needed, which is
http://www.phdcomics.com/comics/archive/phd041218s.gif
For the next step, finding the link to the previous page, I think it is in the following part of the HTML:
<!-- Comic Table --!>
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="right" valign="top">
<a href=http://phdcomics.com/comics/archive.php?comicid=2004><img height=52 width=49 src=http://phdcomics.com/comics/images/prev_button.gif border=0 align=middle ><br></a><font
face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>previous </b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1995><img src=http://phdcomics.com/comics/images/jump_bck10.gif border=0></a><br><a href=http://phdcomics.com/comics/archive.php?comicid=2000><img src=http://phdcomics.com/comics/images/jump_bck5.gif border=0></a><br><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>jump</b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1><img src=http://phdcomics.com/comics/images/first_button.gif border=0 align=middle><br></a><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>first</b></i></font><br><br> </td>
<td align="center" valign="top"><font color="black">
In this part of the code I am looking for
http://phdcomics.com/comics/archive.php?comicid=2004
as the link to the previous page.
When I try something like this:
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
print(Prevlink)
it gives me an error like this:
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
Even if I try this:
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
I get a similar error:
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
What is the right way to get the correct href?
Solution
The problem is the way comments are written in the PhD Comics HTML.
If you look closely at the output of soup.prettify(), you'll notice comments like this:
<!-- Comic Table --!>
when they should be written like this:
<!-- Comic Table -->
Because the comment is not closed properly, the parser can treat the markup that follows as part of the comment, so BeautifulSoup misses certain tags. There are several ways to parse and remove comments, such as regular expressions or BeautifulSoup's Comment class, but they can be difficult to get working in this case. The easiest way is to fix the comment terminators after fetching the HTML and before parsing it.
from bs4 import BeautifulSoup
import requests

url = "https://phdcomics.com/"
r = requests.get(url)
data = r.text
data = data.replace("--!>", "-->")  # fix the malformed comment terminators
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
Prevlink = soup.find('a', {'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
Output:
http://phdcomics.com/comics/archive.php?comicid=2004
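The same repair can be tried without touching the network. The following is only a minimal sketch: SAMPLE_HTML is a shortened stand-in modeled on the markup in the question (not the real page), and find_href is a hypothetical helper that returns None instead of raising AttributeError when nothing matches.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page markup (an assumption, not the real page).
SAMPLE_HTML = """
<!-- Comic Table --!>
<table><tr><td>
<a href="http://phdcomics.com/comics/archive.php?comicid=2004"><img src="http://phdcomics.com/comics/images/prev_button.gif"></a>
</td></tr></table>
<!-- End Comic Table --!>
"""


def find_href(html, href):
    """Return the href of the first <a> tag with this href, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("a", {"href": href})
    # soup.find returns None when nothing matches, so check before .get
    return tag.get("href") if tag is not None else None


target = "http://phdcomics.com/comics/archive.php?comicid=2004"
fixed_html = SAMPLE_HTML.replace("--!>", "-->")  # repair the comment terminators
print(find_href(fixed_html, target))
```

Checking for None before calling .get is what avoids the 'NoneType' object has no attribute 'get' error from the question when a tag cannot be found.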
Update:
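To find the link automatically instead of hard-coding it, look for the img tag whose src is http://phdcomics.com/comics/images/prev_button.gif and read the href of its parent a tag. Note that the first attempt in the question also fails for a second reason: src is an attribute of the img tag, not of the a tag, so searching for an a tag with that src can never match.
img_tag = soup.find('img', {'src': 'http://phdcomics.com/comics/images/prev_button.gif'})
print(img_tag.find_parent().get('href'))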
http://phdcomics.com/comics/archive.php?comicid=2005
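Putting the pieces together, the two steps from the question (get the image link, then follow the previous link) could be sketched roughly as below. This is an illustration under assumptions, not a tested scraper for the live site: parse_page, walk_back, the fetch parameter, the limit parameter, and the FAKE_PAGES data are all hypothetical names introduced here, and the demo runs on made-up pages rather than real ones.

```python
from bs4 import BeautifulSoup

PREV_BUTTON = "http://phdcomics.com/comics/images/prev_button.gif"


def parse_page(html):
    """Return (image_url, previous_page_url); either may be None."""
    # Repair the malformed comment terminators before parsing.
    soup = BeautifulSoup(html.replace("--!>", "-->"), "html.parser")
    link_tag = soup.find("link", {"rel": "image_src"})
    image_url = link_tag.get("href") if link_tag is not None else None
    img_tag = soup.find("img", {"src": PREV_BUTTON})
    anchor = img_tag.find_parent("a") if img_tag is not None else None
    prev_url = anchor.get("href") if anchor is not None else None
    return image_url, prev_url


def walk_back(start_url, fetch, limit=5):
    """Follow 'previous' links from start_url and collect image URLs.

    fetch is any callable that maps a URL to its HTML text.
    At most `limit` pages are visited.
    """
    images, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < limit:
        seen.add(url)
        image_url, url = parse_page(fetch(url))
        if image_url:
            images.append(image_url)
    return images


# Offline demo with two made-up pages (hypothetical data).
FAKE_PAGES = {
    "page2": '<link href="img2.gif" rel="image_src"><!-- Comic Table --!>'
             '<a href="page1"><img src="' + PREV_BUTTON + '"></a><!-- End --!>',
    "page1": '<link href="img1.gif" rel="image_src"><p>first comic</p>',
}
print(walk_back("page2", lambda u: FAKE_PAGES[u]))
```

With the real site, fetch could be something like lambda u: requests.get(u).text, and each returned image URL could then be downloaded with one more requests.get call. The limit parameter is there so that a trial run does not walk the entire archive.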