Python, BeautifulSoup: finding an HTML fragment

I’m a newbie working through the web scraping examples in Automate the Boring Stuff. I’m trying to automatically download images from phdcomics with Python code that does the following:

  • Find the link to the image from the HTML and download it

  • From the HTML, find the link to the previous page and go there, repeat step 1 until the first page.

To download the image on the current page, I printed the HTML with soup.prettify(). The relevant snippet is as follows:

<meta content="Link to Piled Higher and Deeper" name="description">
 <meta content="PHD Comic: Remind me" name="title">
  <link 
href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
   <div class="jumbotron" style="background-color:#52697d; padding: 0em 0em 0em;  margin-top:0px; margin-bottom: 0px; background-image: url('http://phdcomics.com/images/bkg_bottom_stuff3.png'); background-repeat: repeat-x; ">
    <div align="center" class="container-fluid" style="max-width: 1800px; padding-left: 0px; padding-right:0px; ">

And then when I wrote

newurl=soup.find('link', {'rel': "image_src"}).get('href')

It gave me what I needed, which is

http://www.phdcomics.com/comics/archive/phd041218s.gif
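
For reference, this step can be reproduced offline. The following is a minimal sketch that assumes an inline copy of two of the header lines shown above instead of the live page, and assumes the standard-library "html.parser" backend. It also checks the result of find() before calling .get(), since find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Inline copy of two header lines from the snippet above (not fetched from the site).
html = """
<meta content="PHD Comic: Remind me" name="title">
<link href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when no tag matches, so guard before calling .get().
link_tag = soup.find("link", {"rel": "image_src"})
image_url = link_tag.get("href") if link_tag is not None else None
print(image_url)
```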

Next, I want to find the link to the previous page. I think it is in the following part of the HTML:

<!-- Comic Table --!>
        <table border="0" cellspacing="0" cellpadding="0">
          <tr> 
            <td align="right" valign="top">
            <a href=http://phdcomics.com/comics/archive.php?comicid=2004><img height=52 width=49 src=http://phdcomics.com/comics/images/prev_button.gif border=0 align=middle ><br></a><font 
                face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>previous </b></i></font><br><br><a href=http:// phdcomics.com/comics/archive.php?comicid=1995><img src=http://phdcomics.com/comics/images/jump_bck10.gif border=0></a><br><a href=http:// phdcomics.com/comics/archive.php?comicid=2000><img src=http://phdcomics.com/comics/images/jump_bck5.gif border=0></a><br><font face=Arial, Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>jump</b></i></font><br><br><a href=http://phdcomics.com/comics/ archive.php?comicid=1><img src=http://phdcomics.com/comics/images/first_button.gif border=0 align=middle><br></a><font face=Arial,Helvetica,Geneva ,Swiss,SunSans-Regular size=-1><i><b>first</b></i></font><br><br>               </td>
            <td align="center" valign="top"><font color="black"> 

From this part of the HTML, I want to extract

http://phdcomics.com/comics/archive.php?comicid=2004

as the link to the previous page.
When I try something like this:

Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
print(Prevlink)

it gives me this error:

Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'

Even when I search by the href itself:

Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)

I get a similar error –

Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'

What is the right way to get the correct ‘href’?

Solution

The problem is in how comments are written in the HTML of PhD Comics.
If you look closely at the output of soup.prettify(), you’ll notice comments like this:

<!-- Comic Table --!>

when it should be

<!-- Comic Table -->

This malformed comment terminator causes BeautifulSoup to miss certain tags. There are several ways to parse and remove comments, such as regular expressions or BeautifulSoup’s Comment class, but they can be difficult to get working in this case. The easiest way is to fix the comment terminator after fetching the HTML and before parsing it.

from bs4 import BeautifulSoup
import requests

url = "https://phdcomics.com/"
r = requests.get(url)
data = r.text
data = data.replace("--!>", "-->")  # fix the malformed comment terminators
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
Prevlink = soup.find('a', {'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
# Output:
# http://phdcomics.com/comics/archive.php?comicid=2004
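
The same fix can be tried without network access. The sketch below is an illustration only: it assumes a shortened, hand-written copy of the comic table markup (with attribute values quoted for readability) and the standard-library "html.parser" backend. It applies the comment fix before parsing and guards against find() returning None.

```python
from bs4 import BeautifulSoup

# Shortened, hand-written copy of the markup from the question,
# including the malformed comment terminator. Attribute quotes were added.
raw_html = """
<!-- Comic Table --!>
<table><tr><td>
<a href="http://phdcomics.com/comics/archive.php?comicid=2004"><img src="http://phdcomics.com/comics/images/prev_button.gif" border="0"></a>
</td></tr></table>
"""

# Repair the comment terminator before handing the text to the parser.
fixed_html = raw_html.replace("--!>", "-->")
soup = BeautifulSoup(fixed_html, "html.parser")

link = soup.find("a", {"href": "http://phdcomics.com/comics/archive.php?comicid=2004"})
prev_link = link.get("href") if link is not None else None
print(prev_link)
```

Whether the unfixed markup actually hides the link depends on the parser backend and its version, so the sketch only relies on the behavior after the fix.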

Update:
To find the previous-page link automatically, without hardcoding the comic id, we need to find the img tag whose src is http://phdcomics.com/comics/images/prev_button.gif and take the href of its parent tag.

img_tag = soup.find('img', {'src': 'http://phdcomics.com/comics/images/prev_button.gif'})
print(img_tag.find_parent().get('href'))
# Output:
# http://phdcomics.com/comics/archive.php?comicid=2005
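
Putting the pieces together, the loop described in the question (collect the image link, follow the previous-page link, repeat) might look roughly like the sketch below. This is not a tested scraper for the real site. The names parse_page, collect_image_urls, the fetch parameter, and max_pages are hypothetical, and the two tiny pages in fake_site are made up so the example runs without network access. It assumes the walk ends when a page has no previous button, which may not match how the real first page behaves, hence the extra page limit.

```python
from bs4 import BeautifulSoup

PREV_BUTTON = "http://phdcomics.com/comics/images/prev_button.gif"


def parse_page(html):
    """Return (image_url, previous_page_url); either may be None."""
    # Fix the malformed comment terminators before parsing.
    soup = BeautifulSoup(html.replace("--!>", "-->"), "html.parser")

    link_tag = soup.find("link", {"rel": "image_src"})
    image_url = link_tag.get("href") if link_tag is not None else None

    prev_url = None
    img_tag = soup.find("img", {"src": PREV_BUTTON})
    if img_tag is not None:
        anchor = img_tag.find_parent("a")  # the <a> wrapping the button
        if anchor is not None:
            prev_url = anchor.get("href")
    return image_url, prev_url


def collect_image_urls(start_url, fetch, max_pages=10):
    """Follow previous-page links from start_url and collect image URLs.

    fetch is any callable that maps a URL to HTML text (hypothetical hook;
    with requests it could be: lambda u: requests.get(u).text).
    """
    urls, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        image_url, url = parse_page(fetch(url))
        if image_url:
            urls.append(image_url)
    return urls


# Two made-up pages standing in for the real site.
fake_site = {
    "page2": '<link href="img2.gif" rel="image_src"><!-- Comic Table --!>'
             '<a href="page1"><img src="%s"></a>' % PREV_BUTTON,
    "page1": '<link href="img1.gif" rel="image_src"><!-- Comic Table --!>',
}
result = collect_image_urls("page2", fake_site.get)
print(result)
```

Passing fetch in as a parameter keeps the parsing logic testable offline. The seen set and max_pages guard against looping forever if a page links back to itself.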
