Python regular expression: match a line based on what it ends with?

This is the HTML I want to scrape:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

I tried several variants:

match = re.compile('^(.+?) <br \/><a href="https://www.somelink.com(.+?)" >', re.DOTALL).findall(html)

I want to match lines both with and without the <p> tag. The <p> tag only appears in the first instance. I'm fairly new to Python and rusty, and after searching here and on Google I found nothing that is quite the same. Thank you for your help. I really appreciated the help I received the last time I was stuck.

The expected output is an index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>

Solution

The Beautiful Soup and requests modules are a better fit for this kind of task than regular expressions, as the commenter above noted.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme (https://)
site_data = requests.get(html_site)  # downloads the page into a requests Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a BeautifulSoup object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns them as a list

This is just a simple snippet that selects all the 'a' tags from the HTML page and stores them in a list, each in the format shown above. I recommend working through a bs4 tutorial and reading the official Beautiful Soup documentation for more detail.
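If you would still rather use a regular expression on the small sample from the question, here is a minimal sketch. It is not the approach recommended above, and it assumes the markup always looks like the sample: each title line ends in `<br />` and is immediately followed by its link line. The names `TITLE_LINK_RE` and `extract_pairs` are made up for this illustration. The key points are `re.MULTILINE`, which lets `^` match at the start of every line, and the optional `(?:<p>)?` group, which handles lines with and without the leading tag.

```python
import re

# Sample HTML copied from the question.
SAMPLE_HTML = """        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
"""

# Hypothetical pattern for this sketch:
#   ^\s*(?:<p>)?          start of a line, optional indentation, optional <p>
#   (.+?)<br />           the title, up to the first <br /> on that line
#   \s*<a href="([^"]+)"  the href of the link on the following line
# DOTALL is deliberately left out so that '.' cannot run across lines.
TITLE_LINK_RE = re.compile(
    r'^\s*(?:<p>)?(.+?)<br />\s*<a href="([^"]+)"',
    re.MULTILINE,
)


def extract_pairs(html):
    """Return a list of (title, url) tuples found in html."""
    return TITLE_LINK_RE.findall(html)


if __name__ == "__main__":
    for title, url in extract_pairs(SAMPLE_HTML):
        print(title, url)
```

From the `(title, url)` pairs you can build the index lines in whatever layout you need. For anything beyond a fixed, well-known snippet like this one, an HTML parser is likely to be more robust than a pattern.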
