Python regular expression: how do I match a line based on what it ends with?
This is what I'm trying to scrape:
<p>Some.Title.html<br /> <a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br /> Some.Title.txt<br /> <a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
I tried several variants:
match = re.compile('^(.+?) <br \/><a href="https://www.somelink.com(.+?)" >', re.DOTALL).findall(html)
I want to match lines both with and without the <p> tag. The <p> tag only appears in the first instance. I'm terrible at Python and rusty on top of that, and after searching here and on Google, nothing seems to match my case exactly. Thank you for your help. I really appreciated the help I received here the last time I was stuck.
The expected output is an index:
<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a> <a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>
The Beautiful Soup and requests modules are a better fit for this kind of task than regular expressions, as the commenter above noted.
import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme
site_data = requests.get(html_site)  # downloads the page into a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them
This is just a simple snippet that selects all the <a> tags from the HTML page and stores them in a list. I recommend checking here for a great bs4 tutorial and here for the actual documentation.
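To get from the list of <a> tags to the title/link pairs the question asks for, something like the following might work. It is only a sketch: it runs on the sample HTML from the question rather than a downloaded page, it assumes each title is the nearest non-empty text before its link, and the names SAMPLE_HTML, pair_titles_with_links and build_index are made up for illustration. Because it walks the parsed tree, it does not matter whether the <p> tag is there or not.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question (assumption: the real page looks like this).
SAMPLE_HTML = (
    '<p>Some.Title.html<br /> '
    '<a href="https://www.somelink.com/yep.html" rel="nofollow">'
    'https://www.somelink.com/yep.html</a><br /> '
    'Some.Title.txt<br /> '
    '<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">'
    'https://www.somelink.com/yeppers.txt</a><br />'
)


def pair_titles_with_links(html):
    """Return (title, url) tuples, pairing each link with the text before it."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for a_tag in soup.find_all('a', href=True):
        # Walk backwards through the siblings until a non-blank text node turns up.
        for sibling in a_tag.previous_siblings:
            if isinstance(sibling, NavigableString) and sibling.strip():
                pairs.append((sibling.strip(), a_tag['href']))
                break
    return pairs


def build_index(pairs):
    """Format the pairs like the index in the question (title in href, URL as text)."""
    return ' '.join(f'<a href="{title}">{url}</a>' for title, url in pairs)


pairs = pair_titles_with_links(SAMPLE_HTML)
print(build_index(pairs))
```

For a live page, pass site_data.text to pair_titles_with_links instead of SAMPLE_HTML. If the titles on the real page are not direct siblings of the links, the sibling walk would need adjusting.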