Python – Data scraping across divs

I’m trying to extract information from repeated rows that contain many nested divs. I’m trying to write a crawler that starts at this page and pulls out the various elements. For some reason, I can’t find a way to target, by class, the tag that contains each row of information, and I can’t isolate the parts I need to extract. For reference, here is an example of a single row:

<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
    <div class="row team-event-result team-result">
        <div class="col-md-12 main-info">
            <div class="row">
                <div class="col-md-7 event-name">
                    <dl>
                        <dt>Team Number:</dt> 
                        <dd><a href="/team-event-search/team?program=JFLL&amp;year=2017&amp;number=11733" class="result-name">11733</a></dd>
                        <dt>Team:</dt> 
                        <dd> Aqua Duckies</dd>
                        <dt>Program:</dt> 
                        <dd>FIRST LEGO League Jr.</dd>
                    </dl>
                </div>

The script I started building looks like this:

from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})

Whenever I run len(rows), the result is always 0. I seem to have hit a wall here. Thanks for any help!

Solution

The content of this page is generated dynamically, so you need a browser automation tool such as Selenium. The following script should get what you want. Give it a try:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')

# Parse the rendered page source, after the JavaScript has populated the results
soup = BeautifulSoup(driver.page_source, "lxml")

for items in soup.select('.main-info'):
    # Pair each <dt> label with its matching <dd> value in the event-name block
    docs = ' '.join([' '.join([item.text, ' '.join(val.text.split())]) for item, val in zip(items.select(".event-name dt"), items.select(".event-name dd"))])
    # Collapse the whitespace in the address block to a single line
    location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
    print("Event_Info: {}\nEvent_Location: {}\n".format(docs, location))

driver.quit()

The result is similar to:

Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA

Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
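
One caveat: because the results are injected by JavaScript, driver.page_source can be read before the rows have finished rendering, which would leave the soup empty again. A minimal sketch of an explicit wait that could be added before grabbing the page source (assuming the same .main-info selector and an arbitrary 10-second timeout):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one result block is present in the DOM (at most 10 seconds);
# after this it is safe to hand driver.page_source to BeautifulSoup as above.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".main-info"))
)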
