Python – Scrapy spider outputs empty csv files

This is my first question here, and I’m teaching myself to code, so please be patient.

I’m working on my CS50 final project, and I’m trying to build a website that aggregates online Spanish courses from edx.org and possibly other open online course sites. I’m using the Scrapy framework to scrape the filtered results for Spanish courses on edx.org. This is my first Scrapy spider, and I’m trying to have it follow each course link and then get the course name (and, once the code works, the description, course URL, and more).

from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader

class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()

class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')

        yield item.load_item()

When I run the spider with “scrapy runspider edxSpider.py -o edx.csv -t csv”, I get an empty csv file. I also suspect the spider is not reaching the correct Spanish course results.

Basically, I want to go into each course listed at this link (edx Spanish courses) and get the name, description, provider, page URL, and image URL.

Any ideas what the problem could be?

Solution

You can’t get the edx content with a simple request: the site uses JavaScript rendering to fetch the course elements dynamically. CrawlSpider won’t work in this case, because you need to find a specific element in the response body and use it to generate a new request that returns the data you need.
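One quick way to confirm this is to check whether the element targeted by the XPath is present in the raw HTML at all. The sketch below assumes the page source is already available as a string (for example, saved with the `scrapy fetch` command). The names `IdFinder` and `has_static_element` and the sample markup are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any tag in the document carries the wanted id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_static_element(html, element_id):
    """Return True if element_id appears as an id attribute in the raw HTML."""
    finder = IdFinder(element_id)
    finder.feed(html)
    return finder.found


# Hypothetical sample: a page whose content is filled in later by JavaScript
raw_html = '<html><body><div id="app"></div><script>render()</script></body></html>'
print(has_static_element(raw_html, "course-intro-heading"))
print(has_static_element(raw_html, "app"))
```

If the check comes back negative for the id used in the XPath, the selector has nothing to match in the downloaded page, which would be consistent with an empty output file.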

The real request (the URL that returns the courses) is this one, but you need to generate it from the previous response body (although you could also just request it directly and get the right data).

So, to generate a real request, you need data inside the script tag:

from scrapy import Spider
import re
import json

class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        # Grab the inline script that carries the Drupal.settings payload
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        # Cut out the JSON object that follows "Drupal.settings, "
        parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...

Now that you have what you need in json_data, just build the request URL string from it.
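As a rough sketch of that last step, the snippet below runs the same kind of regular expression on a made-up script body and assembles a URL from the parsed settings. The keys `api_base` and `query`, the sample values, and the helper names `extract_settings` and `build_url` are assumptions for illustration; the real keys have to be read from the actual Drupal.settings payload.

```python
import json
import re
from urllib.parse import urlencode


def extract_settings(script_text):
    """Pull the JSON object that follows 'Drupal.settings, ' out of a script body."""
    match = re.search(r'Drupal\.settings, ({.+})', script_text)
    if match is None:
        raise ValueError("Drupal.settings payload not found")
    return json.loads(match.group(1))


def build_url(settings):
    """Build a request URL from the hypothetical 'api_base' and 'query' keys."""
    return settings["api_base"] + "?" + urlencode(settings["query"])


# Made-up script text; the real page's structure and keys will differ
sample = (
    'jQuery.extend(Drupal.settings, '
    '{"api_base": "https://example.com/api/courses", '
    '"query": {"language": "Spanish"}});'
)
url = build_url(extract_settings(sample))
print(url)
```

Inside the spider, the resulting URL would then be passed to `scrapy.Request(url, callback=...)`, with the callback parsing whatever that endpoint returns.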
