Scrapy spider outputs empty CSV files
This is my first question here, and I’m teaching myself to code, so please be patient.
I’m working on my CS50 final project, and I’m trying to build a website that aggregates online Spanish courses from edx.org and possibly other open online course sites. I’m using the Scrapy framework to scrape the filtered results for Spanish courses on edx.org. This is my first Scrapy spider, and I’m trying to get it to follow each course link and then extract the course name (and the description, course URL, and more once I get the code right).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader


class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()


class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']
    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')
        yield item.load_item()
When I run the spider with “scrapy runspider edxSpider.py -o edx.csv -t csv”, I get an empty CSV file, and I also suspect the spider is not reaching the correct Spanish course results.
Basically, I want to go into each course listed at this link (edX Spanish courses) and get the name, description, provider, page URL, and image URL.
Any ideas on what the problem could be?
Solution
You can’t get edX content with a simple request: the site uses JavaScript rendering to fetch the course elements dynamically, so CrawlSpider won’t work in this case. You need to find a specific element in the response body and use it to generate a new request that returns the data you need.
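One way to confirm this for yourself is to check whether the element your XPath targets exists at all in the raw HTML that a plain request returns. The sketch below uses only the standard library; the `raw_html` string is a made-up stand-in for a JavaScript-rendered page, not real edX markup, and `has_element_id` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_id(html, target_id):
    """Return True if the static HTML contains an element with this id."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found


# Hypothetical page body: the course markup is absent until scripts run.
raw_html = (
    '<html><body><div id="app"></div>'
    '<script>/* course elements are rendered later */</script>'
    '</body></html>'
)

heading_present = has_element_id(raw_html, "course-intro-heading")
app_present = has_element_id(raw_html, "app")
```

If the id you are selecting is missing from the raw response, the XPath can never match and the item comes back empty, which is consistent with an empty CSV.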
The real request (the URL that returns the courses) is this one, but you need to generate it from the previous response body (although you can also just access it directly and get the right data).
So, to generate the real request, you need the data inside the script tag:
from scrapy import Spider
import re
import json


class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        # Grab the <script> element that holds the Drupal settings object.
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        # Capture the JSON object literal that follows "Drupal.settings, ".
        parseable_json_data = re.search(r'Drupal\.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...
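The extraction step above can be tried outside Scrapy to see what the regex captures. This is a minimal sketch: `sample_script` and its keys are invented for illustration and only imitate the general `jQuery.extend(Drupal.settings, {...});` shape, and `extract_drupal_settings` is a hypothetical helper. It also guards against a missing script or a non-matching pattern, both of which would raise an exception in the spider as written.

```python
import json
import re

# Hypothetical script element; the real page's keys and values will differ.
sample_script = (
    '<script>jQuery.extend(Drupal.settings, '
    '{"basePath": "/", "edx": {"api_url": "https://example.org/api"}});</script>'
)


def extract_drupal_settings(script_text):
    """Return the settings dict embedded in the script, or None if absent."""
    if script_text is None:
        # extract_first() returns None when the XPath matches nothing.
        return None
    # Greedy match from the first "{" after the marker to the last "}" on the line.
    match = re.search(r'Drupal\.settings, (\{.+\})', script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


settings = extract_drupal_settings(sample_script)
missing = extract_drupal_settings('<script>var x = 1;</script>')
```

The greedy `.+` assumes the whole object sits on a single line and that no later `}` follows it in the same script; if the page layout differs, the pattern would need adjusting.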
Now that you have what you need in json_data, just build the URL string.
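As a rough sketch of that last step, assuming the settings expose a base URL, a path, and some query parameters: the key names `api_base`, `api_path`, and `query` below are hypothetical placeholders, so inspect the real `json_data` to find the fields the site actually provides.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical stand-in for the parsed settings; real key names must be
# read from the page itself.
json_data = {
    "api_base": "https://example.org/",
    "api_path": "api/courses",
    "query": {"language": "Spanish", "page": 1},
}


def build_request_url(data):
    """Join the base and path, then append the URL-encoded query string."""
    base = urljoin(data["api_base"], data["api_path"])
    return base + "?" + urlencode(data["query"])


url = build_request_url(json_data)
```

Inside the spider you would then yield a new request for that URL, for example `yield scrapy.Request(url, callback=self.parse_courses)`, where `parse_courses` is a hypothetical callback that reads the course fields from the response.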