Python – How to integrate Scrapy web crawler with Luigi data pipeline?

How to integrate a Scrapy web crawler with a Luigi data pipeline? Here is a solution to the problem.

How to integrate Scrapy web crawler with Luigi data pipeline?

(Long-time user && first question && nervously asked) is true

I’m currently building a Python backend that will be deployed to a single AWS EC2 instance with the following architecture:


| Data source       | Temporary storage | Data processing       | DB      |

Web crawler data --> saved to S3 --\
API data ----------> saved to S3 ---==> Luigi data pipeline --> MongoDB


As shown above, we have different ways to get the data (API requests, Scrapy web crawlers, etc.), but the tricky part is coming up with a simple, fault-tolerant way to feed the received data into the Luigi data pipeline.

Is there a way to integrate the output of a web crawler into the Luigi data pipeline? If not, what’s the best way to bridge the gap between HTTP data getters and Luigi tasks?

Any suggestions, documentation, or articles would be appreciated! And if you need more details, I will provide them as soon as possible.

Thanks!

Solution

I’ve never used Luigi, but I do use Scrapy. I guess the real question is: how do you notify Luigi, in a reasonable way, that there is new data to process?

You can learn from a similar problem here: When a new file arrives in S3, trigger luigi task. Maybe you even work in the same place :).
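
I have never run Luigi myself, so treat the following as an untested sketch rather than working code. The bucket/key, the marker path, and the Mongo database and collection names are all placeholders, and it assumes the crawler writes JSON-lines files to S3. The idea from that linked question boils down to an ExternalTask that represents the file on S3, plus a downstream task that loads it into MongoDB:

    import json
    import os

    import luigi
    import pymongo
    from luigi.contrib.s3 import S3Target


    class CrawlResultOnS3(luigi.ExternalTask):
        """Represents a file that the crawler (or API fetcher) dropped on S3."""
        s3_key = luigi.Parameter()   # e.g. s3://my-bucket/crawls/some-run.jl

        def output(self):
            return S3Target(self.s3_key)


    class LoadIntoMongo(luigi.Task):
        """Reads the S3 file and inserts its records into MongoDB."""
        s3_key = luigi.Parameter()
        mongo_uri = luigi.Parameter(default='mongodb://localhost:27017')

        def requires(self):
            return CrawlResultOnS3(s3_key=self.s3_key)

        def output(self):
            # simple local marker so Luigi knows this key was already loaded
            return luigi.LocalTarget('/tmp/%s.loaded' % os.path.basename(self.s3_key))

        def run(self):
            with self.input().open('r') as fh:
                records = [json.loads(line) for line in fh if line.strip()]
            if records:
                client = pymongo.MongoClient(self.mongo_uri)
                client['crawl_db']['items'].insert_many(records)
            with self.output().open('w') as marker:
                marker.write('done')


    if __name__ == '__main__':
        luigi.run()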

I highly recommend hosting your spider in scrapyd and using scrapyd-client to drive it. If you try to run Scrapy inside another tool that also uses the Twisted library (I’m not sure whether Luigi does), all sorts of hairy problems pop up. I would use scrapyd-client to drive the spider and have the spider POST to a trigger URL that tells Luigi to start its task somehow.
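
To make the scrapyd part concrete, here is roughly what driving the spider looks like if you talk to scrapyd's schedule.json endpoint directly (the project and spider names and the scrapyd URL are placeholders); scrapyd hands back a jobid you can keep for later:

    import requests

    SCRAPYD_URL = 'http://localhost:6800'   # scrapyd's default port


    def launch_spider(project, spider, **spider_args):
        """Schedule a crawl on scrapyd and return the jobid it assigns."""
        resp = requests.post(SCRAPYD_URL + '/schedule.json',
                             data=dict(project=project, spider=spider, **spider_args))
        resp.raise_for_status()
        return resp.json()['jobid']


    jobid = launch_spider('myproject', 'candidate_spider')
    print('scrapyd is running job', jobid)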

Again, because I haven’t used Luigi, I don’t know the details there… but you don’t want to sit there busy-polling to check whether the work is done.

I have a Django web application: I launch a spider, store the jobid from scrapyd-client, get a JSON “tap on the shoulder” when it’s done, and then use Celery and Solr to ingest the data.
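
As a rough sketch of that “tap on the shoulder” (the TRIGGER_URL setting and the extension itself are made up for illustration, not lifted from my actual setup), you can hook Scrapy’s spider_closed signal and POST a small JSON payload when the crawl finishes:

    import json
    import urllib.request

    from scrapy import signals


    class NotifyOnClose:
        """Scrapy extension that POSTs a JSON notification when the spider closes."""

        def __init__(self, trigger_url):
            self.trigger_url = trigger_url

        @classmethod
        def from_crawler(cls, crawler):
            url = crawler.settings.get('TRIGGER_URL', 'http://localhost:8082/trigger/')
            ext = cls(url)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider, reason):
            body = json.dumps({'spider': spider.name,
                               'jobid': getattr(spider, 'jobid', None),
                               'reason': reason}).encode()
            req = urllib.request.Request(self.trigger_url, data=body,
                                         headers={'Content-Type': 'application/json'},
                                         method='POST')
            urllib.request.urlopen(req)

The extension would be registered through the EXTENSIONS setting in settings.py, and whatever listens at the trigger URL kicks off the Luigi task.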

Edit to include the pipeline code from the comment below; here it is wrapped up as an item pipeline’s process_item method (the class name is just illustrative):

    import base64
    import json
    import os
    import urllib.request
    from http.cookiejar import CookieJar

    class PdfPostPipeline:
        """Scrapy item pipeline that POSTs each downloaded PDF to an ingest API."""

        def process_item(self, item, spider):
            # rootdir, fname and lname are defined elsewhere in the original pipeline
            for fentry in item['files']:
                # open and read the downloaded file
                with open(os.path.join(rootdir, fentry['path']), 'rb') as fh:
                    pdf = fh.read()

                # just in case the API needs cookies
                cj = CookieJar()
                opener = urllib.request.build_opener(
                    urllib.request.HTTPCookieProcessor(cj))

                # set the content type
                headers = {'Content-Type': 'application/json'}

                # fill out the JSON body (ensure_ascii=False could be passed if needed)
                json_body = json.dumps({
                    'uid': 'indeed-' + item['indeed_uid'],
                    'url': item['candidate_url'],
                    'first_name': fname,
                    'last_name': lname,
                    'pdf': base64.b64encode(pdf).decode(),
                    'jobid': spider.jobid,
                }).encode()

                # send the POST and read the result
                request = urllib.request.Request(
                    'http://localhost:8080/api/someapi/',
                    data=json_body, headers=headers, method='POST')
                response = opener.open(request)

            return item
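
For completeness: a pipeline class like the one above would be enabled in the Scrapy project’s settings.py; the module path and priority here are placeholders, not something from the original question.

    # settings.py (hypothetical module path and priority)
    ITEM_PIPELINES = {
        'myproject.pipelines.PdfPostPipeline': 300,
    }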
