Java – MapReduce and download files from external sources

I have a project that requires files to be downloaded from external sources in a distributed manner. We’ve invested heavily in Hadoop and want to take advantage of MapReduce, though more as a distributed task runner than as an ETL tool.

1) Has anyone done this before?

2) Should there be only one Mapper and no Reducer?

3) What is the best way to pass an abstract implementation of an FTP/HTTP connection to the mapper? To be clear: I want a good way to unit test this without integration testing, so I need a way to emulate FTP/HTTP connections.

4) Is MapReduce the best way to handle this kind of thing, or are we abusing MapReduce?
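Regarding question 3, one common pattern is to hide the transport behind a small interface and inject it into the download logic, so unit tests can substitute an in-memory fake for a real FTP/HTTP connection. The sketch below uses only hypothetical names (`Fetcher`, `FakeFetcher`, `download`) to illustrate the idea; in a real job, the concrete `Fetcher` would typically be chosen in `Mapper.setup()` from the URL scheme or a configuration property.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FetcherSketch {

    /** Minimal abstraction over FTP/HTTP retrieval (hypothetical interface). */
    interface Fetcher {
        byte[] fetch(String url) throws IOException;
    }

    /** In-memory fake for unit tests -- no network involved. */
    static class FakeFetcher implements Fetcher {
        private final Map<String, byte[]> contents = new HashMap<>();

        void put(String url, byte[] body) {
            contents.put(url, body);
        }

        @Override
        public byte[] fetch(String url) throws IOException {
            byte[] body = contents.get(url);
            if (body == null) {
                throw new IOException("no such url: " + url);
            }
            return body;
        }
    }

    /**
     * The logic a Mapper would delegate to. It only sees the Fetcher
     * interface, so tests can inject the fake above instead of a real
     * HTTP or FTP client. Here it just returns the byte count; a real
     * job would write the bytes to HDFS.
     */
    static int download(Fetcher fetcher, String url) throws IOException {
        return fetcher.fetch(url).length;
    }
}
```

A production implementation of `Fetcher` could wrap `HttpURLConnection` (or an FTP client library) without any changes to the mapper-side logic.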

Thank you.

Solution

This sounds similar to what Apache Nutch does (although I’m not deeply familiar with Nutch beyond that).

Some observations:

  • If you have multiple URLs hosted by the same server, you might actually benefit from partitioning by hostname and then pulling the content in the Reducer (depending on how many URLs you pull per host).
  • If the content is “cacheable” and you will be pulling from the same URLs repeatedly, you will probably benefit from putting a cache/proxy server between your Hadoop cluster and the internet (your company or ISP may already be doing this). If every URL is unique or the content is dynamic, however, the cache/proxy server becomes a bottleneck and can actually hinder you.
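The partition-by-hostname idea can be sketched as plain Java (hypothetical class and method names; in a real job this logic would live in a custom `Partitioner` set via `Job.setPartitionerClass`). Hashing the hostname sends every URL from one server to the same partition, so a single reducer can reuse one connection per host and apply per-host rate limiting.

```java
import java.net.URI;

public class HostPartitionerSketch {

    /** Map a URL to a partition by hashing its hostname. */
    static int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null;  // unparseable URL
        }
        if (host == null) {
            host = "";  // malformed URLs all fall into one partition
        }
        // Mask off the sign bit before taking the modulus, the same trick
        // Hadoop's HashPartitioner uses to avoid negative partition numbers.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With this scheme, `http://example.com/a` and `http://example.com/b` land in the same partition, while the per-partition load depends on how evenly your URLs are spread across hosts.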
