Java – Nutch on EMR: problem reading from S3

Hello, I’m trying to run Apache Nutch 1.2 on Amazon’s EMR.
To do this, I specified an input directory from S3. I get the following error:

Fetcher: java.lang.IllegalArgumentException:
    This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
    does not support access to the request path 
    's3n://crawlResults2/segments/20110823155002/crawl_fetch'
    You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this myself, I would call FileSystem.get(uri, conf), but I'm trying to use the existing Nutch code.

I asked this question elsewhere and was told that hadoop-site.xml needed to be modified to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml doesn't exist), but it made no difference. Does anyone have any other ideas?
Thanks for your help.

Solution

Try specifying the following in hadoop-site.xml (or core-site.xml if hadoop-site.xml doesn't exist):

<property>
  <name>fs.default.name</name>
  <value>s3n://crawlResults2</value>
</property>

This tells Nutch to use S3 as the default file system.
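The default matters because, as the error message suggests, the file system is being obtained from the configuration (FileSystem.get(conf)) rather than from the path's own URI (FileSystem.get(uri, conf)). Below is a minimal, stdlib-only Java sketch of the kind of scheme/authority check that produces this error. It is a simplified model, not Hadoop's actual code; the class and method names (DefaultFsCheck, supports) are hypothetical.

```java
import java.net.URI;

// Simplified model of the check behind
// "This file system object (...) does not support access to the request path".
// NOT Hadoop's implementation; names here are hypothetical.
class DefaultFsCheck {

    // Returns true if `path` could be served by a file system rooted at `fsUri`.
    static boolean supports(URI fsUri, URI path) {
        if (path.getScheme() == null) {
            return true; // scheme-less paths resolve against the file system itself
        }
        if (!path.getScheme().equalsIgnoreCase(fsUri.getScheme())) {
            return false; // e.g. an s3n:// path handed to an hdfs:// file system
        }
        if (path.getAuthority() == null) {
            return true; // same scheme, no bucket/host given on the path
        }
        return path.getAuthority().equalsIgnoreCase(fsUri.getAuthority());
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://ip-11-202-55-144.ec2.internal:9000");
        URI s3 = URI.create("s3n://crawlResults2");
        URI segment = URI.create("s3n://crawlResults2/segments/20110823155002/crawl_fetch");

        // Default file system is HDFS: the S3 path is rejected.
        System.out.println(supports(hdfs, segment)); // false
        // Default file system is the S3 bucket: the same path is accepted.
        System.out.println(supports(s3, segment));   // true
    }
}
```

In code you control, the usual way around this check is to derive the file system from the path itself, for example with FileSystem.get(path.toUri(), conf); changing the default file system is the workaround when the calling code cannot be modified.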

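The properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey need to be specified only if your S3 objects require authentication (an S3 object is either accessible to all users or only to authenticated ones). For s3n:// paths such as the one in the question, the corresponding property names are fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey.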
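Since the question reports that editing core-site.xml made no difference, it can help to confirm which of these properties the file actually defines. The following is a rough, stdlib-only Java sketch that parses a Hadoop-style configuration document and lists the required properties that are missing or empty. The class name SiteConfigCheck, its methods, and the sample values are hypothetical; it does not use or replace Hadoop's own Configuration class.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <property><name>/<value> pairs from a
// Hadoop-style site file and reports which required names are absent.
class SiteConfigCheck {

    // Parses the XML and returns name -> value in document order.
    static Map<String, String> readProperties(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = childText(p, "name");
                if (name != null) {
                    result.put(name, childText(p, "value"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    // Returns the required property names that are missing or empty.
    static List<String> missing(Map<String, String> props, String... required) {
        List<String> out = new ArrayList<>();
        for (String key : required) {
            String value = props.get(key);
            if (value == null || value.isEmpty()) {
                out.add(key);
            }
        }
        return out;
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Sample content with placeholder values; the secret key is left out on purpose.
        String xml = "<configuration>"
                + "<property><name>fs.default.name</name><value>s3n://crawlResults2</value></property>"
                + "<property><name>fs.s3.awsAccessKeyId</name><value>YOUR_ACCESS_KEY_ID</value></property>"
                + "</configuration>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Map<String, String> props = readProperties(in);
        System.out.println(missing(props,
                "fs.default.name", "fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"));
    }
}
```

To check a real cluster, the same readProperties method could be given a FileInputStream for the core-site.xml that the job actually loads; where that file lives depends on the installation.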
