NULL pointer exception when trying to get data from S3 using pyspark
When I try to get data from S3 using pyspark, I get a null pointer exception. I’m running Spark 1.6.1 with Hadoop 2.4.
I tried using both s3n and s3a.
Also try setting the configuration by:
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "aws-key")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "aws-secret-key")
Make sure that the bucket has the permissions of the authenticated user.
>>> myRDD = sc.textFile("s3n://aws-key:aws-secret-key@my-bucket/data.csv-000").count()
16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 157.2 KB, free 1755.2 KB)
16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 17.0 KB, free 1772.2 KB)
16/11/10 18:37:50 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:61806 (size: 17.0 KB, free: 510.9 MB)
16/11/10 18:37:50 INFO SparkContext: Created broadcast 10 from textFile at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 1004, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 995, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 869, in fold
vals = self.mapPartitions(func).collect()
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm. PythonRDD.collectAndServe(self._jrdd.rdd())
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala. Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala. Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala. Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j. Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j. GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Solution
It’s unclear what caused the failure; ine where the exception was raised doesn’t show anything obvious.
My suggestion is to switch to s3a, which is the S3 connector we are currently maintaining in our ASF project; S3n is left alone as a 100% bug-for-bug backward compatible connector.
s3a won’t work because it’s not in Hadoop-2.4; It came with Hadoop-2.6 and reached production-ready status with Hadoop 2.7.1. Capture a version of Spark built against it and you should get a better view of your life. And if not: you can submit bug reports against issues.apache.org that are not turned off as WONTFIX.
Postscript. If you have already set properties in your configuration, you do not need to include AWS user:secret in the URL; This will help keep your secret out of the logs.