Python – NULL pointer exception when trying to get data from S3 using pyspark

NULL pointer exception when trying to get data from S3 using pyspark

When I try to get data from S3 using pyspark, I get a null pointer exception. I'm running Spark 1.6.1 with Hadoop 2.4.
I tried using both s3n and s3a.
I also tried setting the configuration with:

# Use the native S3 filesystem for the s3:// scheme and supply the s3n credentials
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "aws-key")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "aws-secret-key")

I made sure that the authenticated user has permissions on the bucket.
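
As a sanity check outside Spark, the credentials can be verified against the object directly. The snippet below is only an illustration: boto3 is not part of the original setup, and the bucket and key names simply mirror the example URL.

# Illustrative check (boto3 assumed available): confirm the credentials can
# actually read the object before involving Spark at all.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="aws-key",
    aws_secret_access_key="aws-secret-key",
)
# Raises botocore.exceptions.ClientError (403/404) if access or the key is wrong.
s3.head_object(Bucket="my-bucket", Key="data.csv-000")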

>>> myRDD = sc.textFile("s3n://aws-key:aws-secret-key@my-bucket/data.csv-000").count()

16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 157.2 KB, free 1755.2 KB)
16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 17.0 KB, free 1772.2 KB)
16/11/10 18:37:50 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:61806 (size: 17.0 KB, free: 510.9 MB)
16/11/10 18:37:50 INFO SparkContext: Created broadcast 10 from textFile at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__

File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

Solution

It’s unclear what caused the failure; the line where the exception was raised doesn’t show anything obvious.

My suggestion is to switch to s3a, which is the S3 connector we are currently maintaining in the ASF Hadoop project; s3n is left alone as a 100% bug-for-bug backwards-compatible connector.

s3a won’t help here, though, because it isn’t in Hadoop 2.4; it arrived in Hadoop 2.6 and reached production-ready status in Hadoop 2.7.1. Grab a version of Spark built against Hadoop 2.7.x and things should look much better. And if not, you can file bug reports on issues.apache.org that won’t be closed as WONTFIX.
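
For reference, on a Spark build that bundles Hadoop 2.7.x (with the hadoop-aws and AWS SDK jars on the classpath), the s3a setup looks roughly like the sketch below. The property names are the standard s3a credential properties; the key, secret, bucket and file name are just the placeholders from the question.

# Rough s3a sketch, assuming a Spark build against Hadoop 2.7.x with hadoop-aws available
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "aws-key")
hadoopConf.set("fs.s3a.secret.key", "aws-secret-key")

# Note the s3a:// scheme and no credentials embedded in the URL
myRDD = sc.textFile("s3a://my-bucket/data.csv-000")
print(myRDD.count())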

Postscript: if you have set the properties in your configuration, you do not need to include the AWS user:secret pair in the URL; this helps keep your secrets out of the logs.
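
A minimal sketch of that, reusing the s3n properties from the question, with the secrets kept out of the URL:

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "aws-key")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "aws-secret-key")

# Credentials come from the configuration, so the URL carries no secrets
myRDD = sc.textFile("s3n://my-bucket/data.csv-000")
print(myRDD.count())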
