Python – How to connect to HDFS using pyarrow in python

How to connect to HDFS using pyarrow in Python: here is a solution to the problem.

How to connect to HDFS using pyarrow in python

I have pyarrow installed and want to connect to an HDFS file in a Hadoop cluster. I have the following line, and it gives me an error.

 fs = pa.hdfs.connect(host='...', port=50057, user='...', kerb_ticket='/tmp/krb5cc_0')

This is the error message I received

ArrowIOError: Unable to load libhdfs

How do I install libhdfs? What other dependencies or settings do I need?

Solution

pyarrow.hdfs.connect(host='default', port=0, user=None, kerb_ticket=None, driver='libhdfs', extra_conf=None)

You must make sure that libhdfs.so is in $HADOOP_HOME/lib/native as well as in $ARROW_LIBHDFS_DIR.

For Hadoop, the contents of $ARROW_LIBHDFS_DIR look like this:

bash-3.2$ ls $ARROW_LIBHDFS_DIR
examples libhadoop.so.1.0.0 libhdfs.a libnativetask.a
libhadoop.a libhadooppipes.a libhdfs.so libnativetask.so
libhadoop.so libhadooputils.a libhdfs.so.0.0.0 libnativetask.so.1.0.0

As far as I know, the latest Hadoop release is 3.2.0.
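Putting the environment setup together, here is a minimal sketch of what the connection can look like. The Hadoop paths, host, port, and user below are placeholders, not values from the question; adjust them for your own cluster.

    import os
    import pyarrow as pa

    # Point Arrow at the directory that contains libhdfs.so
    # (placeholder path; use your own Hadoop installation).
    os.environ['HADOOP_HOME'] = '/opt/hadoop'
    os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'

    # Sanity check: libhdfs.so must actually be in that directory.
    libhdfs = os.path.join(os.environ['ARROW_LIBHDFS_DIR'], 'libhdfs.so')
    assert os.path.exists(libhdfs), 'libhdfs.so not found at ' + libhdfs

    # Depending on the cluster, libhdfs may also need the Hadoop jars on the
    # CLASSPATH, e.g. the output of `hadoop classpath --glob`.

    # Placeholder host/port/user; pass kerb_ticket only if you use Kerberos.
    fs = pa.hdfs.connect(host='namenode-host', port=8020, user='hdfs-user',
                         kerb_ticket='/tmp/krb5cc_0')

If connect() still raises "Unable to load libhdfs", re-check those two environment variables first.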

You can also load native shared libraries through DistributedCache, which distributes the library files and creates symbolic links to them.

This example shows how to distribute a shared library, mylib.so, and load it from a MapReduce task (see the Hadoop native libraries documentation for more information).

  1. First copy the library to HDFS: bin/hadoop fs -copyFromLocal mylib.so.1 /libraries/mylib.so.1

  2. The job launcher should contain the following:

    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile("hdfs://host:port/libraries/mylib.so.1#mylib.so", conf);

  3. The MapReduce task can contain: System.loadLibrary("mylib.so");

Note: If you downloaded or built the native Hadoop library, you do not need to use DistributedCache to make the library available to your MapReduce tasks.
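Once libhdfs loads and the connection succeeds, you can work with cluster files through the returned filesystem object. A short usage sketch (the paths are placeholders):

    # List the HDFS root and read one file through the connected filesystem.
    print(fs.ls('/'))

    with fs.open('/user/someuser/data.csv', 'rb') as f:  # placeholder path
        contents = f.read()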
