How to connect to HDFS using pyarrow in Python
I have pyarrow installed and want to connect to an HDFS file in a Hadoop cluster. The following line gives me an error:
fs = pa.hdfs.connect(host='...', port=50057, user='...', kerb_ticket='/tmp/krb5cc_0')
This is the error message I received:
ArrowIOError: Unable to load libhdfs
How do I install libhdfs? What other dependencies or settings do I need?
Solution
pyarrow.hdfs.connect(host='default', port=0, user=None, kerb_ticket=None, driver='libhdfs', extra_conf=None)
You must make sure that libhdfs.so is in $HADOOP_HOME/lib/native and that the $ARROW_LIBHDFS_DIR environment variable points to the directory containing it.

For Hadoop:
bash-3.2$ ls $ARROW_LIBHDFS_DIR
examples libhadoop.so.1.0.0 libhdfs.a libnativetask.a
libhadoop.a libhadooppipes.a libhdfs.so libnativetask.so
libhadoop.so libhadooputils.a libhdfs.so.0.0.0 libnativetask.so.1.0.0
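As a minimal sketch (the paths, host, port, and user below are placeholder assumptions you must adapt to your cluster), the environment can be set from Python before connecting. Note that libhdfs also needs a JVM and the Hadoop CLASSPATH; the legacy pyarrow.hdfs client derives the latter by running hadoop classpath --glob when CLASSPATH is unset:

import os
import pyarrow as pa

# Hypothetical locations; point these at your own installation
os.environ['HADOOP_HOME'] = '/opt/hadoop'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'   # libhdfs runs inside a JVM
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'      # directory containing libhdfs.so
# pyarrow shells out to `hadoop classpath --glob` when CLASSPATH is unset,
# so the hadoop binary must be reachable on PATH
os.environ['PATH'] = os.environ['HADOOP_HOME'] + '/bin:' + os.environ['PATH']

fs = pa.hdfs.connect(host='namenode.example.com', port=50057,
                     user='someuser', kerb_ticket='/tmp/krb5cc_0')
print(fs.ls('/'))  # smoke test: list the HDFS root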
As far as I know, the latest Hadoop release is 3.2.0.
You can use DistributedCache to distribute native shared libraries and symlink the library files. This example, from the Hadoop native-libraries documentation, shows how to distribute a shared library, mylib.so, and load it from a MapReduce task.
First copy the library to HDFS:
bin/hadoop fs -copyFromLocal mylib.so.1 /libraries/mylib.so.1
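Equivalently, here is a hypothetical sketch of the same upload done through the pyarrow client created earlier (the fs handle), using the legacy HadoopFileSystem.upload method:

with open('mylib.so.1', 'rb') as f:
    fs.upload('/libraries/mylib.so.1', f)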
The job launcher should contain the following:
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile("hdfs://host:port/libraries/mylib.so.1#mylib.so", conf);
The MapReduce task can contain:
System.loadLibrary("mylib.so");
Note: If you downloaded or built the native Hadoop library, you do not need to use DistributedCache to make the library available to your MapReduce tasks.